Compiling Apertium morphological dictionaries with HFST and using them in HFST applications\footnotepubrightsThis article was published in saltmil workshop in LREC 2011 in Malta. Original version \urlhttp://ixa2.si.ehu.es/saltmil/.

Tommi A Pirinen    Francis M. Tyers
University of Helsinki
   Universitat d’Alacant
FI-00014 University of Helsinki Finland
   E-03071 Alacant Spain
\urltommi.pirinen@helsinki.fi
   \urlftyers@dlsi.ua.es
Last modifications: September 29, 2017
Abstract

In this paper we aim to improve interoperability and re-usability of the morphological dictionaries of Apertium machine translation system by formulating a generic finite-state compilation formula that is implemented in HFST finite-state system to compile Apertium dictionaries into general purpose finite-state automata. We demonstrate the use of the resulting automaton in FST-based spell-checking system.
Keywords: finite-state, dictionary, spell-checking

1 Introduction

Finite-state automata are one of the most effective format for representing natural language morphologies in computational format. The finite-state automata, once compiled and optimised via process of minimisation are very effective for parsing running text. This format is also used when running morphological dictionaries in machine-translation system Apertium [3]11\urlhttp://www.apertium.org. In this paper we propose a generic compilation formula to compile the dictionaries into weighted finite state automata for use with any FST tool or application. We implement this system using a free/libre open-source finite-state API HFST [7]22\urlhttp://hfst.sf.net. HFST is a general purpose programming interface using a selection of freely-available finite-state libraries for the handling of finite-state automata.

While Apertium uses the dictionaries and the finite-state automata for machine translation, HFST is used in multitude of other applications ranging from basic morphological analysis [7] to end-user applications such as spell-checking [10] and predictive text-entry for mobile phones [13]. In this article we show how to generate automatically a spell-checker from an Apertium dictionary and evaluate roughly the usability of the automatically generated spell-checker.

The rest of the article is laid out as follows: In section 2 we describe the generic compilation formula for the HFST-based compilation of Apertium dictionaries and the formula for induction of spell-checkers error model from Apertium’s dictionary. In section 3 we introduce the Apertium dictionary repository and the specific dictionaries we use to evaluate our systems. In section 4 we evaluate speed and memory usage of compilation and application of our formula against Apertium’s own system and show that our system has roughly same coverage and explain the differences arise from.

2 Methods

The compilation of Apertium dictionaries is relatively straight-forward. We assume here standard notations for finite-state algebra. The morphological combinatorics of Apertium dictionaries are defined in following terms: There is one set of root morphs (finite strings) and arbitrary number of named sets of affix morphs called pardefs. Each set of affix morphs is associated with a name. Each morph can also be associated with a paradigm reference pointing to a named subset of affixes. As an example, a language of singular and plural of cat and dog in English would be described by root dictionary consisting of morphs cat and dog, both of which point on the right-hand side to pardef named number. The number affix morphs are defined then as set of two morphs, namely s for plural marker and empty string for singular marker.

Each morph can be compiled into single-path finite-state automaton33the full formula allows any finite-state language as morph, compiled from regular expressions, the extension to this is trivial but for readability we present the formula for string morphs containing the actual morph as string of UTF-8 arcs m. The morphs in the root dictionary are extended from left or right sides by joiner markers iff they have a pardef definition there and each affix dictionary is extended on the left (for suffixes) or right (for prefixes) by the pardef name marker. In the example of cats, dogs language this would mean finite state paths c a t NUMBER, d o g NUMBER, NUMBER s and NUMBER ϵ, where ϵ as usual marks zero-length string44In the current implementation we have used temporarily a special non-epsilon marker as this decreases the local indeterminism and thus compilation time. These sets of roots and affixes can be compiled into disjunction of such joiner delimited morphs. Now, the morphotactics can be defined as related to joiners by any such path that contains joiners only as pairs of adjacent identical paradigm references, such as c a t NUMBER NUMBER s or d o g NUMBER NUMBER ϵ, but not c a t NUMBER d o g NUMBER or NUMBER s NUMBER s. The finite-state formula for this morphotactics is defined by

Mx=(Σxpxx), (1)

where p is set of pardef names and Σ the set of symbols in morphs not including the set of pardef names. Now the final dictionary is simply composition of these morphotactic rules over the repetion of affixes and roots:

(MaMr)Mx, (2)

where Ma is the disjunction of affixes with joiners, Mr the disjunction of roots with joiners, and Mx the morphotactics defined in formula 1. This is a variation of morphology compilation formula presented in various HFST documentation, such as [7].

2.1 Implementation Details

There are lot of finer details we will not thoroughly cover in this article, as they are mainly engineering details. In this section we shortly summarise specific features of HFST-based FST compilation that result in meaningful differences in automaton structure or working. One of the main source of differences is that HFST automata are two-sided and compiled only ones from the source code whereas Apertium generates two different automata for analysis and generation. In these automata the structure may be different, since Apertium dictionaries have ways of marking morphs limited to generation or analysis only, so they will only be included in one of the automatons. Our approach to this is to use special symbols called flag-diacritics [1] to limit the paths as analysis only or generation only on runtime, but still including all paths in the one transducer that gets compiled.

Another main difference in processing comes from the special word-initial, word-final and separate morphs that in Apertium are contained in separate automata altogether, but HFST tools do not support use of multiple automata for analysis, so these special morphs will be concatenated optionally to beginning or end of the word, or disjuncted to the final automata respectively. These special morphs include things like article l’ in French as bound form.

2.2 Creating a Spell-Checker Automatically

To create a finite-state spell-checker we need two automata, one for the language model, for which the dictionary compiled as described earlier will do, and one for the error model [10]. A classic baseline error model is based on the edit distance algorithm [6, 2], that defines typing errors of four types: pressing extra key (insertion), not pressing a key (deletion), pressing wrong key (change) and pressing two keys in wrong order (swap). There have been many finite-state formulations of this, we use the one defined in [12, 10]. The basic version of this where the typing errors of each sort have equal likelihood for each letters can be induced from the compiled language model, and this is what we use in this paper. The induction of this model is relatively straightforward; when compiling the automaton, save each unique UTF-8 codepoint found in the morphs55The description format of Apertium requires declaration of exemplar character set as well, but as this is only used in the tokenisation algorithm [4] , which is not going to be used, we induce the set from the morphs. For each character generate the identities in start and end state to model correctly typed runs. For each of the error types the generate one arc from initial state to the end state modelling that error, except for swap which it requires one auxiliary state for each character pair.

3 Materials

The Apertium project hosts a large number of morphological dictionaries for each of the languages translated. From these we have selected three dictionaries to be tested: Basque from Basque-Spanish pair as it is released dictionary with the biggest on-disk size, Norwegian Nynorsk from the Norwegian pair as a language that has some additional morphological complexity, such as compounding, and Manx from as a language that currently lacks spell-checking tools to demonstrate the plausibility of automatic conversion of Apertium dictionary into a spell-checker66We also provide a Makefile script to recreate results of this article for any language in Apertium’s repository.

To evaluate the use of resulting morphological dictionaries and spell-checkers we use following Wikipedia database dumps77\urlhttp://download.wikipedia.org/: euwiki-20120219-pages-articles.xml.bz2, nnwiki-20120215-pages-articles.xml.bz2, and gvwiki-20120215-pages-articles.xml.bz2. For the purpose of this article we performed very crude cleanup and preprocessing to Wikipedia data picking up the text elements of the article and discarding most of Wikipedia markup naïvely88For details see the script in \urlhttp://hfst.svn.sourceforge.net/viewvc/hfst/trunk/lrec-2011-apertium/..

4 Test Setting and Evaluation

To get one view on differences made by generic compilation formula instead of direct automata building used by Apertium we look at the created automata, this will also give us a rough idea of what its efficiency might be. In table 1 we give the counts of nodes and edges, in that order, in the graphs compiled from the dictionaries. Note, that in case of Apertium it is the sum of all the separate automata states and edges that is counted. The small differences in sizes of graphs are mostly caused by the different handling of generation vs. analysis mode. The difference in sizes of automata on disk in is shown in table 2. The size of HFST automata can be attributed to the clever compression algorithm used by HFST [14].

Lang. Apertium LR Apertium RL HFST
Basq. 30,114 34,005 34,824
59,321 68,030 68,347
Norg. 56,226 55,722 56,871
138,217 132,475 139,259
Manx 13,055 12,955 12,920
28,220 27,062 27,031
Table 1: Size of HFST-based system against original (count of nodes first, then edges)
Lang. Apertium LR Apertium RL HFST
Basq. 252 KiB 289 KiB 1,7 MiB
Norg. 558 KiB 535 KiB 3,7 MiB
Manx 108 KiB 110 KiB 709 KiB
Table 2: Size of HFST-based system against original (as B on disk)

To test efficiency we measure times of running various tasks. The times and memory usage have been measured using GNU time utility and getrusage system call’s ru_utime field, averaged over three test runs. The tests were performed on quad-core Intel Xeon E5450 @ 3.00 GHz with 64 GiB of RAM.

First we measure speed of analysing a full corpus with the result automaton. The speed is measured in the table 3, in seconds to precision that was available in our system. Curiously the results do not give direct advantage to either of the system but it seems to depend on the language which system is a better choice for corpus analysis.

Language Apertium HFST
Basque 32.0 s 18.4 s
Norwegian 2.4 s 5.5 s
Manx 1.6 s 2.2 s
Table 3: Speed of HFST-based system against original in corpus analysis (as s in user time)

Similarly we measure the speed of current compilation process in table 4. In here there’s an obvious advantage to manual building of the automaton (see [11] for the precise algorithm used) over the finite-state algebra method, as is in line with earlier results for lexc building in [8].

Language Apertium time HFST time
Basque 35.7 s 160.0 s
Norwegian 6.6 s 200.2 s
Manx 0.8 s 11.2 s
Table 4: Speed of HFST-based system against original in compilation (as seconds of user time)

Finally we evaluate the usability of dictionaries meant for machine translation as spell-checkers by running the finite-state spell checkers we produced automatically through a large corpus and show the measure both speed and quality of the results. The errors were automatically generated to Wikipedia text’s correct words using simple algorithm that may generate one Levenshtein error per each character position at probability of 133. This test shows only rudimentary results on the plausibility of using machine translation dictionary for spell-checking; for more thorough evaluation of efficiency of finite-state spell-checking see [5].

Language Speed (words/sec)
Basque 7,900
Norwegian 9,200
Manx 4,700
Table 5: Efficiency of spelling correction in artificial test setup, average over three runs.

5 Conclusions

In this article we have shown a general formula to compile morphological dictionaries from machine-translation system Apertium in generic FST system of HFST and using the result in HFST-based application of spell-checking.

6 Future Work

In this article we showed a basic method to gain more inter-operability between generic FST system of HFST and a specialised morphological dictionary writing formalism of machine-translation system Apertium by implementing a generic compilation formula to compile the language descriptions. In future research we are leveraging this and other related formulas into automatic optimisation of the final automata using the information present in the language description to optimise instead of relying generic graph algorithms for the final minimised result automata.

We demonstrated importing the compiled dictionary as a language model and inducing error model for real-world spell-checking applications. Further development in this direction should aim for interoperable formalisms, formats and mechanisms for language models and end applications of all relevant language technology tools.

Acknowledgements

We thank the HFST and Apertium contributors for fruitful internet relayed chats, and the two anonymous reviewers for their helpful suggestions.

References

  • [1] K. R. Beesley and L. Karttunen (2003) Finite state morphology. CSLI publications. External Links: ISBN 978-1575864341 Cited by: 2.1.
  • [2] F. J. Damerau (1964) A technique for computer detection and correction of spelling errors. Commun. ACM (7). Cited by: 2.2.
  • [3] M. L. Forcada, M. Ginestí-Rosell, J. Nordfalk, J. O’Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, F. Sánchez-Martínez, G. Ramírez-Sánchez and F. M. Tyers (2011-07) Apertium: a free/open-source platform for rule-based machine translation. Machine Translation. External Links: ISSN 0922-6567, Link, Document Cited by: 1.
  • [4] A. Garrido-Alenda, M. L. Forcada and R. C. Carrasco (2002) Incremental construction and maintenance of morphological analysers based on augmented letter transducers. In Proceedings of TMI 2002 (Theoretical and Methodological Issues in Machine Translation, Keihanna/Kyoto, Japan), pp. 53–62. Cited by: 2.2.
  • [5] A. Hassan, S. Noeman and H. Hassan (2008) Language independent text correction using finite state automata. In Proceedings of the Third International Joint Conference on Natural Language Processing, Vol. 2, pp. 913–918. Cited by: 4.
  • [6] V. I. Levenshtein (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics—Doklady 10, 707–710. Translated from Doklady Akademii Nauk SSSR, pp. 845–848. Cited by: 2.2.
  • [7] K. Lindén, M. Silfverberg, E. Axelson, S. Hardwick and Pirinen (2011) HFST—framework for compiling and applying morphologies. In Systems and Frameworks for Computational Morphology, C. Mahlow and M. Pietrowski (Eds.), Communications in Computer and Information Science, Vol. Vol. 100, pp. 67–85. External Links: ISBN 978-3-642-23137-7 Cited by: 1, 1, 2.
  • [8] K. Lindén, M. Silfverberg and T. Pirinen (2009) HFST tools for morphology—an efficient open-source package for construction of morphological analyzers. See Workshop on systems and frameworks for computational morphology, sfcm 2009, zürich, switzerland, september 2009, proceedings, Mahlow and Piotrowski, pp. 28–47. Cited by: 4.
  • [9] C. Mahlow and M. Piotrowski (Eds.) (2009) Workshop on systems and frameworks for computational morphology, sfcm 2009, zürich, switzerland, september 2009, proceedings. Lecture Notes in Computer Science, Vol. 41, Springer. External Links: ISBN 978-3-642-04130-3 Cited by: 8.
  • [10] T. A. Pirinen and K. Lindén (2010) Finite-state spell-checking with weighted language and error models. In Proceedings of the Seventh SaLTMiL workshop on creation and use of basic lexical resources for less-resourced languagages, Valletta, Malta, pp. 13–18. External Links: Link Cited by: 1, 2.2.
  • [11] S. O. Rojas, M. L. Forcada and G. R. Sánchez (2005) Construcción y minimización eficiente de transductores de letras a partir de diccionarios con paradigmas. Procesamiento del Lenguaje Natural (35), pp. 51–57. Cited by: 4.
  • [12] K. Schulz and S. Mihov (2002) Fast string correction with levenshtein-automata. International Journal of Document Analysis and Recognition 5, pp. 67–85. Cited by: 2.2.
  • [13] M. Silfverberg, M. Hyvärinen and T. Pirinen (2011) Improving predictive entry of finnish text messages using irc logs. pp. 69–76. Cited by: 1.
  • [14] M. Silfverberg and K. Lindén (2009-13 July) HFST runtime format—a compacted transducer format allowing for fast lookup. See Pre-proceedings of the eighth international workshop on finite-state methods and natural language processing (fsmnlp 2009), pretoria, south africa, july 21st - 24th 2009, Watson et al., External Links: Link Cited by: 4.
  • [15] B. Watson, D. Courie, L. Cleophas and P. Rautenbach (Eds.) (2009-13 July) Pre-proceedings of the eighth international workshop on finite-state methods and natural language processing (fsmnlp 2009), pretoria, south africa, july 21st - 24th 2009. , Vol. , . Note: Handout CD-ROM containing the accompanying papers for the presentations during the FSMNLP 2009 workshop. Published by the University of Pretoria, Pretoria, South Africa. External Links: ISBN 978-1-86854-743-2 Cited by: 14.