Development and Use of Computational Morphology of Finnish in the Open Source and Open Science Era: Notes on Experiences with Omorfi Development

(The official publication in 2016, from doc format version, will differ quite a bit in formatting.)

Tommi A Pirinen
Dublin City University
ADAPT—School of Computing
Abstract

This article describes Omorfi, a contemporary system for the computational modelling of the morphology of Finnish word-forms. Its purpose is to present new developments in the morphological analysis of Finnish, and the open development model behind them, to a linguistic audience. We also describe some of the advances over earlier analysers. The article presents Omorfi as a full-fledged, stable system for real-world use in linguistic research and computational linguistics applications, rather than as a new technological feature in a limited experiment. We describe results from experiments with the system in a large-scale linguistic analysis setting. We also present a selection of real-world use cases for morphological analysers that have affected the development of Omorfi.

Our system is based fully on free and open source data and methods. This has opened new possibilities for lexicographical data collection using the crowd-sourced dictionary Wiktionary, as well as for statistical modelling with, e.g., Wikipedia data. One of the aims of this article is to describe the successful use of the community-driven development model in Omorfi, including crowd-sourcing as well as independent expert contributions.

We evaluated our analyser to give a rough idea of its usefulness and applications in linguistic work. To do this, we processed large text corpora with Omorfi and verified that only 1 % of the word-forms remain unknown to the system. Omorfi was also verified to match the FinnTreeBank 3.1 standard analyses at 93 % faithfulness per token count.


2 Introduction

Computational morphological models and the management of lexicographical data are central components of most computational applications of linguistic analysis. Computational morphology of the Finnish language was first described some 30 years ago (Koskenniemi, 1983). The aim of this article is to present Omorfi (https://github.com/flammie/Omorfi/) as a matured scientific project involving contributions from the scientific community as well as crowd-sourced lexicographical additions: a full-fledged project for managing a lexicographical database on the one hand and its natural language parser on the other. We discuss our approach to lexicography and parser building in collaboration with crowds and experts. On the technical side, we highlight some of the new features in the parser, especially from the point of view of linguists and end users. The new features of the system at large that we bring into focus in this article are two: the inter-operation of statistical and rule-based parsing methods, and the open development model.

This article records the state of the art in the morphological analysis of Finnish. To keep the system overview interesting and usable, we highlight only the long-term design goals of the system rather than the transitional and volatile features of fast-moving software developed by a base of open-source and language enthusiasts. (For up-to-date documentation of implementation details and rapidly changing features, see the project web site: https://github.com/flammie/omorfi/wiki)

The scientific advances within the development of the various features of Omorfi have been documented in scientific publications in various fora. The main advance over previous systems is the introduction of a statistical language parsing component (cf. Manning and Schütze, 1999), including its combination with the traditional rule-based model. The novelty of this article lies not in the individual experiments that have gone into Omorfi but in a large-coverage system composed of the state-of-the-art results in computational morphology using weighted finite-state and related technologies. This is, to our knowledge, one of the only ongoing, mature, high-coverage, statistical and rule-based finite-state natural language parsers developed and used jointly by scientists, engineers and open source contributors via crowd-sourcing.

One notable practical distinction of our system is its licensing policy. The Omorfi analyser is a free and open source product. In contemporary computational linguistics, freeness of systems and data is rightly seen as a cornerstone of properly conducted science, as it fulfils the requirement of repeatability by not setting unnecessary fences around the repetition of scientific results. There is a large base of recent research supporting this; specifically for Finnish, the latest is by Koskenniemi (2008). For computer-savvy end users this means that the tools needed to perform linguistic analysis with Omorfi can be downloaded to and used on any average PC. There is also an installation in the data centres of CSC, maintained by CSC staff (http://www.csc.fi/english/research/sciences/linguistics), that is available to researchers.

3 Prior and Related Work

Omorfi is based on the tradition of finite-state morphologies, a theoretical framework laid out by Koskenniemi (1983). While our implementation is not directly related and was written from scratch, Omorfi was created in the context of the University of Helsinki, in parallel with a project to update, open-source and maintain the software necessary to build systems akin to the original two-level morphology (Lindén et al., 2012; http://hfst.sf.net). Omorfi has its roots in a master’s thesis project (Pirinen, 2008) based on the then newly released open source word list from Kotus (Nykysuomen sanalista, http://kaino.kotus.fi/sanat/nykysuomi). From a typical single-author project of that time, Omorfi has become a large-coverage multi-author project with crowd-sourced lexical data sources.

Many of the scientific advances made by research groups in the language technology department of the University of Helsinki have directly or indirectly affected Omorfi. The research on sub-word n-gram models (Lindén and Pirinen, 2009, 2009) has been transferred to Omorfi’s compound disambiguation schemes. The methodology for semi-automatic lexical data harvesting, e.g. by Lindén (2008), has been highly influential in gathering the huge lexical database of Omorfi. Finally, the work on coupling statistical and rule-based approaches for disambiguation (Pirinen, 2015), based on a grammar and a parsing approach by Karlsson et al. (1995), is included in recent versions of Omorfi.

There have been competing and complementary approaches to the computational parsing of Finnish. For example, in machine learning, Durrett and DeNero (2013) (we thank the anonymous reviewer for bringing this recent research to our attention) show that learning from Wiktionary data yields an analyser whose recall in predicting inflected word-forms is in the range of 83–87 %. However, their goal was to learn to predict the 28 forms per noun and 53 forms per verb of Wiktionary’s example inflection tables, and they performed only an intrinsic evaluation on held-out Wiktionary pages. Our approach to Wiktionary data is to collect the lexemes and their inflectional patterns as already confirmed and written down by human language users (in our opinion, trying to machine-learn data that is already available and verified by humans is not very useful) and to use hand-written rules to inflect them, which yields virtually 100 % recall (bar bugs in our code) for the full paradigms. For this reason, it is hard to compare the two approaches directly. On the other hand, statistical language parsing systems have been built on top of Omorfi that go far beyond the parsing capabilities of a morphological parser, such as the universal dependency parser of Finnish (Pyysalo et al., 2015).

One source of development in related work is applications: Omorfi has been used in many real-world scientific applications that handle Finnish, for example spell-checking (Pirinen, 2014), language generation (Toivanen et al., 2012), machine translation (Clifton and Sarkar, 2011; Rubino et al., 2015) and statistical language modelling (Haverinen et al., 2013; Bohnet et al., 2013). Beyond adding lexical data and statistical models, the vast array of applications has required Omorfi to adopt software engineering best practices in order to keep the different end applications usable. This is one of the key developments we wish to highlight in this article: continuous development through the cooperation of computer scientists, linguists and the general public via crowd-sourcing is, as far as we know, unique and underdocumented for a free and open-source project as long-running as Omorfi. Development by linguists and language technologists has been studied, e.g. by Maxwell and David (2008), and we have done our best to adapt and extend it to the large open source development setting described in this article.

4 Methods

The implementation of our analyser follows the traditional work on finite-state morphology by Beesley and Karttunen (2003). On top of that we have applied recent extensions from finite-state morphology research, such as weighted finite-state methods (Allauzen et al., 2007; Lindén et al., 2012). In practice this means basic unigram probabilities of word-forms, estimated from a corpus and composed over the analyser (using the finite-state algebra operation of composition, which is well defined for weighted automata). Finally, the probabilities are used in conjunction with constraint grammar rules (Karlsson et al., 1995) to disambiguate. This brings the traditional rule-based language analyser closer to the statistical language analysers that are widely popular for morphologically less complex languages. A diagram of the combination is shown in figure 1. The figure is a much simplified version of the real implementation, meant only to show how a few forms of selected words interact in the system; the statistical component also omits known compounds to simplify the presentation. The flow of the system is as follows: from the database we generate a rule-based analyser; the statistical data is counted from the corpora and applied over the automaton using the formula by Lindén and Pirinen (2009); the resulting automaton is used to analyse word-forms; and the sentence context is used by the constraint grammar to further select the best analyses.

Figure 1: Diagram of Omorfi technology showing a few example words (vesi ‘water’ and käsi ‘hand’) and forms in the database, analyser and statistical training. Not shown in the automaton, but also used, are the words putous ‘fall’ and jakaja ‘divider’, which demonstrate compound formation and probability calculations for vesiputous ‘waterfall’ and vedenjakaja ‘watershed’. In the finite-state representation, the double circle marks the end state, and the arrow leading away from the figure is cropped out of the example. The sub-strings in the automaton drawing were compacted to single transitions where possible.

The implementation of finite-state morphology in Omorfi is based on the arrangement of stems, stem variations and suffix morphs, without intermediate morphographemic processing. This relies on the word classification to include data about, for example, stem patterns and vowel harmony. The classified dictionary words are stripped of their varying stem parts; the remaining invariant stems are then concatenated with the stem variations, followed by all suffixes, and optionally extended by compounding. This is done using the standard finite-state morphology approach. E.g. in figure 1 we have the dictionary words vesi ‘water’ and käsi ‘hand’ with the stem invariants ve- and kä- respectively, the stem variations -si, -de-, …, and the suffixes: the empty suffix (nominative), -n (genitive, ‘water’s’), -ssä (inessive, ‘in water’), -stä (elative, ‘from water’), … and so forth. This simple concatenation forms some thousands of word-forms per dictionary word altogether, and loops back to new words for compounding where applicable.
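As a minimal sketch of this concatenation scheme, the following Python fragment builds a few forms of vesi and käsi from a stem invariant, a stem variation and a suffix. The data layout, the names STEM_INVARIANTS, SUFFIXES and generate_forms, and the restriction to four cases are assumptions made for illustration only; Omorfi compiles this kind of data into finite-state automata rather than Python dictionaries, and vowel harmony is simply baked into the suffix list here because both example words take front-vowel suffixes.

```python
# Hypothetical toy data following the vesi/käsi example in the text.
STEM_INVARIANTS = {"vesi": "ve", "käsi": "kä"}

# (case name, stem variation, suffix); both words share the same pattern.
SUFFIXES = [
    ("nominative", "si", ""),
    ("genitive", "de", "n"),
    ("inessive", "de", "ssä"),
    ("elative", "de", "stä"),
]


def generate_forms(lemma):
    """Concatenate stem invariant + stem variation + suffix for each case."""
    invariant = STEM_INVARIANTS[lemma]
    return {
        case: invariant + variation + suffix
        for case, variation, suffix in SUFFIXES
    }


if __name__ == "__main__":
    for lemma in STEM_INVARIANTS:
        print(lemma, generate_forms(lemma))
```

A real description needs many more stem variations, suffix chains and compounding loops, which is where the finite-state compilation pays off.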

The baseline statistical methods for morphological models are applied over the finite-state formulation within the same framework, as shown in the example in figure 1. The formulation we use is textbook unigram training (cf. Manning and Schütze, 1999): we obtain the likelihood $P(w)$ of the surface form $w$ by counting the occurrences $f(w)$ of the word-form in a corpus and dividing by the number of word-forms in the whole corpus, $CS$: $P(w) = \frac{f(w)}{CS}$. To get around the problem of zero probability for unseen word-forms, we use additive smoothing (Chen and Goodman, 1999), which estimates the frequency of each type as one larger than observed and the size of the corpus as larger by the number of types: $\hat{P}(w) = \frac{f(w) + 1}{CS + TC}$, where $TC$ is the type count. The acquired likelihoods are combined with the finite-state morphological analyser by producing a weighted finite-state automaton for the language model and composing it over the analyser. The result is a morphological analyser capable of producing both analyses and their likelihoods, as shown in the last frame of figure 1. (The availability of accurate probabilistic data in the analyser depends on the acquisition of suitably large corpora; the default build includes “toy” weights that estimate probabilities from linguistic insight.) The probability-weighted analysis can be combined with rule-based, probability-aware constraint grammars to produce robust disambiguating analysers (Pirinen, 2015).
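The unigram estimate with additive smoothing can be sketched in a few lines of Python. The function names are hypothetical, and turning a probability into a weight by taking its negative logarithm is an assumption (it is the usual convention in weighted finite-state tools), not something the formulas here prescribe.

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return a function giving the add-one smoothed likelihood of a word-form."""
    counts = Counter(tokens)
    corpus_size = sum(counts.values())  # CS: number of word-form tokens
    type_count = len(counts)            # TC: number of distinct word-forms

    def probability(word):
        # (f(w) + 1) / (CS + TC); unseen word-forms have f(w) = 0
        return (counts[word] + 1) / (corpus_size + type_count)

    return probability


def to_weight(probability):
    """Assumed convention: smaller weight means more likely analysis."""
    return -math.log(probability)


if __name__ == "__main__":
    p = train_unigram(["vesi", "vesi", "käsi"])
    for word in ("vesi", "käsi", "vedenjakaja"):
        print(word, p(word), to_weight(p(word)))
```

With negative-log weights, frequent word-forms receive small weights and unseen ones the largest, so the lowest-weight analysis is the most likely one under the unigram model.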

5 Data

There are a few freely available open resources for lexicographical data of Finnish. The first one we used is based on the lexicographical data of the dictionary of the Institute for the Languages of Finland, which has been available under the free software licence GNU LGPL since 2007; it is referred to here as Kotus (http://kaino.kotus.fi/sanat/nykysuomi). The second source of lexical data, acquired from the internet, is a free, open source database named Joukahainen (http://joukahainen.puimula.org). As a third source of lexical data we use the popular crowd-sourced Wiktionary project. We have also used data from FinnWordNet (Lindén and Carlson, 2010), gathered data from students and from various as yet unpublished projects of the University of Helsinki, and finally a number of contributors within the project have added word-forms and attributes specifically for Omorfi using semi-automatic and manual approaches. The current dictionary includes 424,259 lexemes, classified in over 17 categories, including semantic features such as biological gender and proper noun categories, as well as morphosyntactic features such as argument structures and defective paradigms. (The figures change nearly weekly; up-to-date information is on the project web site.)

6 Experimental Setup and Evaluation

In this section we evaluate Omorfi to give an impression of its usefulness in various tasks and of potential caveats when using it for linguistic research. For the evaluation we use only freely available corpora. The following corpora are included: the ebooks of Project Gutenberg (http://gutenberg.org), the data of the Finnish Wikipedia (http://fi.wikipedia.org), and the JRC Acquis corpus (http://ipsc.jrc.ec.europa.eu/index.php?id=198). For downloading and preprocessing these corpora we use freely available scripts (https://github.com/flammie/bash-corpora). The scripts retain most of the punctuation and white-space as is. The resulting corpus sizes are given in table 1. Some further tests were made with the fully tokenised and analysed FinnTreeBank (Voutilainen et al., 2012) version 3.1. The scripts used for this evaluation are part of the Omorfi source code and can be used by anyone.

Corpus       Tokens       Types
Gutenberg    36,743,872   1,590,642
Wikipedia    55,435,341   3,223,985
JRC Acquis   42,265,615   1,425,532
FTB 3.1      76,369,439   1,648,420
Table 1: Corpora used for evaluations. Tokens are all strings extracted from the corpus and types are unique strings; both include punctuation and some codified expressions such as URLs, addresses etc.

First we measure how large a proportion of the data consists of out-of-vocabulary items. This gives us the naive coverage, the share of items that receive an analysis, formally defined as $\mathrm{Coverage} = \frac{\mathrm{Analysed}}{\mathrm{Corpus\ size}}$. The results are presented in table 2 for all the corpora we have.

Corpus              Gutenberg   Wiki     JRC acquis   FTB 3.1
Coverage (tokens)   97.2 %      93.3 %   92.2 %       96.8 %
Coverage (types)    90.9 %      87.6 %   82.9 %       87.6 %
Table 2: Naive coverages when analysing common corpora
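The naive coverage figures can be computed along the following lines. This is only a sketch: the function name naive_coverage and the is_known callback are hypothetical stand-ins for running the analyser and checking whether it returns at least one analysis.

```python
from collections import Counter


def naive_coverage(tokens, is_known):
    """Return (token coverage, type coverage) as fractions between 0 and 1.

    `is_known` is a hypothetical callback that says whether the analyser
    produces at least one analysis for a word-form.
    """
    counts = Counter(tokens)
    analysed_tokens = sum(n for word, n in counts.items() if is_known(word))
    analysed_types = sum(1 for word in counts if is_known(word))
    return analysed_tokens / sum(counts.values()), analysed_types / len(counts)


if __name__ == "__main__":
    lexicon = {"vesi", "käsi"}
    corpus = ["vesi", "vesi", "vesi", "käsi", "nämät", "kauvan"]
    token_cov, type_cov = naive_coverage(corpus, lexicon.__contains__)
    print(f"tokens: {token_cov:.1%}, types: {type_cov:.1%}")
```

Because frequent word-forms are counted once per occurrence in token coverage but only once overall in type coverage, type coverage is typically the lower of the two, as in table 2.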

Faithfulness is measured as the proportion of equal analyses, formally $\mathrm{Faithfulness} = \frac{\mathrm{Matched}}{\mathrm{Correct} + \mathrm{Missing}}$. In table 3 we show the results for the FTB3.1 corpus and analyses, first as a proportion of all tokens in the data, then of unique tokens.

Corpus         Faithfulness
FTB (tokens)   93.3 %
FTB (types)    77.0 %
Table 3: The proportion of FTB3.1 analyses that Omorfi reproduces with an exact match, measured against the FTB3.1 reference corpus

The size and processing speed of the automata built from the data described in section 5, using the Debian-packaged HFST software version 3.8.3 (http://wiki.apertium.org/wiki/Prerequisites_for_Debian) on a Dell XPS 13 laptop, are given in table 4. The speed was averaged over three runs using the first 1 million tokens from Europarl.

Feature   Value
Size      22M
Speed     11,099 wps
Table 4: Size of the Omorfi analyser as measured by ls -lh, and speed of analysis using hfst-lookup in words per second (wps), averaged over three runs

This result is in line with previous research on speed of optimised finite-state automata in natural language processing by Silfverberg and Lindén (2009).

7 Discussion and Future Work

We have shown a mature, jointly developed open source natural language analyser that uses both rule-based and statistical analysis approaches and crowd-sourced lexicography development. The techniques of statistical language parsing in Omorfi are quite modest by modern standards. While the combination of statistical parsing and rule-based disambiguation has been shown to be usable for a range of NLP applications, it would be interesting to see how more representative corpora, used with different methods, affect the parsing quality of Omorfi. In particular, it would be interesting to see some development on an end-user application that requires high-quality disambiguated morphological analyses. We expect that development towards universally recognised and comparable linguistic resources by projects like Universal Dependencies will be crucial to the future development of Omorfi in the direction of state-of-the-art language processing.

One of the key components of the recent success of Omorfi, which this article also outlines, is its adaptability and usefulness across various end uses. While the number of end users suggests that it is in fact possible for independent researchers to use and develop Omorfi, it would be interesting to see how more linguists and lexicographers using Omorfi might improve the description as well as the quality of end applications.

7.1 Error Analysis

The coverage of the analyser is systematically around 98 %; this is virtually at the upper limit of reasonable results with the given corpora. This can be seen by analysing the errors, that is, the out-of-vocabulary word-forms left in the current corpora. For Wikipedia, we get markup codes such as amp and English words such as of and The. In the Gutenberg corpus we get, besides some missing proper nouns, archaic and dialectal forms such as nämät, kauvan and sitte. While these can be added to the analyser quite easily, the examples illustrate what is known as the Zipfian distribution of language data: rare word-forms and phenomena become ever rarer, so the effect of collecting and classifying further lexemes becomes insignificantly small (compare Manning (2011)). For applications requiring higher, potentially 100 %, coverage, guessing techniques, e.g. Mikheev (1997), should be investigated.
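The diminishing return from adding further lexemes can be illustrated with an idealised model. The sketch below assumes a pure Zipf law in which the frequency of the type at rank r is proportional to 1/r; this exponent, the function names and the sizes used are assumptions for illustration, not measurements from the corpora above.

```python
def harmonic(n):
    """Sum of 1/r for r = 1..n: unnormalised Zipf mass of the n top-ranked types."""
    return sum(1.0 / rank for rank in range(1, n + 1))


def zipf_token_coverage(lexicon_size, total_types):
    """Token coverage if the lexicon holds the `lexicon_size` most frequent types."""
    return harmonic(lexicon_size) / harmonic(total_types)


def gain_from_adding(extra, lexicon_size, total_types):
    """Increase in token coverage from adding `extra` next-most-frequent types."""
    return (zipf_token_coverage(lexicon_size + extra, total_types)
            - zipf_token_coverage(lexicon_size, total_types))


if __name__ == "__main__":
    total = 100_000  # hypothetical number of types in a corpus
    for size in (1_000, 10_000, 50_000):
        gain = gain_from_adding(1_000, size, total)
        print(f"lexicon {size:>6}: +1000 types adds {gain:.2%} token coverage")
```

Under this assumption the same effort of adding 1,000 types yields a smaller and smaller gain as the lexicon grows, which is the effect described above.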

The FTB3.1 evaluation is presented here as an example of customising Omorfi for an end user, and the faithfulness evaluations are based on a comparison against the output of the unknown, closed source commercial tagger behind FTB3.1. While we have mostly done our best to match the reference analyses, we have not degraded the quality of the analyser to match analyses that we view as bugs in the corpus. As an example of currently mismatched analyses: the most frequent mismatched word-forms, oli ‘was’ and olivat ‘were’, are annotated as present tense in the reference. We consider this incorrect and have not adopted that analysis. In the near future we will instead use a free and open source, human-verified reference corpus, such as UD Finnish (Pyysalo et al., 2015), to obtain a stable, high-quality evaluation.

8 Conclusion

In this article we present a new, fully open source Finnish morphological lexicon. We confirm that it is a full-fledged and mature lexical database that can be used as a baseline morphological analyser with large coverage, suitable for linguistic research as well as for external applications such as spelling correction and machine translation. We have shown some approaches that enable the use of modern natural language processing techniques, such as statistics, in conjunction with analysers built from our data, and paved a way forward for researchers interested in those topics. We also provide some easy-to-access ways for linguists and researchers to use and extend our database via publicly maintained servers and crowd-sourced web-based services.

References

  • C. Allauzen, M. Riley, J. Schalkwyk, W. Skut and M. Mohri (2007) OpenFst: a general and efficient weighted finite-state transducer library. In Proceedings of the Twelfth International Conference on Implementation and Application of Automata (CIAA 2007), Lecture Notes in Computer Science, Vol. 4783, pp. 11–23.
  • K. R. Beesley and L. Karttunen (2003) Finite state morphology. CSLI Publications. ISBN 978-1575864341.
  • B. Bohnet, J. Nivre, I. Boguslavsky, R. Farkas, F. Ginter and J. Hajič (2013) Joint morphological and syntactic analysis for richly inflected languages. (1), pp. 415–428.
  • S. F. Chen and J. Goodman (1999) An empirical study of smoothing techniques for language modeling. Computer Speech & Language 13 (4), pp. 359–393.
  • A. Clifton and A. Sarkar (2011) Combining morpheme-based machine translation with post-processing morpheme prediction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT ’11, Stroudsburg, PA, USA, pp. 32–42. ISBN 978-1-932432-87-9.
  • G. Durrett and J. DeNero (2013) Supervised learning of complete morphological paradigms. In Proceedings of the North American Chapter of the Association for Computational Linguistics.
  • K. Haverinen, J. Nyblom, T. Viljanen, V. Laippala, S. Kohonen, A. Missilä, S. Ojala, T. Salakoski and F. Ginter (2013) Building the essential resources for Finnish: the Turku dependency treebank. Language Resources and Evaluation, pp. 1–39. ISSN 1574-020X.
  • F. Karlsson, A. Voutilainen, J. Heikkilä and A. Anttila (1995) Constraint grammar: a language-independent system for parsing unrestricted text. Vol. 4, De Gruyter Mouton.
  • K. Koskenniemi (1983) Two-level morphology: a general computational model for word-form recognition and production. Ph.D. Thesis, University of Helsinki.
  • K. Koskenniemi (2008) How to build an open source morphological parser now. Resourceful Language Technology: Festschrift in Honor of Anna Sågvall Hein, pp. 86–.
  • K. Lindén and T. Pirinen (2009) Weighting finite-state morphological analyzers using HFST tools. In FSMNLP 2009, B. Watson, D. Courie, L. Cleophas and P. Rautenbach (Eds.). ISBN 978-1-86854-743-2.
  • K. Lindén and T. Pirinen (2009) Weighted finite-state morphological analysis of Finnish compounds. In Nodalida 2009, K. Jokinen and E. Bick (Eds.), NEALT Proceedings, Vol. 4.
  • K. Lindén, E. Axelson, S. Drobac, S. Hardwick, M. Silfverberg and T. A. Pirinen (2012) Using HFST for creating computational linguistic applications. In Proceedings of Computational Linguistics - Applications, 2012.
  • K. Lindén and L. Carlson (2010) FinnWordNet-wordnet på finska via översättning. LexicoNordica.
  • K. Lindén (2008) A probabilistic model for guessing base forms of new words by analogy. In Computational Linguistics and Intelligent Text Processing, pp. 106–116.
  • C. D. Manning and H. Schütze (1999) Foundations of statistical natural language processing. MIT Press.
  • C. D. Manning (2011) Part-of-speech tagging from 97% to 100%: is it time for some linguistics?. In Computational Linguistics and Intelligent Text Processing, pp. 171–189.
  • M. Maxwell and A. David (2008) Joint grammar development by linguists and computer scientists. In Workshop on NLP for Less Privileged Languages, Third International Joint Conference on Natural Language Processing, Hyderabad, India.
  • A. Mikheev (1997) Automatic rule induction for unknown-word guessing. Comput. Linguist. 23 (3), pp. 405–423. ISSN 0891-2017.
  • T. A. Pirinen (2014) Weighted finite-state methods for spell-checking and correction. Ph.D. Thesis, University of Helsinki.
  • T. A. Pirinen (2015) Using weighted finite state morphology with VISL CG-3: some experiments with free open source Finnish resources.
  • T. Pirinen (2008) Suomen kielen äärellistilainen automaattinen morfologinen analyysi avoimen lähdekoodin menetelmin. Master’s Thesis, Helsingin yliopisto.
  • S. Pyysalo, J. Kanerva, A. Missilä, V. Laippala and F. Ginter (2015) Universal dependencies for Finnish. In Nordic Conference of Computational Linguistics NODALIDA 2015, pp. 163.
  • R. Rubino, T. Pirinen, M. Esplà-Gomis, N. Ljubešić, S. Ortiz-Rojas, V. Papavassiliou, P. Prokopidis and A. Toral (2015) Abu-MaTran at WMT 2015 translation task: morphological segmentation and web crawling. EMNLP 2015 (to appear).
  • M. Silfverberg and K. Lindén (2009) HFST runtime format: a compacted transducer format allowing for fast lookup. In FSMNLP, Vol. 13.
  • J. Toivanen, H. Toivonen, A. Valitutti and O. Gross (2012) Corpus-based generation of content and form in poetry. In Proceedings of the Third International Conference on Computational Creativity.
  • A. Voutilainen, K. Muhonen, T. K. Purtonen and K. (2012) Specifying treebanks, outsourcing parsebanks: FinnTreeBank 3. In Proceedings of LREC 2012 8th ELRA Conference on Language Resources and Evaluation.

Contact Information:

Tommi A Pirinen
Ollscoil Chathair Bhaile Átha Cliath
IE-D09 W6Y4
Éire
e-mail: Tommi.Pirinen@computing.dcu.ie