omorfi

Open morphology for Finnish


Project maintained by flammie

Omorfi: Open morphology of Finnish


Omorfi is a free and open source project containing various tools and data for natural language processing of Finnish, based on a knowledge-driven paradigm. The main components of this repository are:

  1. a lexical database containing hundreds of thousands of words (cf. lexical statistics)
  2. a collection of conversion scripts that convert the lexical database into formats used by upstream NLP tools (cf. lexical processing)
  3. a collection of utility scripts for processing Finnish texts on the command line (cf. usage examples)
  4. an autotools setup to build and install (or package, or deploy) the scripts, the database, and simple APIs / convenience processing tools
  5. a collection of relatively simple APIs, with bindings for a selection of programming languages, and scripts to apply the NLP tools and access the database

The formats we produce are (links to free open source implementations included):

  1. lexc, as processed by HFST and foma, to be used for morphological analysis, stemming, segmentation, natural language generation and hyphenation, and as a basis for language models
    1. we provide pre-built automata binaries for each release as a convenient download
  2. apertium, to be used for machine translation
  3. voikko, to be used for spell-checking and correction (also experimental hunspell for legacy spell-checking)
  4. kotus-sanalista, lexical markup framework, tab-separated values, etc., for long- and short-term storage and as intermediate formats
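To give a rough idea of what downstream code does with morphological analysis output, here is a minimal sketch. It assumes the three-column, tab-separated layout (surface form, analysis, weight) that HFST's lookup tools print. The function name `parse_lookup_output` and the sample lines are hypothetical and written for illustration only; they are not actual omorfi output or part of omorfi's API.

```python
from collections import defaultdict


def parse_lookup_output(text):
    """Group analyses by surface form.

    Assumes each non-empty line is: surface<TAB>analysis<TAB>weight.
    Lines that do not match this assumed layout are skipped.
    """
    analyses = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines separate tokens
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # not in the assumed three-column layout
        surface, analysis, weight = fields
        analyses[surface].append((analysis, float(weight)))
    return dict(analyses)


# Hypothetical sample input, invented for illustration only.
SAMPLE = (
    "kauppa\t[WORD_ID=kauppa][UPOS=NOUN][NUM=SG][CASE=NOM]\t0.0\n"
    "\n"
    "kaupan\t[WORD_ID=kauppa][UPOS=NOUN][NUM=SG][CASE=GEN]\t0.0\n"
)

if __name__ == "__main__":
    for surface, readings in parse_lookup_output(SAMPLE).items():
        for analysis, weight in readings:
            print(surface, analysis, weight, sep=" | ")
```

Grouping by surface form keeps ambiguous words together, so a caller can pick among several readings of the same token.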

Documentation

The most recent version of this documentation is available online on GitHub Pages at https://flammie.github.io/omorfi/

Basics

  1. README
  2. Installation
  3. Usage examples

Bindings

If you wish to use omorfi in a serious application, you have probably found out from the README that the Python or Java API is the way to go:

  1. API design
  2. doxygen apidocs

Statistics and generated listings

Some semi-automatically generated statistics are available.

We also have generated documentation for some aspects of the database, such as paradigms or noteworthy words:

  1. Words, particularly those that are problematic (a FAQ for word entries, in a way)
  2. Paradigms, i.e. inflection patterns
  3. Internal keys and codes
  4. All forms of kauppa, a retake on an old experiment

Design, historical notes, stuff

Some notes about design and development:

  1. Design “principles” for tags
  2. Directory layout
  3. Database structure
  4. Testing

Contact

If you want to discuss omorfi in Finnish or English, the IRC channels #omorfi and #hfst on Freenode are available for immediate chats (a Freenode webchat is also available). The Google Groups discussion list omorfi-devel@groups.google.com (also reachable through the Google Groups web interface) can be used as well; it may require a subscription but is very low volume. Suggestions, bug reports, corrections and new lexical data can be sent using omorfi’s issue tracker on GitHub. Pull requests are accepted.

Alternatives to omorfi

If omorfi doesn’t suit your needs, you may want to try other similar products. One is suomi-malaga, of voikko fame, another morphological analyser for Finnish. Grammatical Framework, which is written in Haskell, also has NLP components for Finnish.

If you want to use commercial products, there are surely some available somewhere.