omorfi
Open morphology for Finnish
Project maintained by flammie
Omorfi: Open morphology of Finnish
Omorfi is a free and open source project containing various tools and data for
natural language processing of Finnish, based on a knowledge-driven paradigm.
Potential use cases include:
- morphosyntactic analysis, e.g. generating UniMorph or
Universal Dependencies analyses from word forms
or running text
- building named entity recognisers, sentiment analysers or other high-level
applications
- building rule-based machine translation, e.g.
apertium-fin
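As a rough illustration of the first use case, the sketch below turns a bracketed analysis string into a Universal Dependencies style feature string. The `[KEY=VALUE]` layout, the sample analysis, the `UD_FEATURES` table and both helper functions are assumptions made for this example only; consult the usage examples for the formats the tools actually emit.

```python
import re

# Assumed tag layout for this sketch: a run of [KEY=VALUE] groups.
TAG_RE = re.compile(r"\[([A-Z_]+)=([^\]]*)\]")

# Hypothetical, deliberately tiny mapping from analyser tags to UD features.
UD_FEATURES = {
    ("NUM", "SG"): ("Number", "Sing"),
    ("NUM", "PL"): ("Number", "Plur"),
    ("CASE", "NOM"): ("Case", "Nom"),
    ("CASE", "INE"): ("Case", "Ine"),
}


def parse_analysis(analysis: str) -> dict:
    """Split a bracketed analysis string into a key -> value dict."""
    return dict(TAG_RE.findall(analysis))


def to_ud_feats(tags: dict) -> str:
    """Render the mapped tags as a FEATS string sorted by feature name."""
    feats = [UD_FEATURES[(k, v)] for k, v in tags.items() if (k, v) in UD_FEATURES]
    # "_" is the UD convention for an empty feature set.
    return "|".join(f"{name}={value}" for name, value in sorted(feats)) or "_"


# Illustrative input only, not captured from a real analyser run.
tags = parse_analysis("[WORD_ID=talo][UPOS=NOUN][NUM=SG][CASE=INE]")
print(tags["WORD_ID"], tags["UPOS"], to_ud_feats(tags))
```

Tags without an entry in the mapping are silently skipped here; a real converter would more likely report them.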
The main components of this repository are:
- a lexical database containing hundreds of thousands of words (see
lexical statistics)
- a collection of utility scripts for processing Finnish text on the command line
(see usage examples)
- a collection of conversion scripts that convert the lexical database into formats
used by upstream NLP tools (see lexical processing)
- an autotools setup to build and install (or package, or deploy) the
scripts, the database, and the simple APIs and convenience processing tools
- a collection of relatively simple APIs, with bindings for a
selection of programming languages, and scripts to apply the NLP tools and
access the database
We produce the following file formats (links to free and open source
implementations included):
- lexc, as processed by HFST and
foma, to be used for morphological analysis,
stemming, segmentation, natural language generation, hyphenation and
as a basis for language models,
- we provide pre-built automata binaries for each release as a convenient
download
- apertium, to be used for machine translation
- voikko, to be used for spell-checking and
correction (also experimental hunspell for legacy spell-checking)
- kotus-sanalista, Lexical Markup Framework, tab-separated values, etc.,
used for long- and short-term storage and as intermediate formats.
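To give a feel for working with a tab-separated intermediate format, here is a minimal sketch that reads lexical entries and counts them per paradigm. The column names (`lemma`, `paradigm`, `origin`), the paradigm labels and the sample rows are hypothetical and chosen for illustration; check the database structure notes for the real layout.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data: column names and paradigm labels are assumptions.
SAMPLE = (
    "lemma\tparadigm\torigin\n"
    "kauppa\tNOUN_KAUPPA\texample\n"
    "talo\tNOUN_TALO\texample\n"
    "valo\tNOUN_TALO\texample\n"
)


def read_lexemes(handle):
    """Read tab-separated rows into a list of dicts keyed by the header line."""
    # QUOTE_NONE keeps quote characters inside lemmas from being interpreted.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def count_by_paradigm(entries):
    """Count how many entries share each paradigm label."""
    return Counter(entry["paradigm"] for entry in entries)


entries = read_lexemes(io.StringIO(SAMPLE))
for paradigm, count in sorted(count_by_paradigm(entries).items()):
    print(paradigm, count)
```

For a file on disk, the same function could be given a handle from `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.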
Documentation
The most recent version of this documentation is online on GitHub Pages at
https://flammie.github.io/omorfi/ (which should be
this page).
Basics
Read this first:
- README
- Installation
- Usage examples
Bindings
Using omorfi language models from a programming language (Python, C++ or Java):
- API design
- doxygen apidocs
Statistics and generated listings
Some semi-automatically generated statistics are
available.
We also have generated documentation for some aspects of the database, such as
paradigms or noteworthy words:
- Words, particularly those that are problematic (a FAQ for
word entries, in a way)
- Paradigms, i.e. inflection patterns
- Internal keys and codes
- All forms of kauppa, a retake on an old experiment
Design, historical notes, stuff
Some notes about design and development:
- Design “principles” for tags
- Directory layout
- Database structure
- Testing
If you want to discuss omorfi in Finnish or English, there is a Matrix
chat room. The Google Groups discussion
list omorfi-devel@groups.google.com (Google Groups web interface
here) can also be used;
it may require a subscription but is very low volume. Suggestions, bug reports,
corrections and new lexical data can be sent using GitHub’s omorfi issue
tracker. Pull requests are accepted.
Alternatives to omorfi
For many NLP tasks a neural language model may be more suitable, and such models
are relatively easy to use and customise these days. Check 🤗HuggingFace or
TurkuNLP for examples.
If omorfi doesn’t suit your needs, you may want to try other similar products:
suomi-malaga, known from voikko, is another
morphological analyser of Finnish. Grammatical
Framework also has NLP components for
Finnish, and is written in Haskell.
For modern neural network approaches, see TurkuNLP and their
parsers.
If you want to use commercial products, there are surely some available
somewhere.