Open morphology for Finnish
Omorfi is at its core a database of lexical data. This page describes what
it contains and how it is stored. The data is stored in TSV (tab-separated
values) files. The specific dialect of TSV is currently determined by Python;
see csv.DictWriter. In the main database we store the lemma field and a
homonym field, which together form the unique key of each lexeme. You can
find these in the file lexemes.tsv. These two fields are joined with a
paradigm field and an origin field, which are required for each lexeme. All
other lexical data is optional and joined on the unique key. You can
currently find the joins in attributes/*.tsv.
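The layout described above can be sketched in Python with csv.DictReader. This is only an illustration under stated assumptions: the column names lemma, homonym, new_para and origin, the attribute column example_attr, and all data values are hypothetical stand-ins, not the contents of the real lexemes.tsv or attributes/*.tsv.

```python
import csv
import io

# Hypothetical miniature stand-ins for lexemes.tsv and one attributes/*.tsv
# file. Column names and values are assumptions made for this sketch.
LEXEMES_TSV = (
    "lemma\thomonym\tnew_para\torigin\n"
    "talo\t1\tNOUN_TALO\texample\n"
    "viini\t1\tNOUN_EXAMPLE\texample\n"
    "viini\t2\tNOUN_EXAMPLE\texample\n"
)
ATTRIBUTES_TSV = (
    "lemma\thomonym\texample_attr\n"
    "viini\t2\tquiver\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_attributes(lexemes, attributes):
    """Index lexemes by (lemma, homonym) and merge optional attribute rows."""
    by_key = {}
    for row in lexemes:
        key = (row["lemma"], row["homonym"])
        if key in by_key:
            raise ValueError(f"duplicate lexeme key: {key}")
        by_key[key] = dict(row)
    for row in attributes:
        key = (row["lemma"], row["homonym"])
        if key not in by_key:
            raise KeyError(f"attribute row for unknown lexeme: {key}")
        for field, value in row.items():
            if field not in ("lemma", "homonym"):
                by_key[key][field] = value
    return by_key


joined = join_attributes(read_tsv(LEXEMES_TSV), read_tsv(ATTRIBUTES_TSV))
```

Lexemes without a matching attribute row simply lack the optional field, which mirrors the idea that everything beyond the key, paradigm and origin is optional.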
Optional fields at the time of writing include:
Fields described per paradigm:
Fields guessed algorithmically:
A dictionary form that also serves as a unique-ish identifier of the word in the dictionary. If the lemma is not unique, the homonym field must make each entry in the database unique.
The homonym id is an arbitrary string; the only requirement is that for each non-unique lemma the homonym fields must differ. For stability in morphological analysis and generation, we use numbers when no other distinguishing feature is found; e.g. viini_1 (wine) and viini_2 (quiver) are distinguished by number since there is no other distinguishing feature. For comparison, see viini on Wiktionary. For words of differing POS we no longer use numbering: FIXME an example here
The current homonym key is the UPOS field plus a number when necessary.
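One way to picture "UPOS plus a number when necessary" is the sketch below. The function name assign_homonym_keys and the exact key format (UPOS, underscore, running number) are assumptions for illustration; the source does not specify how omorfi spells the combined key.

```python
from collections import Counter


def assign_homonym_keys(entries):
    """Give each (lemma, upos) pair a homonym key.

    The key is the bare UPOS when the pair is unique, and UPOS plus a
    running number when the same lemma occurs several times with one UPOS.
    The "UPOS_N" spelling is a hypothetical format chosen for this sketch.
    """
    totals = Counter(entries)
    seen = Counter()
    keyed = []
    for lemma, upos in entries:
        if totals[(lemma, upos)] == 1:
            homonym = upos
        else:
            seen[(lemma, upos)] += 1
            homonym = f"{upos}_{seen[(lemma, upos)]}"
        keyed.append((lemma, homonym))
    return keyed


keys = assign_homonym_keys([("viini", "NOUN"), ("viini", "NOUN"), ("talo", "NOUN")])
```

Numbering follows input order here, so the keys only stay stable if the entries are always listed in the same order.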
Denotes the data source for the lexeme. This is necessary for copyright reasons and is also used to generate the low-coverage, high-precision dictionary.
The paradigm determines the inflection of a word; in omorfi it also contains
some other information. The paradigm key is usually the uppercased UPOS and an
example word from the category, e.g. NOUN_TALO. In database terms there is an
additional database under paradigms.tsv that is joined to lexemes.tsv on the
new_para field to form a master database.
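The join on new_para can be sketched as a many-to-one lookup. The field name new_para and the key NOUN_TALO come from the text above; the other column names (upos, harmony) and all values are hypothetical, and build_master is an illustrative helper rather than part of omorfi.

```python
import csv
import io

# Hypothetical miniature stand-ins for lexemes.tsv and paradigms.tsv.
LEXEMES_TSV = (
    "lemma\thomonym\tnew_para\torigin\n"
    "talo\tNOUN\tNOUN_TALO\texample\n"
    "pelto\tNOUN\tNOUN_TALO\texample\n"
)
PARADIGMS_TSV = (
    "new_para\tupos\tharmony\n"
    "NOUN_TALO\tNOUN\tback\n"
)


def build_master(lexemes_text, paradigms_text):
    """Attach the per-paradigm fields to every lexeme that uses the paradigm."""
    paradigms = {
        row["new_para"]: row
        for row in csv.DictReader(io.StringIO(paradigms_text), delimiter="\t")
    }
    master = []
    for lexeme in csv.DictReader(io.StringIO(lexemes_text), delimiter="\t"):
        merged = dict(lexeme)
        # A missing paradigm raises KeyError, since the field is required.
        merged.update(paradigms[lexeme["new_para"]])
        master.append(merged)
    return master


master = build_master(LEXEMES_TSV, PARADIGMS_TSV)
```

Because many lexemes share one paradigm row, per-paradigm fields are stored once and copied onto each lexeme only when the master records are built.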
Universal POS is a POS value drawn from the Universal Dependencies standard. This has been the main POS value used in omorfi since 2015.
The legacy POS value used by omorfi is strictly limited to morphological features that can be seen from the inflection of the word. It works like this: nouns inflect in case/number forms; adjectives have comparative derivation on top of that; verbs inflect in tense/mood and person forms, among others; particles do not inflect.
An official dictionary classification.
An official dictionary classification.
Determines whether a nominal is allowed to have singular forms.
Determines whether a partially inflecting word can take possessives.
Determines whether a partially inflecting or non-inflecting word can take clitics.
A semantic class for a proper noun, probably from FINER data, i.e. a named entity class.
Any arbitrary pragmatic usage limitation for the word.
Gradation is split into two cases depending on which grade appears in the lemma form.
The suffixes depend on the vowel frontness of the word.
Determines the variant of uo/yö/ie words.
Determines the vowel of illative forms, if needed.
Pronunciation information from the stem, or a differing pre-defined pronunciation, is stored here for guessing other features (vowel harmony).
For compounds, the lemma with word part boundaries is given here to help determine the vowel harmony correctly. The boundaries are also used for some non-compound boundaries.
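Why the boundaries help can be shown with a deliberately simplified textbook rule: a word takes back-vowel suffixes if it contains a, o or u, and front-vowel suffixes otherwise. This is a sketch, not omorfi's actual guesser; the boundary marker "|" and the function name guess_harmony are hypothetical, and exceptions such as loanwords are ignored.

```python
BACK_VOWELS = set("aouAOU")


def guess_harmony(lemma_with_boundaries, boundary="|"):
    """Guess suffix vowel harmony from the last word part only.

    Simplified rule: any back vowel (a, o, u) in the last part means back
    harmony; otherwise front harmony. The "|" boundary marker is an
    assumption made for this sketch.
    """
    last_part = lemma_with_boundaries.split(boundary)[-1]
    if any(ch in BACK_VOWELS for ch in last_part):
        return "back"
    return "front"


with_boundary = guess_harmony("kirja|hylly")
without_boundary = guess_harmony("kirjahylly")
```

With the boundary, only hylly is inspected and the guess is front; without it, the a in kirja misleads the same rule into guessing back.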
Determines additional analyses for words of a given POS. These are used to inject additional lexical data into analyses, e.g.: particle -> adposition, preposition, genitive complement; numeral -> ordinal roman digit; verb -> transitive with elative argument.
Determines optional semantic classes. There is no exhaustive use list yet.
The stub is the part of a word that does not undergo any alternations, and is thus a good starting point for many practical implementations of morphology.
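As a rough illustration of the idea, the unchanging part can be approximated as the longest common prefix of a word's inflected forms. This is an assumption-laden sketch, not how omorfi computes stubs; the helper name stub_of and the hand-picked form lists are illustrative only.

```python
import os.path


def stub_of(forms):
    """Approximate the stub as the longest common prefix of the given forms.

    os.path.commonprefix compares strings character by character, so it
    works on arbitrary words, not only on file paths.
    """
    return os.path.commonprefix(list(forms))


# katu shows t ~ d gradation, so the invariant part ends before it.
katu_stub = stub_of(["katu", "kadun", "katua", "kaduilla"])
# talo has no alternation in these forms, so the whole lemma survives.
talo_stub = stub_of(["talo", "talon", "taloa"])
```

The result depends on which forms are supplied, so a stub derived this way is only as reliable as the sample of forms.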
For analysis styles like FTB 3.1, the lemmas for compound-initial words differ from those for compound-final words. To accommodate this, a complex structure containing both stub and stem is written into lexc files to avoid duplicating all the lexicons.
The original omorfi implementation used stems with gradation marked, but with lots of other variation handled in lexc.
This stem could use twol rules for all variations.