Open morphology for Finnish
Outdated
This document helps developers and contributors of omorfi understand the directory structure of the code and the files it contains.
Directory tree:
. # Top-level directory contains necessary autotools stuff
├── config-aux # generated by autotools, DO NOT TOUCH
├── doc # hand-written documentation may be added here
├── man # all end-user command-line tools have manual pages here
├── autom4te.cache # generated by autotools, DO NOT TOUCH
├── src # all code and data reside here, along with the build scripts
│ ├── attributes # lexical data (per lexeme joins)
│ ├── bash # bash scripts
│ ├── continuations # Morphology as modeled in morph-concatenation model
│ ├── docs # in-source documents and wiki pages generated from these
│ ├── examples # examples for tests and documentation
│ ├── externals # lexical data outside omorfi and scripts to handle it
│ ├── generated # holds build process generated files
│ ├── java # java API
│ │ └── com # obligatory JAVA dir struct.
│ │ ├── github # obligatory JAVA dir struct.
│ │ │ └── flammie # obligatory JAVA dir struct.
│ │ │ └── omorfi # obligatory JAVA dir struct.
│ ├── paradigms # paradigm data (per paradigm joins)
│ ├── perl # perl scripts
│ ├── python # python scripts
│ │ └── omorfi # python API
│ │ └── __pycache__ # python may generate these
│ ├── scripts # installable scripts for end-user applications
│ └── voikko # build directory for spell-checker
└── test # Automatic test scripts
File tree:
.
├── aclocal.m4 # aclocal generates, ignore
├── AUTHORS # list of authors relevant for copyright issues
├── autogen.sh # use to generate autotools setup if autoreconf fails
├── autom4te.cache # autotools generates, ignore
│ ├── output.0 # ''
│ ├── output.1 # ''
│ ├── requests # ''
│ ├── traces.0 # ''
│ └── traces.1 # ''
├── ChangeLog.old # ChangeLog prior to moving to git
├── config-aux # autotools generates, ignore
│ ├── install-sh # ''
│ ├── missing # ''
│ └── test-driver # ''
├── config.log # autoconf log, read when configure fails
├── config.status # autoconf status, run to remake current settings
├── configure # configure script, run to initialise or change settings
├── configure.ac # autoconf configuration of configure
├── COPYING # GNU GPL licence
├── doc # doc tree
│ ├── Makefile # generated makefile, ignore
│ ├── Makefile.am # doc compilation rules
│ └── Makefile.in # generated makefile, ignore
├── INSTALL # generated GNU installation instructions
├── Makefile # generated makefile, ignore
├── Makefile.am # root dir make rules: pkg-config installation etc.
├── Makefile.in # generated makefile, ignore
├── man # manual pages for end-user scripts
│ ├── Makefile # generated makefile, ignore
│ ├── Makefile.am # manual make rules: man installation
│ ├── Makefile.in # generated makefile, ignore
│ ├── omorfi-analyse.1 # man page for analyse
│ ├── omorfi-generate.1 # man page for generate
│ ├── omorfi-hyphenate.1 # man page for hyphenate
│ ├── omorfi-interactive.1 # man page for interactive
│ └── omorfi-spell.1 # man page for spell
├── NEWS # Major changes between releases
├── omorfi.pc # generated pkg-config data
├── omorfi.pc.in # pkg-config settings
├── README # README documentation, everyone reads
├── README.bindist # notes about binary distributions
├── src # all source data and databases
│ ├── add-word.bash # add a new word
│ ├── apertium_formatter.py # functions for formatting apertium style dictionaries
│ ├── attributes # ancillary lexical data
│ ├── change-class.bash # change lexemes' class
│ ├── docs # documents for autogeneration
│ │ ├── paradigms.tsv # documents for paradigms
│ │ └── stuff.tsv # documents for internal symbols
│ ├── experimental_xml_formatter.py # experiment
│ ├── externals # external lexical databases
│ │ ├── fiwikt2omorfi.bash # script to convert fi.wiktionary data to omorfi
│ │ ├── fiwiktionary.bash # another one
│ │ ├── joukahainen.xml # joukahainen database
│ │ └── kotus-sanalista_v1.xml # kotus database
│ ├── find-redundant-lexemes.py # script to remove redundancies
│ ├── ftb3_formatter.py # functions for ftb3 formatting
│ ├── generated # generated build files
│ │ ├── apertium-fin.fin.lexc # apertium style lexc
│ │ ├── apertium-fin.fin.lexc.hfst # compiled apertium lexc
│ │ ├── apertium-fin.fin.twolc # apertium style twolc
│ │ ├── apertium-fin.fin.twolc.hfst # compiled apertium twolc
│ │ ├── errmodel.edit-distance-2.hfst # compiled edit distance 2
│ │ ├── errmodel.edit-distance.hfst # compiled edit distance 1
│ │ ├── errmodel.edit-distance.txt # generated edit distance 1
│ │ ├── fin-automorf.hfst # compiled apertium automaton
│ │ ├── inflections.tsv # generated inflection database
│ │ ├── joint.tsv # generated lexeme database join ancillary data
│ │ ├── master.tsv # generated final lexeme database
│ │ ├── omorfi.accept.hfst # generic word-form acceptor
│ │ ├── omorfi-between-tokens.regex # generated tokeniser split symbol
│ │ ├── omorfi-between-tokens.regex.hfst # compiled tokeniser split
│ │ ├── omorfi-ftb3.analyse.hfst # compiled ftb3.1 analyser
│ │ ├── omorfi-ftb3.generate.hfst # compiled ftb3.1 generator
│ │ ├── omorfi-ftb3.lexc # generated ftb3.1 style lexc
│ │ ├── omorfi-ftb3.lexc.hfst # compiled ftb3.1 lexc
│ │ ├── omorfi-ftb3.reweight # generated ftb3.1 simple weights
│ │ ├── omorfi-ftb3-rewrite-tags.regex # generated ftb3.1 tagging hacks
│ │ ├── omorfi-ftb3-rewrite-tags.regex.hfst # compiled ftb3.1 tagging hacks
│ │ ├── omorfi.hyphenate.hfst # compiled hyphenation dictionary
│ │ ├── omorfi-hyphenate.twolc # generated hyphenation rules
│ │ ├── omorfi-hyphenate.twolc.hfst # compiled hyphenation rules
│ │ ├── omorfi-hyphens.twolc # obligatory hyphenation rules
│ │ ├── omorfi-hyphens.twolc.hfst # compiled oblig. hyphenation
│ │ ├── omorfi.lemmatise.hfst # compiled lemmatiser
│ │ ├── omorfi.lexc # generated generic lexc
│ │ ├── omorfi.lexc.hfst # compiled generic lexc
│ │ ├── omorfi.nondict-token.hfst # compiled tokeniser for oov's
│ │ ├── omorfi-omor.analyse.hfst # compiled omor-style analyser
│ │ ├── omorfi-omor.generate.hfst # compiled omor-style generator
│ │ ├── omorfi-omor.lexc # generated omor-style lexc
│ │ ├── omorfi-omor.lexc.hfst # compiled omor-style lexc
│ │ ├── omorfi-omor.reweight # generated omor-style simple weights
│ │ ├── omorfi-orthographic-variations.regex # generated orthographical rules
│ │ ├── omorfi-orthographic-variations.regex.hfst # compiled orthographical rules
│ │ ├── omorfi-recase-any.twolc # generated recasing rules for free recasing
│ │ ├── omorfi-recase-any.twolc.hfst # compiled free recasing
│ │ ├── omorfi-remove-boundaries.regex # generated special symbol cleanup
│ │ ├── omorfi-remove-boundaries.regex.hfst # compiled special symbol cleanup
│ │ ├── omorfi.segment.hfst # compiled segmentation automaton
│ │ ├── omorfi-sh.regex # rules for š
│ │ ├── omorfi-sh.regex.hfst # compiled rules for š
│ │ ├── omorfi.tokenise.hfst # compiled tokeniser
│ │ ├── omorfi-token-joiner.hfst # compiled token-medial symbols
│ │ ├── omorfi-token.regex # generated token patterns
│ │ ├── omorfi-token.regex.hfst # compiled token patterns
│ │ ├── omorfi.token-separator.hfst # compiled token splitters
│ │ ├── omorfi-uppercase-first.twolc # generated uppercasing rule for initial uppercase
│ │ ├── omorfi-uppercase-first.twolc.hfst # compiled initial uppercasing
│ │ ├── omorfi-zh.regex # generated rules for ž
│ │ ├── omorfi-zh.regex.hfst # compiled rules for ž
│ │ ├── stemparts.tsv # generated database for stem variants
│ │ ├── temporary.ftb3.hfst # final step for ftb3.1 compilation
│ │ ├── temporary-ftb3.hyphenated.hfst
│ │ ├── temporary-ftb3.orth.hfst
│ │ ├── temporary-ftb3.relaxed.hfst
│ │ ├── temporary-ftb3.tagged.hfst
│ │ ├── temporary-ftb3.tagweighted.hfst
│ │ ├── temporary-ftb3.unbounded.hfst
│ │ ├── temporary.hyphenated.hfst
│ │ ├── temporary.omor.hfst # final step in omor compilation
│ │ ├── temporary-omor.hyphenated.hfst
│ │ ├── temporary-omor.orth.hfst
│ │ ├── temporary-omor.relaxed.hfst
│ │ ├── temporary-omor.tagweighted.hfst
│ │ ├── temporary-omor.tokenweighted.hfst
│ │ ├── temporary-omor.unbounded.hfst
│ │ ├── temporary-omor.weighted.hfst
│ │ ├── temporary.orth.hfst
│ │ ├── temporary.relaxed.hfst
│ │ ├── temporary.tagged.hfst
│ │ ├── temporary.tagweighted.hfst
│ │ ├── temporary.tokenweighted.hfst
│ │ ├── temporary.unbounded.hfst
│ │ ├── temporary.weighted.hfst
│ │ └── timestamp # hack for autotoolsing generated directory
│ ├── generate-edit-distance.py # script to generate edit distances
│ ├── generate-googlecodewiki.py # script to generate wiki pages from database
│ ├── generate-kotus-sanalista.py # script to generate kotus-sanalista style XML
│ ├── generate-lexcs.py # script to generate lexc files
│ ├── generate-monodix.py # script to generate apertium monodix
│ ├── generate-regexes.py # script to generate regex rules
│ ├── generate-reweights.py # script to generate simple weighting
│ ├── generate-twolcs.py # script to generate twolc rules
│ ├── generate-yaml.py # script to generate yaml tests
│ ├── giella_formatter.py # functions for formatting giellatekno style data
│ ├── gradation.py # python rules for finding gradation in dictionary forms
│ ├── guess-csv2tsv.py # script for old to new database changes
│ ├── guess_feats.py # python rules for guessing features from dictionary form
│ ├── guess_new_class.py # python rules for guessing paradigm from dictionary word
│ ├── kotus_sanalista_formatter.py # functions for formatting kotus-sanalista XML
│ ├── lexc_formatter.py # functions for formatting lexc
│ ├── lexemes # lexical databases
│ │ ├── boundaries.tsv # known intra-word boundaries: compound etc.
│ │ ├── broken-paradigms.tsv # paradigms with wrong classes in dictionaries
│ │ ├── origin.tsv # original database of the lexemes
│ │ ├── particle-classes.tsv # particles' semantics and syntax
│ │ ├── plurale-tantum.tsv # plurale tantum words
│ │ ├── possessives.tsv # particles with possessives
│ │ ├── pronoun-classes.tsv # pronouns' semantics
│ │ ├── pronunciation.tsv # orthography to phonemics mismatches
│ │ ├── proper-classes.tsv # proper nouns' semantics
│ │ ├── semantic.tsv # nouns' semantics
│ │ ├── subcategories.tsv # unorganised data
│ │ ├── symbol-classes.tsv # symbols' semantics
│ │ ├── usage.tsv # special usage: dialects, non-standard, etc.
│ │ ├── verb-arguments.tsv # verbs' syntax
│ │ └── lexemes.tsv # the lexical database
│ ├── Makefile # generated makefile
│ ├── Makefile.am # central make rules for everything
│ ├── Makefile.in # generated makefile
│ ├── omorfi_settings.py # general settings for Finnish language and special symbols
│ ├── omor_formatter.py # functions for formatting omor style data
│ ├── omor_strings_io.py # functions for string and i/o handling, error messaging etc.
│ ├── paradigms # paradigm data (per paradigm)
│ │ └── paradigm-data.tsv # paradigm-specific data
│ ├── parse_csv_data.py # functions for parsing tsv databases
│ ├── plurale_tantum.py # python rules for guessing singular from plural
│ ├── __pycache__ # python's generated cache to ignore
│ │ ├── apertium_formatter.cpython-34.pyc
│ │ ├── ftb3_formatter.cpython-34.pyc
│ │ ├── gradation.cpython-34.pyc
│ │ ├── guess_feats.cpython-34.pyc
│ │ ├── guess_new_class.cpython-34.pyc
│ │ ├── lexc_formatter.cpython-34.pyc
│ │ ├── omorfi_settings.cpython-34.pyc
│ │ ├── omor_formatter.cpython-34.pyc
│ │ ├── omor_strings_io.cpython-34.pyc
│ │ ├── parse_csv_data.cpython-34.pyc
│ │ ├── plurale_tantum.cpython-34.pyc
│ │ ├── regex_formatter.cpython-34.pyc
│ │ ├── stub.cpython-34.pyc
│ │ ├── tagset_formatter.cpython-34.pyc
│ │ ├── twolc_formatter.cpython-34.pyc
│ │ └── wordmap.cpython-34.pyc
│ ├── regex_formatter.py # functions for formatting xerox regexes
│ ├── remove-word.bash # remove word from databases
│ ├── scripts # end-user scripts
│ │ ├── convert_tag_format.py # scripts for converting tags
│ │ ├── generate-wordforms.sh # generate all forms of word
│ │ ├── Makefile.am # Make rules for end-user script installation
│ │ ├── omor2apertium.sed # sed rules for turning omor style to apertium
│ │ ├── omor2apertium.sh # script for turning omor style to apertium
│ │ ├── omorfi-analyse.sh # generated analyse script
│ │ ├── omorfi-analyse.sh.in # analyse script sources
│ │ ├── omorfi-generate.sh # generated generate script
│ │ ├── omorfi-generate.sh.in # generation script sources
│ │ ├── omorfi-hyphenate.sh # generated hyphenation script
│ │ ├── omorfi-hyphenate.sh.in # hyphenation script sources
│ │ ├── omorfi-interactive.sh # generated tokenised analysis
│ │ ├── omorfi-interactive.sh.in # tokenised analysis sources
│ │ ├── omorfi.py # python library for omorfi
│ │ ├── omorfi-spell.sh # generated spelling script
│ │ └── omorfi-spell.sh.in # spelling script sources
│ ├── set-attribute.bash # set attribute of a lexeme
│ ├── stub.py # python rules to turn dictionary form into invariant stub
│ ├── stub-stem-inflection # stem and suffix morph databases
│ │ ├── 51-stems.tsv # full-form lists of compounds with irregular agreeing inflection
│ │ ├── acro-inflections.tsv # suffix morphs for acronyms
│ │ ├── acronym-stems.tsv # stem morphs for acronyms
│ │ ├── adjective-inflections.tsv # suffix morphs for adjectives
│ │ ├── adjective-stems.tsv # stem morphs for adjectives
│ │ ├── digit-inflections.tsv # suffix morphs for digits
│ │ ├── digit-stems.tsv # stem morphs for digits
│ │ ├── digit-stubs.tsv # root morphs for digits
│ │ ├── noun-inflections.tsv # suffix morphs for nouns
│ │ ├── noun-stems.tsv # stem morphs for nouns
│ │ ├── numeral-inflections.tsv # suffix morphs for numerals
│ │ ├── numeral-stems.tsv # stem morphs for numerals
│ │ ├── particle-inflections.tsv # suffix morphs for particles
│ │ ├── particle-stems.tsv # stem morphs for particles
│ │ ├── pronoun-inflections.tsv # suffix morphs for pronouns
│ │ ├── pronoun-stems.tsv # stem morphs for pronouns
│ │ ├── symbol-inflections.tsv # suffix morphs for symbols
│ │ ├── symbol-stems.tsv # stem morphs for symbols
│ │ ├── verb-inflections.tsv # suffix morphs for verbs
│ │ └── verb-stems.tsv # stem morphs for verbs
│ ├── tdt_formatter.py # functions for TDT formatting
│ ├── tsv_expand.py # script for guessing missing data in lexical database
│ ├── tsvjoin.py # script for tsv database joins for lexemes and ancillary databases
│ ├── twolc_formatter.py # functions for formatting twol rules
│ ├── voikko # generated spell-checker
│ │ ├── acceptor.default.hfst # default dictionary
│ │ ├── errmodel.default.hfst # default error model
│ │ ├── index.xml # generated metadata
│ │ ├── index.xml.in # metadata source
│ │ └── speller-omorfi.zhfst # generated spell-checker package
│ └── wordmap.py # functions for python dict version of database
├── test # test scripts
│ ├── clusterstuff.pbs.in # source for torque cluster script
│ ├── corps-to-googlecodewiki.sh # script to generate wiki page from test results
│ ├── corpus-measures.sh # script for testing large corpora for coverage etc.
│ ├── corpus-tests.py # scripts for testing large corpora for errors in database
│ ├── count_tsv.awk #
│ ├── europarl-coverage.sh # generated europarl coverage script
│ ├── europarl-coverage.sh.in # sources for europarl coverage measure
│ ├── find_errs.awk #
│ ├── fiwiki-coverage.sh # generated fi.wiki coverage script
│ ├── fiwiki-coverage.sh.in # sources for fi.wiki coverage measure
│ ├── ftb31-coverage.sh # generated ftb3.1 coverage script
│ ├── ftb31-coverage.sh.in # sources for ftb3.1 coverage script
│ ├── ftb-test.py # ftb3.1 faithfulness script
│ ├── ftb-test.sh # ftb3.1 faithfulness wrapper script
│ ├── ftc-test.py # FTC faithfulness script (NON-FREE data!)
│ ├── ftc-test.sh # FTC wrapper script (NON-FREE data!)
│ ├── gutenberg-coverage.sh # generated gutenberg coverage script
│ ├── gutenberg-coverage.sh.in # sources for gutenberg coverage script
│ ├── jrc-acquis-coverage.sh # generated jrc acquis coverage script
│ ├── jrc-acquis-coverage.sh.in # sources for jrc acquis coverage script
│ ├── Makefile # generated makefile
│ ├── Makefile.am # make rules for testing
│ ├── Makefile.in # generated makefile
│ ├── prop-corpus-tests.py # script for lexical data errors on proper nouns
│ ├── rough-tests.sh # fast simple tests for workability of analysers
│ ├── scripts-runnable.sh # test script for workability of end-user scripts
│ ├── speed-test.sh # generated speed test script
│ ├── speed-test.sh.in # sources for speed test script
│ ├── test-header.yaml # generated headers for yaml testing
│ ├── test-header.yaml.in # sources for yaml test headers
│ └── wordforms.list # list of word-forms that should always be analysed
├── THANKS # list of all contributors and benefactors (not legally binding)
└── TODO # list of things broken and missing
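
Many entries in the tree above are marked as generated or "ignore". As a rough aid for contributors, the following Python sketch separates hand-maintained files from generated ones. It is not part of omorfi: the names `GENERATED_DIRS`, `GENERATED_FILES`, `is_generated` and `hand_maintained` are hypothetical, and the rules are only a reading of the annotations in the listing (generated directories, generated makefiles, `.pyc` caches, and files that have a matching `.in` source next to them).

```python
from pathlib import PurePosixPath

# Assumption: these sets only mirror entries the listing marks as
# generated or "ignore"; they are not read from omorfi's build system.
GENERATED_DIRS = {"autom4te.cache", "config-aux", "__pycache__", "generated"}
GENERATED_FILES = {"aclocal.m4", "INSTALL", "Makefile", "Makefile.in", "omorfi.pc"}


def is_generated(path, all_paths):
    """Return True if `path` looks generated according to the listing."""
    p = PurePosixPath(path)
    # Anything inside (or equal to) a generated directory.
    if any(part in GENERATED_DIRS for part in p.parts):
        return True
    # Generated makefiles, pkg-config data and similar.
    if p.name in GENERATED_FILES:
        return True
    # Python byte-code caches.
    if p.suffix == ".pyc":
        return True
    # A file such as omorfi-analyse.sh with omorfi-analyse.sh.in beside it.
    return path + ".in" in all_paths


def hand_maintained(all_paths):
    """Return the sorted paths that are not considered generated."""
    return sorted(p for p in all_paths if not is_generated(p, all_paths))


if __name__ == "__main__":
    example = {
        "src/Makefile.am",
        "src/Makefile.in",
        "src/Makefile",
        "src/generated/omorfi.lexc",
        "src/scripts/omorfi-analyse.sh",
        "src/scripts/omorfi-analyse.sh.in",
        "src/lexemes/lexemes.tsv",
        "src/__pycache__/stub.cpython-34.pyc",
    }
    for path in hand_maintained(example):
        print(path)
```

Since this page is flagged as outdated, any such filter should be checked against the current repository before it is relied on.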