Introduction

This document helps omorfi developers and contributors understand the directory structure of the code and the files it contains.

Details

Directory tree:

.  # Top-level directory contains necessary autotools stuff
├── config-aux  # generated by autotools, DO NOT TOUCH
├── doc # hand-written documentation may be added here
├── man # all end-user command-line tools have manual pages here
├── autom4te.cache # generated by autotools, DO NOT TOUCH
├── src # all code and data reside here, along with the build scripts
│   ├── attributes # lexical data (per lexeme joins)
│   ├── bash # bash scripts
│   ├── continuations # Morphology as modeled in morph-concatenation model
│   ├── docs # in-source documents and wiki pages generated from these
│   ├── examples # examples for tests and documentation
│   ├── externals # lexical data outside omorfi and scripts to handle it 
│   ├── generated # holds build process generated files
│   ├── java # java API
│   │   └── com # obligatory JAVA dir struct.
│   │       ├── github # obligatory JAVA dir struct.
│   │       │   └── flammie # obligatory JAVA dir struct.
│   │       │       └── omorfi # obligatory JAVA dir struct.
│   ├── paradigms # paradigm data (per paradigm joins)
│   ├── perl # perl scripts
│   ├── python # python scripts 
│   │   └── omorfi # python API
│   │       └── __pycache__ # python may generate these
│   ├── scripts # installable scripts for end-user applications
│   └── voikko # build directory for spell-checker
└── test # Automatic test scripts
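
The annotated tree above can also be read programmatically, for example to look up the note attached to a given directory. The following is only a minimal sketch: `parse_tree` is a hypothetical helper that is not part of omorfi, and it assumes the listing uses the same four-character-wide `├── ` / `└── ` markers and `#` comments as this page.

```python
def parse_tree(text):
    """Map each path in an annotated tree listing to its comment."""
    entries = {}
    stack = []  # names of the ancestors of the current line
    for line in text.splitlines():
        for marker in ("├── ", "└── "):
            pos = line.find(marker)
            if pos != -1:
                break
        else:
            continue  # the root line "." or a blank line: nothing to record
        depth = pos // 4  # each nesting level is four characters wide
        name, _, comment = line[pos + len(marker):].partition("#")
        del stack[depth:]  # drop entries that are not ancestors of this one
        stack.append(name.strip())
        entries["/".join(stack)] = comment.strip()
    return entries


# A shortened copy of the directory tree, used only as sample input.
sample = """\
.  # Top-level directory
├── doc # hand-written documentation
├── src # all code and data
│   ├── java # java API
│   │   └── com # obligatory JAVA dir struct.
│   └── python # python scripts
└── test # Automatic test scripts
"""

tree = parse_tree(sample)
print(tree["src/java/com"])
```

The helper keeps a stack of parent names so that nested entries such as `src/java/com` get their full path; it does not check that the directories exist on disk.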

File tree:

.
├── aclocal.m4 # aclocal generates, ignore
├── AUTHORS # list of authors relevant for copyright issues
├── autogen.sh # use to generate autotools setup if autoreconf fails
├── autom4te.cache # autotools generates, ignore
│   ├── output.0   # ''
│   ├── output.1   # ''
│   ├── requests   # ''
│   ├── traces.0   # ''
│   └── traces.1   # ''
├── ChangeLog.old  # ChangeLog from before the move to git
├── config-aux     # autotools generates, ignore
│   ├── install-sh # ''
│   ├── missing    # ''
│   └── test-driver # ''
├── config.log # autoconf log, read when configure fails
├── config.status # autoconf status, run to remake current settings
├── configure # configure script, run to initialise or change settings
├── configure.ac # autoconf configuration of configure 
├── COPYING # GNU GPL licence
├── doc # doc tree
│   ├── Makefile # generated makefile, ignore
│   ├── Makefile.am # doc compilation rules
│   └── Makefile.in # generated makefile, ignore
├── INSTALL # generated GNU installation instructions
├── Makefile # generated makefile, ignore
├── Makefile.am # root dir make rules: pkg-config installation etc.
├── Makefile.in # generated makefile, ignore
├── man # manual pages for end-user scripts
│   ├── Makefile # generated makefile, ignore
│   ├── Makefile.am # manual make rules: man installation
│   ├── Makefile.in # generated makefile, ignore
│   ├── omorfi-analyse.1 # man page for analyse
│   ├── omorfi-generate.1 # man page for generate
│   ├── omorfi-hyphenate.1 # man page for hyphenate
│   ├── omorfi-interactive.1 # man page for interactive
│   └── omorfi-spell.1 # man page for spell
├── NEWS # Major changes between releases
├── omorfi.pc # generated pkg-config data
├── omorfi.pc.in # pkg-config settings
├── README # README documentation, everyone reads
├── README.bindist # notes about binary distributions
├── src # all source data and databases
│   ├── add-word.bash # add a new word
│   ├── apertium_formatter.py # functions for formatting apertium style dictionaries
│   ├── attributes # ancillary lexical data
│   ├── change-class.bash # change lexemes' class
│   ├── docs # documents for autogeneration
│   │   ├── paradigms.tsv # documents for paradigms
│   │   └── stuff.tsv # documents for internal symbols
│   ├── experimental_xml_formatter.py # experiment
│   ├── externals # external lexical databases
│   │   ├── fiwikt2omorfi.bash # script to convert fi.wiktionary data to omorfi
│   │   ├── fiwiktionary.bash # another one
│   │   ├── joukahainen.xml # joukahainen database
│   │   └── kotus-sanalista_v1.xml # kotus database
│   ├── find-redundant-lexemes.py # script to remove redundancies
│   ├── ftb3_formatter.py # functions for ftb3 formatting
│   ├── generated # generated build files
│   │   ├── apertium-fin.fin.lexc # apertium style lexc
│   │   ├── apertium-fin.fin.lexc.hfst # compiled apertium lexc
│   │   ├── apertium-fin.fin.twolc # apertium style twolc
│   │   ├── apertium-fin.fin.twolc.hfst # compiled apertium twolc
│   │   ├── errmodel.edit-distance-2.hfst # compiled edit distance 2
│   │   ├── errmodel.edit-distance.hfst # compiled edit distance 1
│   │   ├── errmodel.edit-distance.txt # generated edit distance 1
│   │   ├── fin-automorf.hfst # compiled apertium automaton
│   │   ├── inflections.tsv # generated inflection database
│   │   ├── joint.tsv # generated lexeme database joined with ancillary data
│   │   ├── master.tsv # generated final lexeme database
│   │   ├── omorfi.accept.hfst # generic word-form acceptor
│   │   ├── omorfi-between-tokens.regex # generated tokeniser split symbol
│   │   ├── omorfi-between-tokens.regex.hfst # compiled tokeniser split
│   │   ├── omorfi-ftb3.analyse.hfst # compiled ftb3.1 analyser
│   │   ├── omorfi-ftb3.generate.hfst # compiled ftb3.1 generator
│   │   ├── omorfi-ftb3.lexc # generated ftb3.1 style lexc
│   │   ├── omorfi-ftb3.lexc.hfst # compiled ftb3.1 lexc
│   │   ├── omorfi-ftb3.reweight # generated ftb3.1 simple weights
│   │   ├── omorfi-ftb3-rewrite-tags.regex # generated ftb3.1 tagging hacks
│   │   ├── omorfi-ftb3-rewrite-tags.regex.hfst # compiled ftb3.1 tagging hacks
│   │   ├── omorfi.hyphenate.hfst # compiled hyphenation dictionary
│   │   ├── omorfi-hyphenate.twolc # generated hyphenation rules
│   │   ├── omorfi-hyphenate.twolc.hfst # compiled hyphenation rules
│   │   ├── omorfi-hyphens.twolc # obligatory hyphenation rules
│   │   ├── omorfi-hyphens.twolc.hfst # compiled oblig. hyphenation
│   │   ├── omorfi.lemmatise.hfst # compiled lemmatiser
│   │   ├── omorfi.lexc # generated generic lexc
│   │   ├── omorfi.lexc.hfst # compiled generic lexc
│   │   ├── omorfi.nondict-token.hfst # compiled tokeniser for out-of-vocabulary (OOV) tokens
│   │   ├── omorfi-omor.analyse.hfst # compiled omor-style analyser
│   │   ├── omorfi-omor.generate.hfst # compiled omor-style generator
│   │   ├── omorfi-omor.lexc # generated omor-style lexc
│   │   ├── omorfi-omor.lexc.hfst # compiled omor-style lexc
│   │   ├── omorfi-omor.reweight # generated omor-style simple weights
│   │   ├── omorfi-orthographic-variations.regex # generated orthographical rules
│   │   ├── omorfi-orthographic-variations.regex.hfst # compiled orthographical rules
│   │   ├── omorfi-recase-any.twolc # generated recasing rules for free recasing
│   │   ├── omorfi-recase-any.twolc.hfst # compiled free recasing
│   │   ├── omorfi-remove-boundaries.regex # generated special symbol cleanup
│   │   ├── omorfi-remove-boundaries.regex.hfst # compiled special symbol cleanup
│   │   ├── omorfi.segment.hfst # compiled segmentation automaton
│   │   ├── omorfi-sh.regex # generated rules for š
│   │   ├── omorfi-sh.regex.hfst # compiled rules for š
│   │   ├── omorfi.tokenise.hfst # compiled tokeniser
│   │   ├── omorfi-token-joiner.hfst # compiled token-medial symbols
│   │   ├── omorfi-token.regex # generated token patterns
│   │   ├── omorfi-token.regex.hfst # compiled token patterns
│   │   ├── omorfi.token-separator.hfst # compiled token splitters
│   │   ├── omorfi-uppercase-first.twolc # generated uppercasing rule for initial uppercase
│   │   ├── omorfi-uppercase-first.twolc.hfst # compiled initial uppercasing
│   │   ├── omorfi-zh.regex # generated rules for ž 
│   │   ├── omorfi-zh.regex.hfst # compiled rules for ž
│   │   ├── stemparts.tsv # generated database for stem variants
│   │   ├── temporary.ftb3.hfst # final step for ftb3.1 compilation
│   │   ├── temporary-ftb3.hyphenated.hfst 
│   │   ├── temporary-ftb3.orth.hfst
│   │   ├── temporary-ftb3.relaxed.hfst
│   │   ├── temporary-ftb3.tagged.hfst
│   │   ├── temporary-ftb3.tagweighted.hfst
│   │   ├── temporary-ftb3.unbounded.hfst
│   │   ├── temporary.hyphenated.hfst
│   │   ├── temporary.omor.hfst # final step in omor compilation
│   │   ├── temporary-omor.hyphenated.hfst
│   │   ├── temporary-omor.orth.hfst
│   │   ├── temporary-omor.relaxed.hfst
│   │   ├── temporary-omor.tagweighted.hfst
│   │   ├── temporary-omor.tokenweighted.hfst
│   │   ├── temporary-omor.unbounded.hfst
│   │   ├── temporary-omor.weighted.hfst
│   │   ├── temporary.orth.hfst
│   │   ├── temporary.relaxed.hfst
│   │   ├── temporary.tagged.hfst
│   │   ├── temporary.tagweighted.hfst
│   │   ├── temporary.tokenweighted.hfst
│   │   ├── temporary.unbounded.hfst
│   │   ├── temporary.weighted.hfst
│   │   └── timestamp # hack for autotoolsing generated directory
│   ├── generate-edit-distance.py # script to generate edit distances
│   ├── generate-googlecodewiki.py # script to generate wiki pages from database
│   ├── generate-kotus-sanalista.py # script to generate kotus-sanalista style XML
│   ├── generate-lexcs.py # script to generate lexc files
│   ├── generate-monodix.py # script to generate apertium monodix
│   ├── generate-regexes.py # script to generate regex rules
│   ├── generate-reweights.py # script to generate simple weighting
│   ├── generate-twolcs.py # script to generate twolc rules
│   ├── generate-yaml.py # script to generate yaml tests
│   ├── giella_formatter.py # functions for formatting giellatekno style data
│   ├── gradation.py # python rules for finding gradation in dictionary forms
│   ├── guess-csv2tsv.py # script for old to new database changes
│   ├── guess_feats.py # python rules for guessing features from dictionary form
│   ├── guess_new_class.py # python rules for guessing paradigm from dictionary word
│   ├── kotus_sanalista_formatter.py # functions for formatting kotus-sanalista XML
│   ├── lexc_formatter.py # functions for formatting lexc
│   ├── lexemes # lexical databases
│   │   ├── boundaries.tsv # known intra-word boundaries: compound etc.
│   │   ├── broken-paradigms.tsv # paradigms with wrong classes in dictionaries
│   │   ├── origin.tsv # original database of the lexemes
│   │   ├── particle-classes.tsv # particles' semantics and syntax
│   │   ├── plurale-tantum.tsv # plurale tantum words
│   │   ├── possessives.tsv # particles with possessives
│   │   ├── pronoun-classes.tsv # pronouns' semantics
│   │   ├── pronunciation.tsv # orthography to phonemics mismatches
│   │   ├── proper-classes.tsv # proper nouns' semantics
│   │   ├── semantic.tsv # nouns' semantics
│   │   ├── subcategories.tsv # unorganised data
│   │   ├── symbol-classes.tsv # symbols' semantics
│   │   ├── usage.tsv # special usage: dialects, non-standard, etc.
│   │   ├── verb-arguments.tsv # verbs' syntax
│   │   └── lexemes.tsv # the lexical database
│   ├── Makefile # generated makefile
│   ├── Makefile.am # central make rules for everything
│   ├── Makefile.in # generated makefile
│   ├── omorfi_settings.py # general settings for Finnish language and special symbols
│   ├── omor_formatter.py # functions for formatting omor style data
│   ├── omor_strings_io.py # functions for string and i/o handling, error messaging etc.
│   ├── paradigms # paradigm data (per paradigm)
│   │   └── paradigm-data.tsv # paradigm-specific data
│   ├── parse_csv_data.py # functions for parsing tsv databases
│   ├── plurale_tantum.py # python rules for guessing singular from plural
│   ├── __pycache__ # python's generated cache to ignore
│   │   ├── apertium_formatter.cpython-34.pyc
│   │   ├── ftb3_formatter.cpython-34.pyc
│   │   ├── gradation.cpython-34.pyc
│   │   ├── guess_feats.cpython-34.pyc
│   │   ├── guess_new_class.cpython-34.pyc
│   │   ├── lexc_formatter.cpython-34.pyc
│   │   ├── omorfi_settings.cpython-34.pyc
│   │   ├── omor_formatter.cpython-34.pyc
│   │   ├── omor_strings_io.cpython-34.pyc
│   │   ├── parse_csv_data.cpython-34.pyc
│   │   ├── plurale_tantum.cpython-34.pyc
│   │   ├── regex_formatter.cpython-34.pyc
│   │   ├── stub.cpython-34.pyc
│   │   ├── tagset_formatter.cpython-34.pyc
│   │   ├── twolc_formatter.cpython-34.pyc
│   │   └── wordmap.cpython-34.pyc
│   ├── regex_formatter.py # functions for formatting xerox regexes
│   ├── remove-word.bash # remove word from databases
│   ├── scripts # end-user scripts
│   │   ├── convert_tag_format.py # scripts for converting tags
│   │   ├── generate-wordforms.sh # generate all forms of word
│   │   ├── Makefile.am # Make rules for end-user script installation
│   │   ├── omor2apertium.sed # sed rules for turning omor style to apertium
│   │   ├── omor2apertium.sh # script for turning omor style to apertium
│   │   ├── omorfi-analyse.sh # generated analyse script
│   │   ├── omorfi-analyse.sh.in # analyse script sources
│   │   ├── omorfi-generate.sh # generated generate script
│   │   ├── omorfi-generate.sh.in # generation script sources
│   │   ├── omorfi-hyphenate.sh # generated hyphenation script
│   │   ├── omorfi-hyphenate.sh.in # hyphenation script sources
│   │   ├── omorfi-interactive.sh # generated tokenised analysis
│   │   ├── omorfi-interactive.sh.in # tokenised analysis sources
│   │   ├── omorfi.py # python library for omorfi
│   │   ├── omorfi-spell.sh # generated spelling script
│   │   └── omorfi-spell.sh.in # spelling script sources
│   ├── set-attribute.bash # set attribute of a lexeme
│   ├── stub.py # python rules to turn dictionary form into invariant stub
│   ├── stub-stem-inflection # stem and suffix morph databases
│   │   ├── 51-stems.tsv # full-form lists of compounds with irregular agreeing inflection
│   │   ├── acro-inflections.tsv # suffix morphs for acronyms
│   │   ├── acronym-stems.tsv # stem morphs for acronyms
│   │   ├── adjective-inflections.tsv # suffix morphs for adjectives
│   │   ├── adjective-stems.tsv # stem morphs for adjectives
│   │   ├── digit-inflections.tsv # suffix morphs for digits
│   │   ├── digit-stems.tsv # stem morphs for digits
│   │   ├── digit-stubs.tsv # root morphs for digits
│   │   ├── noun-inflections.tsv # suffix morphs for nouns
│   │   ├── noun-stems.tsv # stem morphs for nouns
│   │   ├── numeral-inflections.tsv # suffix morphs for numerals
│   │   ├── numeral-stems.tsv # stem morphs for numerals
│   │   ├── particle-inflections.tsv # suffix morphs for particles
│   │   ├── particle-stems.tsv # stem morphs for particles
│   │   ├── pronoun-inflections.tsv # suffix morphs for pronouns
│   │   ├── pronoun-stems.tsv # stem morphs for pronouns
│   │   ├── symbol-inflections.tsv # suffix morphs for symbols
│   │   ├── symbol-stems.tsv # stem morphs for symbols
│   │   ├── verb-inflections.tsv # suffix morphs for verbs
│   │   └── verb-stems.tsv # stem morphs for verbs
│   ├── tdt_formatter.py # functions for TDT formatting
│   ├── tsv_expand.py # script for guessing missing data in lexical database
│   ├── tsvjoin.py # script for tsv database joins for lexemes and ancillary databases
│   ├── twolc_formatter.py # functions for formatting twol rules
│   ├── voikko # generated spell-checker 
│   │   ├── acceptor.default.hfst # default dictionary
│   │   ├── errmodel.default.hfst # default error model
│   │   ├── index.xml # generated metadata
│   │   ├── index.xml.in # metadata source 
│   │   └── speller-omorfi.zhfst # generated spell-checker package
│   └── wordmap.py # functions for python dict version of database
├── test # test scripts
│   ├── clusterstuff.pbs.in # source for torque cluster script
│   ├── corps-to-googlecodewiki.sh # script to generate wiki page from test results
│   ├── corpus-measures.sh # script for testing large corpora for coverage etc.
│   ├── corpus-tests.py # scripts for testing large corpora for errors in database
│   ├── count_tsv.awk #
│   ├── europarl-coverage.sh # generated europarl coverage script
│   ├── europarl-coverage.sh.in # sources for europarl coverage measure
│   ├── find_errs.awk # 
│   ├── fiwiki-coverage.sh # generated fi.wiki coverage script
│   ├── fiwiki-coverage.sh.in # sources for fi.wiki coverage measure
│   ├── ftb31-coverage.sh # generated ftb3.1 coverage script 
│   ├── ftb31-coverage.sh.in # sources for ftb3.1 coverage script
│   ├── ftb-test.py # ftb3.1 faithfulness script
│   ├── ftb-test.sh # ftb3.1 faithfulness wrapper script
│   ├── ftc-test.py # FTC faithfulness script (NON-FREE data!)
│   ├── ftc-test.sh # FTC wrapper script (NON-FREE data!)
│   ├── gutenberg-coverage.sh # generated gutenberg coverage script
│   ├── gutenberg-coverage.sh.in # sources for gutenberg coverage script
│   ├── jrc-acquis-coverage.sh # generated jrc acquis coverage script
│   ├── jrc-acquis-coverage.sh.in # sources for jrc acquis coverage script
│   ├── Makefile # generated makefile
│   ├── Makefile.am # make rules for testing
│   ├── Makefile.in # generated makefile
│   ├── prop-corpus-tests.py # script for lexical data errors on proper nouns
│   ├── rough-tests.sh # fast simple tests for workability of analysers
│   ├── scripts-runnable.sh # test script for workability of end-user scripts
│   ├── speed-test.sh # generated speed test script 
│   ├── speed-test.sh.in # sources for speed test script
│   ├── test-header.yaml # generated headers for yaml testing
│   ├── test-header.yaml.in # sources for yaml test headers
│   └── wordforms.list # list of word-forms that should always be analysed
├── THANKS # list of all contributors and benefactors (not legally binding)
└── TODO # list of things broken and missing
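
Many entries in the file tree come in pairs: a hand-edited source and the file generated from it. In an autotools project, `configure` (via `config.status`) writes `X` from a template `X.in`, and automake writes `Makefile.in` from `Makefile.am`. The sketch below illustrates that naming convention only; `template_for` is a hypothetical helper that is not part of omorfi, and the file set is a small sample copied from the listing.

```python
def template_for(name, siblings):
    """Return the file that `name` is generated from, or None if hand-written."""
    # configure/config.status: X is generated from X.in
    candidate = name + ".in"
    if candidate in siblings:
        return candidate
    # automake: Makefile.in is itself generated from Makefile.am
    if name.endswith(".in") and name[:-3] + ".am" in siblings:
        return name[:-3] + ".am"
    return None


# Sample file names taken from the listing above.
files = {
    "omorfi-analyse.sh", "omorfi-analyse.sh.in",
    "Makefile", "Makefile.am", "Makefile.in",
    "omorfi.py",
}

for name in sorted(files):
    source = template_for(name, files)
    status = "generated from " + source if source else "edit this file"
    print(name + ": " + status)
```

When a file is reported as generated, changes belong in the source it names (for example `omorfi-analyse.sh.in` or `Makefile.am`), since the generated copy is overwritten on the next build.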