omorfi

Open morphology for Finnish

Project maintained by flammie

Directory layout in omorfi

Outdated

Introduction

This document helps omorfi developers and contributors understand the directory structure of the code and the files it contains.

Details

Directory tree:

.  # Top-level directory contains necessary autotools stuff
├── config-aux  # generated by autotools, DO NOT TOUCH
├── doc # hand-written documentation may be added here
├── man # all end-user command-line tools have manual pages here
├── autom4te.cache # generated by autotools, DO NOT TOUCH
├── src # all code and data reside here, along with the build scripts
│   ├── attributes # lexical data (per lexeme joins)
│   ├── bash # bash scripts
│   ├── continuations # Morphology as modeled in morph-concatenation model
│   ├── docs # in-source documents and wiki pages generated from these
│   ├── examples # examples for tests and documentation
│   ├── externals # lexical data outside omorfi and scripts to handle it
│   ├── generated # holds build process generated files
│   ├── java # java API
│   │   └── com # obligatory JAVA dir struct.
│   │       ├── github # obligatory JAVA dir struct.
│   │       │   └── flammie # obligatory JAVA dir struct.
│   │       │       └── omorfi # obligatory JAVA dir struct.
│   ├── paradigms # paradigm data (per paradigm joins)
│   ├── perl # perl scripts
│   ├── python # python scripts
│   │   └── omorfi # python API
│   │       └── __pycache__ # python may generate these
│   ├── scripts # installable scripts for end-user applications
│   └── voikko # build directory for spell-checker
└── test # Automatic test scripts
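
Because this page is marked outdated, it can help to compare the listing against an actual checkout. The following is a minimal sketch, assuming Python 3 and a local clone. The helper `missing_dirs` and the `EXPECTED_DIRS` selection are hypothetical and not part of omorfi; only directory names taken from the tree above are used.

```python
from pathlib import Path

# A selection of directory names copied from the tree above.
# The check itself is an illustrative assumption, not an omorfi tool.
EXPECTED_DIRS = [
    "doc",
    "man",
    "src",
    "test",
    "src/attributes",
    "src/continuations",
    "src/paradigms",
    "src/scripts",
]


def missing_dirs(root, expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under root."""
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]
```

Calling `missing_dirs("path/to/omorfi")` returns an empty list when every listed directory is still present. Anything it returns marks a part of this page that no longer matches the repository.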

File tree:

.
├── aclocal.m4 # aclocal generates, ignore
├── AUTHORS # list of authors relevant for copyright issues
├── autogen.sh # use to generate autotools setup if autoreconf fails
├── autom4te.cache # autotools generates, ignore
│   ├── output.0   # ''
│   ├── output.1   # ''
│   ├── requests   # ''
│   ├── traces.0   # ''
│   └── traces.1   # ''
├── ChangeLog.old  # ChangeLog prior to moving to git
├── config-aux     # autotools generates, ignore
│   ├── install-sh # ''
│   ├── missing    # ''
│   └── test-driver # ''
├── config.log # autoconf log, read when configure fails
├── config.status # autoconf status, run to remake current settings
├── configure # configure script, run to initialise or change settings
├── configure.ac # autoconf configuration of configure
├── COPYING # GNU GPL licence
├── doc # doc tree
│   ├── Makefile # generated makefile, ignore
│   ├── Makefile.am # doc compilation rules
│   └── Makefile.in # generated makefile, ignore
├── INSTALL # generated GNU installation instructions
├── Makefile # generated makefile, ignore
├── Makefile.am # root dir make rules: pkg-config installation etc.
├── Makefile.in # generated makefile, ignore
├── man # manual pages for end-user scripts
│   ├── Makefile # generated makefile, ignore
│   ├── Makefile.am # manual make rules: man installation
│   ├── Makefile.in # generated makefile, ignore
│   ├── omorfi-analyse.1 # man page for analyse
│   ├── omorfi-generate.1 # man page for generate
│   ├── omorfi-hyphenate.1 # man page for hyphenate
│   ├── omorfi-interactive.1 # man page for interactive
│   └── omorfi-spell.1 # man page for spell
├── NEWS # Major changes between releases
├── omorfi.pc # generated pkg-config data
├── omorfi.pc.in # pkg-config settings
├── README # README documentation, everyone reads
├── README.bindist # notes about binary distributions
├── src # all source data and databases
│   ├── add-word.bash # add a new word
│   ├── apertium_formatter.py # functions for formatting apertium style dictionaries
│   ├── attributes # ancillary lexical data
│   ├── change-class.bash # change lexemes' class
│   ├── docs # documents for autogeneration
│   │   ├── paradigms.tsv # documents for paradigms
│   │   └── stuff.tsv # documents for internal symbols
│   ├── experimental_xml_formatter.py # experiment
│   ├── externals # external lexical databases
│   │   ├── fiwikt2omorfi.bash # script to convert fi.wiktionary data to omorfi
│   │   ├── fiwiktionary.bash # another fi.wiktionary conversion script
│   │   ├── joukahainen.xml # joukahainen database
│   │   └── kotus-sanalista_v1.xml # kotus database
│   ├── find-redundant-lexemes.py # script to remove redundancies
│   ├── ftb3_formatter.py # functions for ftb3 formatting
│   ├── generated # generated build files
│   │   ├── apertium-fin.fin.lexc # apertium style lexc
│   │   ├── apertium-fin.fin.lexc.hfst # compiled apertium lexc
│   │   ├── apertium-fin.fin.twolc # apertium style twolc
│   │   ├── apertium-fin.fin.twolc.hfst # compiled apertium twolc
│   │   ├── errmodel.edit-distance-2.hfst # compiled edit distance 2
│   │   ├── errmodel.edit-distance.hfst # compiled edit distance 1
│   │   ├── errmodel.edit-distance.txt # generated edit distance 1
│   │   ├── fin-automorf.hfst # compiled apertium automaton
│   │   ├── inflections.tsv # generated inflection database
│   │   ├── joint.tsv # generated lexeme database joined with ancillary data
│   │   ├── master.tsv # generated final lexeme database
│   │   ├── omorfi.accept.hfst # generic word-form acceptor
│   │   ├── omorfi-between-tokens.regex # generated tokeniser split symbol
│   │   ├── omorfi-between-tokens.regex.hfst # compiled tokeniser split
│   │   ├── omorfi-ftb3.analyse.hfst # compiled ftb3.1 analyser
│   │   ├── omorfi-ftb3.generate.hfst # compiled ftb3.1 generator
│   │   ├── omorfi-ftb3.lexc # generated ftb3.1 style lexc
│   │   ├── omorfi-ftb3.lexc.hfst # compiled ftb3.1 lexc
│   │   ├── omorfi-ftb3.reweight # generated ftb3.1 simple weights
│   │   ├── omorfi-ftb3-rewrite-tags.regex # generated ftb3.1 tagging hacks
│   │   ├── omorfi-ftb3-rewrite-tags.regex.hfst # compiled ftb3.1 tagging hacks
│   │   ├── omorfi.hyphenate.hfst # compiled hyphenation dictionary
│   │   ├── omorfi-hyphenate.twolc # generated hyphenation rules
│   │   ├── omorfi-hyphenate.twolc.hfst # compiled hyphenation rules
│   │   ├── omorfi-hyphens.twolc # obligatory hyphenation rules
│   │   ├── omorfi-hyphens.twolc.hfst # compiled oblig. hyphenation
│   │   ├── omorfi.lemmatise.hfst # compiled lemmatiser
│   │   ├── omorfi.lexc # generated generic lexc
│   │   ├── omorfi.lexc.hfst # compiled generic lexc
│   │   ├── omorfi.nondict-token.hfst # compiled tokeniser for oov's
│   │   ├── omorfi-omor.analyse.hfst # compiled omor-style analyser
│   │   ├── omorfi-omor.generate.hfst # compiled omor-style generator
│   │   ├── omorfi-omor.lexc # generated omor-style lexc
│   │   ├── omorfi-omor.lexc.hfst # compiled omor-style lexc
│   │   ├── omorfi-omor.reweight # generated omor-style simple weights
│   │   ├── omorfi-orthographic-variations.regex # generated orthographical rules
│   │   ├── omorfi-orthographic-variations.regex.hfst # compiled orthographical rules
│   │   ├── omorfi-recase-any.twolc # generated recasing rules for free recasing
│   │   ├── omorfi-recase-any.twolc.hfst # compiled free recasing
│   │   ├── omorfi-remove-boundaries.regex # generated special symbol cleanup
│   │   ├── omorfi-remove-boundaries.regex.hfst # compiled special symbol cleanup
│   │   ├── omorfi.segment.hfst # compiled segmentation automaton
│   │   ├── omorfi-sh.regex # rules for š
│   │   ├── omorfi-sh.regex.hfst # compiled rules for š
│   │   ├── omorfi.tokenise.hfst # compiled tokeniser
│   │   ├── omorfi-token-joiner.hfst # compiled token-medial symbols
│   │   ├── omorfi-token.regex # generated token patterns
│   │   ├── omorfi-token.regex.hfst # compiled token patterns
│   │   ├── omorfi.token-separator.hfst # compiled token splitters
│   │   ├── omorfi-uppercase-first.twolc # generated uppercasing rule for initial uppercase
│   │   ├── omorfi-uppercase-first.twolc.hfst # compiled initial uppercasing
│   │   ├── omorfi-zh.regex # generated rules for ž
│   │   ├── omorfi-zh.regex.hfst # compiled rules for ž
│   │   ├── stemparts.tsv # generated database for stem variants
│   │   ├── temporary.ftb3.hfst # final step for ftb3.1 compilation
│   │   ├── temporary-ftb3.hyphenated.hfst
│   │   ├── temporary-ftb3.orth.hfst
│   │   ├── temporary-ftb3.relaxed.hfst
│   │   ├── temporary-ftb3.tagged.hfst
│   │   ├── temporary-ftb3.tagweighted.hfst
│   │   ├── temporary-ftb3.unbounded.hfst
│   │   ├── temporary.hyphenated.hfst
│   │   ├── temporary.omor.hfst # final step in omor compilation
│   │   ├── temporary-omor.hyphenated.hfst
│   │   ├── temporary-omor.orth.hfst
│   │   ├── temporary-omor.relaxed.hfst
│   │   ├── temporary-omor.tagweighted.hfst
│   │   ├── temporary-omor.tokenweighted.hfst
│   │   ├── temporary-omor.unbounded.hfst
│   │   ├── temporary-omor.weighted.hfst
│   │   ├── temporary.orth.hfst
│   │   ├── temporary.relaxed.hfst
│   │   ├── temporary.tagged.hfst
│   │   ├── temporary.tagweighted.hfst
│   │   ├── temporary.tokenweighted.hfst
│   │   ├── temporary.unbounded.hfst
│   │   ├── temporary.weighted.hfst
│   │   └── timestamp # hack for autotoolsing generated directory
│   ├── generate-edit-distance.py # script to generate edit distances
│   ├── generate-googlecodewiki.py # script to generate wiki pages from database
│   ├── generate-kotus-sanalista.py # script to generate kotus-sanalista style XML
│   ├── generate-lexcs.py # script to generate lexc files
│   ├── generate-monodix.py # script to generate apertium monodix
│   ├── generate-regexes.py # script to generate regex rules
│   ├── generate-reweights.py # script to generate simple weighting
│   ├── generate-twolcs.py # script to generate twolc rules
│   ├── generate-yaml.py # script to generate yaml tests
│   ├── giella_formatter.py # functions for formatting giellatekno style data
│   ├── gradation.py # python rules for finding gradation in dictionary forms
│   ├── guess-csv2tsv.py # script for old to new database changes
│   ├── guess_feats.py # python rules for guessing features from dictionary form
│   ├── guess_new_class.py # python rules for guessing paradigm from dictionary word
│   ├── kotus_sanalista_formatter.py # functions for formatting kotus-sanalista XML
│   ├── lexc_formatter.py # functions for formatting lexc
│   ├── lexemes # lexical databases
│   │   ├── boundaries.tsv # known intra-word boundaries: compound etc.
│   │   ├── broken-paradigms.tsv # paradigms with wrong classes in dictionaries
│   │   ├── origin.tsv # original database of the lexemes
│   │   ├── particle-classes.tsv # particles' semantics and syntax
│   │   ├── plurale-tantum.tsv # plurale tantum words
│   │   ├── possessives.tsv # particles with possessives
│   │   ├── pronoun-classes.tsv # pronouns' semantics
│   │   ├── pronunciation.tsv # orthography to phonemics mismatches
│   │   ├── proper-classes.tsv # proper nouns' semantics
│   │   ├── semantic.tsv # nouns' semantics
│   │   ├── subcategories.tsv # unorganised data
│   │   ├── symbol-classes.tsv # symbols' semantics
│   │   ├── usage.tsv # special usage: dialects, non-standard, etc.
│   │   ├── verb-arguments.tsv # verbs' syntax
│   │   └── lexemes.tsv # the lexical database
│   ├── Makefile # generated makefile
│   ├── Makefile.am # central make rules for everything
│   ├── Makefile.in # generated makefile
│   ├── omorfi_settings.py # general settings for Finnish language and special symbols
│   ├── omor_formatter.py # functions for formatting omor style data
│   ├── omor_strings_io.py # functions for string and i/o handling, error messaging etc.
│   ├── paradigms # paradigm data (per paradigm)
│   │   └── paradigm-data.tsv # paradigm-specific data
│   ├── parse_csv_data.py # functions for parsing tsv databases
│   ├── plurale_tantum.py # python rules for guessing singular from plural
│   ├── __pycache__ # python's generated cache to ignore
│   │   ├── apertium_formatter.cpython-34.pyc
│   │   ├── ftb3_formatter.cpython-34.pyc
│   │   ├── gradation.cpython-34.pyc
│   │   ├── guess_feats.cpython-34.pyc
│   │   ├── guess_new_class.cpython-34.pyc
│   │   ├── lexc_formatter.cpython-34.pyc
│   │   ├── omorfi_settings.cpython-34.pyc
│   │   ├── omor_formatter.cpython-34.pyc
│   │   ├── omor_strings_io.cpython-34.pyc
│   │   ├── parse_csv_data.cpython-34.pyc
│   │   ├── plurale_tantum.cpython-34.pyc
│   │   ├── regex_formatter.cpython-34.pyc
│   │   ├── stub.cpython-34.pyc
│   │   ├── tagset_formatter.cpython-34.pyc
│   │   ├── twolc_formatter.cpython-34.pyc
│   │   └── wordmap.cpython-34.pyc
│   ├── regex_formatter.py # functions for formatting xerox regexes
│   ├── remove-word.bash # remove word from databases
│   ├── scripts # end-user scripts
│   │   ├── convert_tag_format.py # scripts for converting tags
│   │   ├── generate-wordforms.sh # generate all forms of word
│   │   ├── Makefile.am # Make rules for end-user script installation
│   │   ├── omor2apertium.sed # sed rules for turning omor style to apertium
│   │   ├── omor2apertium.sh # script for turning omor style to apertium
│   │   ├── omorfi-analyse.sh # generated analyse script
│   │   ├── omorfi-analyse.sh.in # analyse script sources
│   │   ├── omorfi-generate.sh # generated generate script
│   │   ├── omorfi-generate.sh.in # generation script sources
│   │   ├── omorfi-hyphenate.sh # generated hyphenation script
│   │   ├── omorfi-hyphenate.sh.in # hyphenation script sources
│   │   ├── omorfi-interactive.sh # generated tokenised analysis
│   │   ├── omorfi-interactive.sh.in # tokenised analysis sources
│   │   ├── omorfi.py # python library for omorfi
│   │   ├── omorfi-spell.sh # generated spelling script
│   │   └── omorfi-spell.sh.in # spelling script sources
│   ├── set-attribute.bash # set attribute of a lexeme
│   ├── stub.py # python rules to turn dictionary form into invariant stub
│   ├── stub-stem-inflection # stem and suffix morph databases
│   │   ├── 51-stems.tsv # full-form lists of compounds with irregular agreeing inflection
│   │   ├── acro-inflections.tsv # suffix morphs for acronyms
│   │   ├── acronym-stems.tsv # stem morphs for acronyms
│   │   ├── adjective-inflections.tsv # suffix morphs for adjectives
│   │   ├── adjective-stems.tsv # stem morphs for adjectives
│   │   ├── digit-inflections.tsv # suffix morphs for digits
│   │   ├── digit-stems.tsv # stem morphs for digits
│   │   ├── digit-stubs.tsv # root morphs for digits
│   │   ├── noun-inflections.tsv # suffix morphs for nouns
│   │   ├── noun-stems.tsv # stem morphs for nouns
│   │   ├── numeral-inflections.tsv # suffix morphs for numerals
│   │   ├── numeral-stems.tsv # stem morphs for numerals
│   │   ├── particle-inflections.tsv # suffix morphs for particles
│   │   ├── particle-stems.tsv # stem morphs for particles
│   │   ├── pronoun-inflections.tsv # suffix morphs for pronouns
│   │   ├── pronoun-stems.tsv # stem morphs for pronouns
│   │   ├── symbol-inflections.tsv # suffix morphs for symbols
│   │   ├── symbol-stems.tsv # stem morphs for symbols
│   │   ├── verb-inflections.tsv # suffix morphs for verbs
│   │   └── verb-stems.tsv # stem morphs for verbs
│   ├── tdt_formatter.py # functions for TDT formatting
│   ├── tsv_expand.py # script for guessing missing data in lexical database
│   ├── tsvjoin.py # script for tsv database joins for lexemes and ancillary databases
│   ├── twolc_formatter.py # functions for formatting twol rules
│   ├── voikko # generated spell-checker
│   │   ├── acceptor.default.hfst # default dictionary
│   │   ├── errmodel.default.hfst # default error model
│   │   ├── index.xml # generated metadata
│   │   ├── index.xml.in # metadata source
│   │   └── speller-omorfi.zhfst # generated spell-checker package
│   └── wordmap.py # functions for python dict version of database
├── test # test scripts
│   ├── clusterstuff.pbs.in # source for torque cluster script
│   ├── corps-to-googlecodewiki.sh # script to generate wiki page from test results
│   ├── corpus-measures.sh # script for testing large corpora for coverage etc.
│   ├── corpus-tests.py # scripts for testing large corpora for errors in database
│   ├── count_tsv.awk #
│   ├── europarl-coverage.sh # generated europarl coverage script
│   ├── europarl-coverage.sh.in # sources for europarl coverage measure
│   ├── find_errs.awk #
│   ├── fiwiki-coverage.sh # generated fi.wiki coverage script
│   ├── fiwiki-coverage.sh.in # sources for fi.wiki coverage measure
│   ├── ftb31-coverage.sh # generated ftb3.1 coverage script
│   ├── ftb31-coverage.sh.in # sources for ftb3.1 coverage script
│   ├── ftb-test.py # ftb3.1 faithfulness script
│   ├── ftb-test.sh # ftb3.1 faithfulness wrapper script
│   ├── ftc-test.py # FTC faithfulness script (NON-FREE data!)
│   ├── ftc-test.sh # FTC wrapper script (NON-FREE data!)
│   ├── gutenberg-coverage.sh # generated gutenberg coverage script
│   ├── gutenberg-coverage.sh.in # sources for gutenberg coverage script
│   ├── jrc-acquis-coverage.sh # generated jrc acquis coverage script
│   ├── jrc-acquis-coverage.sh.in # sources for jrc acquis coverage script
│   ├── Makefile # generated makefile
│   ├── Makefile.am # make rules for testing
│   ├── Makefile.in # generated makefile
│   ├── prop-corpus-tests.py # script for lexical data errors on proper nouns
│   ├── rough-tests.sh # fast simple tests for workability of analysers
│   ├── scripts-runnable.sh # test script for workability of end-user scripts
│   ├── speed-test.sh # generated speed test script
│   ├── speed-test.sh.in # sources for speed test script
│   ├── test-header.yaml # generated headers for yaml testing
│   ├── test-header.yaml.in # sources for yaml test headers
│   └── wordforms.list # list of word-forms that should always be analysed
├── THANKS # list of all contributors and benefactors (not legally binding)
└── TODO # list of things broken and missing
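
The comments in the file tree follow a few recurring patterns. Files under `generated`, `autom4te.cache`, `config-aux` and `__pycache__` are build output, `Makefile` and `Makefile.in` are generated from `Makefile.am`, compiled automata end in `.hfst` or `.zhfst`, and a file such as `omorfi-analyse.sh` is generated when a sibling `omorfi-analyse.sh.in` exists. The sketch below encodes those patterns so that a contributor can tell which files are safe to edit by hand. The function `is_generated` and its rule sets are assumptions inferred from the listing, not something omorfi ships.

```python
from pathlib import PurePosixPath

# Rules inferred from the comments in the file tree above.
# They are illustrative assumptions and may not cover every file.
GENERATED_DIRS = {"autom4te.cache", "config-aux", "generated", "__pycache__"}
GENERATED_NAMES = {"Makefile", "Makefile.in", "aclocal.m4"}
COMPILED_SUFFIXES = (".hfst", ".zhfst")


def is_generated(path, all_paths):
    """Guess whether a file is build output rather than hand-maintained.

    path is a POSIX-style relative file path; all_paths is the set of all
    file paths in the tree, used to look for a sibling "<path>.in" template.
    """
    p = PurePosixPath(path)
    # Any file below a directory that the listing marks as generated.
    if any(part in GENERATED_DIRS for part in p.parts[:-1]):
        return True
    # Well-known autotools output files.
    if p.name in GENERATED_NAMES:
        return True
    # Compiled automata and spell-checker packages.
    if p.name.endswith(COMPILED_SUFFIXES):
        return True
    # Output of a "<name>.in" template sitting next to it.
    return path + ".in" in all_paths
```

For example, with `paths = {"src/scripts/omorfi-analyse.sh", "src/scripts/omorfi-analyse.sh.in"}`, the call `is_generated("src/scripts/omorfi-analyse.sh", paths)` is true, while the `.sh.in` template itself is classified as hand-maintained.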