omorfi-tokenise - Tokenise Finnish text with the help of a morphological dictionary
omorfi-tokenise.py [OPTION] [FILENAME...]
Tokenises text using a morphological dictionary, in addition to the
usual whitespace splitting and punctuation stripping.
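For comparison, the plain baseline mentioned above (whitespace splitting plus punctuation stripping) can be sketched in a few lines of Python. This is only an illustration of that baseline, not code taken from omorfi; the function name naive_tokenise is hypothetical, and the sketch only knows about ASCII punctuation.

```python
import string


def naive_tokenise(text):
    """Split on whitespace, then peel ASCII punctuation off both ends of each chunk.

    Hypothetical helper for illustration only; it uses no dictionary at all.
    """
    tokens = []
    for chunk in text.split():
        # Leading punctuation marks become separate tokens, in order.
        start = 0
        while start < len(chunk) and chunk[start] in string.punctuation:
            tokens.append(chunk[start])
            start += 1
        # Find where the trailing run of punctuation begins.
        end = len(chunk)
        while end > start and chunk[end - 1] in string.punctuation:
            end -= 1
        # Emit the word itself (if any), then each trailing punctuation mark.
        if start < end:
            tokens.append(chunk[start:end])
        tokens.extend(chunk[end:])
    return tokens


if __name__ == "__main__":
    print(naive_tokenise("Juna saapui (myöhässä) Helsinkiin."))
```

A baseline like this always detaches a final full stop, so an abbreviation such as "esim." comes out as two tokens. Cases of that kind are where a dictionary lookup can, in principle, make a better decision than punctuation rules alone.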
-h, --help
show this help message and exit
-f FSAPATH, --fsa FSAPATH
path to directory of HFST format automata
-i INFILE, --input INFILE
source of analysis data
-v, --verbose
print verbosely while processing
-o OUTFILE, --output OUTFILE
print CoNLL-U into OUTFILE
-x STATFILE, --statistics STATFILE
print statistics into STATFILE
-O OUTFORMAT, --output-format OUTFORMAT
format output for OUTFORMAT
If no INFILE is given, input is read from standard input. If no OUTFILE is given, output is written to standard output. OUTFORMAT names one of the supported target applications, e.g. conllu for CoNLL-U or moses for Moses SMT.
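The option handling described above, including the fallback to standard input and standard output, could be expressed with Python's argparse roughly as follows. This is a sketch under assumptions and not the actual source of omorfi-tokenise.py: the function names build_parser and open_streams are hypothetical, and UTF-8 is assumed as the file encoding.

```python
import argparse
import sys


def build_parser():
    """Hypothetical parser mirroring the options listed in this manual page."""
    parser = argparse.ArgumentParser(prog="omorfi-tokenise.py")
    # -h/--help is added by argparse automatically.
    parser.add_argument("-f", "--fsa", metavar="FSAPATH",
                        help="path to directory of HFST format automata")
    parser.add_argument("-i", "--input", metavar="INFILE",
                        help="source of analysis data")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print verbosely while processing")
    parser.add_argument("-o", "--output", metavar="OUTFILE",
                        help="write output into OUTFILE")
    parser.add_argument("-x", "--statistics", metavar="STATFILE",
                        help="print statistics into STATFILE")
    parser.add_argument("-O", "--output-format", metavar="OUTFORMAT",
                        help="format output for OUTFORMAT")
    return parser


def open_streams(args):
    """Fall back to stdin/stdout when no INFILE/OUTFILE was given."""
    # Encoding is an assumption made for this sketch.
    infile = open(args.input, encoding="utf-8") if args.input else sys.stdin
    outfile = (open(args.output, "w", encoding="utf-8")
               if args.output else sys.stdout)
    return infile, outfile


if __name__ == "__main__":
    options = build_parser().parse_args()
    source, sink = open_streams(options)
    if options.verbose:
        print("reading from", source.name, file=sys.stderr)
```

The sketch deliberately does not restrict OUTFORMAT to a fixed set of values, since this page only gives conllu and moses as examples.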
The following command tokenises a raw text corpus:
omorfi-tokenise.py -i rautatie.text -o rautatie.tokens
Copyright
© 2016 Omorfi contributors
Licence GPLv3: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and
redistribute it. There is NO WARRANTY, to the extent
permitted by law.