OMORFI-TOKENISE

NAME

omorfi-tokenise − Tokenise Finnish text with the help of a morphological dictionary

SYNOPSIS

omorfi-tokenise.py [OPTION] [FILENAME...]

DESCRIPTION

Tokenises text using a morphological dictionary in addition to the usual whitespace splitting and punctuation stripping.

−h, −−help

show this help message and exit

−f FSAPATH, −−fsa FSAPATH

Path to directory of HFST format automata

−i INFILE, −−input INFILE

source of analysis data

−v, −−verbose

print verbosely while processing

−o OUTFILE, −−output OUTFILE

print conll-u into OUTFILE

−x STATFILE, −−statistics STATFILE

print statistics into STATFILE

−O OUTFORMAT, −−output-format OUTFORMAT

format output for OUTFORMAT

If no INFILE is given, input is read from standard input. If no OUTFILE is given, output is written to standard output. OUTFORMAT should name one of the supported end applications, e.g. conllu for CoNLL-U or moses for Moses SMT.
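
The idea of combining whitespace splitting and punctuation stripping with a dictionary lookup can be illustrated with a minimal sketch. This is not the omorfi implementation: the real tool consults the HFST automata given with −f, whereas here a small Python set (KNOWN_WORDS, a hypothetical stand-in) plays the role of the dictionary, and the function name tokenise is likewise an assumption made for illustration.

```python
import string

# Hypothetical stand-in for the morphological dictionary. The real tool
# looks words up in HFST automata rather than in a Python set.
KNOWN_WORDS = {"esim.", "n.", "jne."}

# Characters treated as punctuation in this sketch.
PUNCT = string.punctuation + "«»”’"


def tokenise(text, known=KNOWN_WORDS):
    """Split on whitespace, then peel punctuation off each chunk
    unless the whole chunk is a known dictionary entry."""
    tokens = []
    for chunk in text.split():
        # A dictionary hit keeps the chunk intact, e.g. the abbreviation "esim."
        if chunk.lower() in known:
            tokens.append(chunk)
            continue
        # Peel leading punctuation, one character per token.
        start = 0
        while start < len(chunk) and chunk[start] in PUNCT:
            tokens.append(chunk[start])
            start += 1
        # Collect trailing punctuation, scanning from the right.
        end = len(chunk)
        trailing = []
        while end > start and chunk[end - 1] in PUNCT:
            trailing.append(chunk[end - 1])
            end -= 1
        if start < end:
            tokens.append(chunk[start:end])
        # Restore the original left-to-right order of trailing punctuation.
        tokens.extend(reversed(trailing))
    return tokens


if __name__ == "__main__":
    print(tokenise("Juna saapui, esim. Helsinkiin."))
```

Without the dictionary check, the abbreviation "esim." would lose its full stop; the lookup is what lets a tokeniser keep such forms whole while still separating sentence-final punctuation.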

EXAMPLES

The following command

omorfi-tokenise.py −i rautatie.text −o rautatie.tokens

tokenises a raw text corpus, reading rautatie.text and writing the result to rautatie.tokens.

COPYRIGHT

Copyright © 2016 Omorfi contributors. Licence GPLv3: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.