omorfi 0.9.9
Open morphology of Finnish
Public Member Functions
def __init__(self, verbosity=False)
def load_labelsegmenter(self, f)
def load_segmenter(self, f)
def load_analyser(self, f)
def load_generator(self, f)
def load_acceptor(self, f)
def load_tokeniser(self, f)
def load_lemmatiser(self, f)
def load_hyphenator(self, f)
def load_guesser(self, f)
def load_udpipe(self, str filename)
def fsa_tokenise(self, str line)
def python_tokenise(self, str line)
def tokenise(self, str line)
def analyse(self, Token token)
def analyse_sentence(self, str s)
def guess(self, Token token)
def lemmatise(self, Token token)
def segment(self, Token token)
def labelsegment(self, Token token)
def accept(self, Token token)
def generate(self, str omorstring)
def tokenise_sentence(self, str sentence)
def tokenise_plaintext(self, f)
def tokenise_conllu(self, f)
def tokenise_vislcg(self, f)
An object holding omorfi binaries for all the functions of omorfi.
The following functionalities use automata binaries that need to be loaded
separately:
* analysis
* tokenisation
* generation
* lemmatisation
* segmentation
* lookup
* guess
There is Python code to perform basic string munging, controlled by the
following bool attributes:
try_lowercase: to use `str.lower()`
try_titlecase: to use `str[0].upper() + str[1:]`
try_uppercase: to use `str.upper()`
try_detitlecase: to use `str[0].lower() + str[1:]`
The annotations are updated when a transformation has been applied.
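As a standalone illustration (not omorfi's own code), the four transformations these attributes enable can be sketched as follows; the function name is hypothetical:

```python
def recased_variants(surf, try_lowercase=True, try_titlecase=True,
                     try_uppercase=False, try_detitlecase=True):
    """Return the re-cased variants of `surf` that the corresponding
    flags would enable, skipping no-op and duplicate variants."""
    variants = []

    def add(variant):
        if variant != surf and variant not in variants:
            variants.append(variant)

    if try_lowercase:
        add(surf.lower())                    # str.lower()
    if try_titlecase:
        add(surf[:1].upper() + surf[1:])     # str[0].upper() + str[1:]
    if try_uppercase:
        add(surf.upper())                    # str.upper()
    if try_detitlecase:
        add(surf[:1].lower() + surf[1:])     # str[0].lower() + str[1:]
    return variants
```

With the defaults above, `recased_variants("Kissa")` yields only `["kissa"]`, since titlecasing is a no-op and detitlecasing duplicates the lowercased form.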
def omorfi.omorfi.Omorfi.__init__(self, verbosity=False)
Construct Omorfi with given verbosity for printouts.
def omorfi.omorfi.Omorfi.accept(self, Token token)
Check whether the token is in the dictionary.
Returns:
False for OOVs, True otherwise. Note that this is not necessarily
more efficient than bool(analyse(token)).
def omorfi.omorfi.Omorfi.analyse(self, Token token)
Perform a simple morphological analysis lookup.
The analysis will be performed for re-cased variants based on the
state of the member variables. The re-cased analyses will have more
penalty weight and additional analyses indicating the changes.
Side-effect:
The analyses are stored in the token, and only the new analyses
are returned.
Args:
token: token to be analysed.
Returns:
An HFST structure of raw analyses, or None if there are no matches
in the dictionary.
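The documented workflow (analyse every token, fall back to the guesser on OOVs) can be sketched as below. This is not omorfi's own code; the helper name is hypothetical, and `omorfi` is assumed to be an Omorfi instance with an analyser (and, for the fallback, a guesser) loaded:

```python
def analyse_line(omorfi, line):
    """Tokenise a line and analyse each token in place.

    `omorfi` is assumed to behave like an Omorfi instance with
    load_analyser() (and load_guesser()) already called.
    """
    tokens = omorfi.tokenise(line)
    for token in tokens:
        if not omorfi.analyse(token):   # None means no dictionary match
            omorfi.guess(token)         # speculate analyses for the OOV
    return tokens
```

Since analyses are stored in the tokens as a side effect, the returned list carries the results.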
def omorfi.omorfi.Omorfi.analyse_sentence(self, str s)
Analyse a full sentence, with tokenisation and guessing.
For details of tokenisation, see @c tokenise(self, s); for details of
analysis, see @c analyse(self, token). If further models, such as
UDPipe, are loaded, they may be used to fill in gaps.
def omorfi.omorfi.Omorfi.fsa_tokenise(self, str line)
Tokenise with FSA.
Args:
line: string to tokenise
Todo:
Not implemented (needs pmatch python support)
def omorfi.omorfi.Omorfi.generate(self, str omorstring)
Generate surface forms corresponding to a given token description.
Currently only supports very direct omor-style analysis string
generation.
Args:
omorstring: omorfi analysis string to generate from
Returns:
A surface word-form string; the omorstring argument if generation
fails; or None if the generator is not loaded.
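For orientation, omor-style analysis strings are sequences of bracketed feature=value pairs. The parser below and its sample string are illustrative only; the feature names follow omorfi's tag scheme but are an assumption here, not an exhaustive or authoritative inventory:

```python
import re

def parse_omor(omorstring):
    """Split an omor-style string such as
    '[WORD_ID=kissa][UPOS=NOUN][NUM=SG][CASE=NOM]' into a feature dict
    (if a feature repeats, the last value wins)."""
    return dict(re.findall(r"\[([^=\[\]]+)=([^\[\]]*)\]", omorstring))
```

Such a string, passed to generate(), would be expected to yield the corresponding surface form when the generator is loaded.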
def omorfi.omorfi.Omorfi.guess(self, Token token)
Speculate morphological analyses of an OOV token.
This method may use multiple information sources, but not the actual
analyser; a typical use is therefore after the analyse(token) call has
failed. Note that some information sources perform badly when guessing
without context; for these, analyse_sentence(sent) is the only option.
Side-effect:
This operation stores guesses in token for future use as well as
returning them.
Args:
token: token to analyse with guessers.
Returns:
New guesses as a list of Analysis objects.
def omorfi.omorfi.Omorfi.labelsegment(self, Token token)
Segment token into labelled morphs, words and other string pieces.
The segments are suffixed with their morphologically relevant
information, e.g. lexical classes for root lexemes and inflectional
features for inflectional segments. This functionality is experimental
due to the hacky way it was patched together.
Side-effect:
Note that this operation stores the labelsegments in the token for
future use, and only returns raw HFST structures. To get pythonic data
structures, use Token's methods afterwards.
Args:
token: token to segment with labels
Returns:
New labelled segmentations in the analysis list.
def omorfi.omorfi.Omorfi.lemmatise(self, Token token)
Lemmatise token, splitting it into valid word ids from the lexical
database.
Side-effect:
This operation stores lemmas in the token for future use and only
returns HFST structures. Use Token's methods to retrieve lemmas in
pythonic structures.
Args:
token: token to lemmatise
Returns:
New lemmas in the analysis list
def omorfi.omorfi.Omorfi.load_acceptor(self, f)
Load acceptor model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_analyser(self, f)
Load analysis model from a file. Also sets up a basic tokeniser and
lemmatiser using the analyser.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_generator(self, f)
Load generation model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_guesser(self, f)
Load guesser model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_hyphenator(self, f)
Load hyphenator model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_labelsegmenter(self, f)
Load labelled segmentation model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_lemmatiser(self, f)
Load lemmatiser model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_segmenter(self, f)
Load segmentation model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_tokeniser(self, f)
Load tokeniser model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_udpipe(self, str filename)
Load UDPipe model for statistical parsing.
UDPipe can be used as an extra information source for OOV symbols or
for all tokens. It works best with sentence-based analysis; token-based
analysis does not keep track of context.
Args:
filename: path to UDPipe model
def omorfi.omorfi.Omorfi.python_tokenise(self, str line)
Tokenise with python's basic string functions.
Args:
line: string to tokenise
def omorfi.omorfi.Omorfi.segment(self, Token token)
Segment token into morphs, words and other string pieces.
Side-effect:
This operation stores segments in the token for future use and only
returns the HFST structures. To get pythonic data, use Token's methods
afterwards.
Args:
token: token to segment
Returns:
New segmentations in the analysis list
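A minimal sketch of post-processing segmenter output into a list of morphs. The boundary marker names used here (`{MB}`, `{WB}`, `{hyph?}`) are assumptions for illustration; check actual segmenter output for the real marker inventory:

```python
import re

def split_segments(segmented, markers=("{MB}", "{WB}", "{hyph?}")):
    """Split a segmenter output string on boundary markers.

    `markers` is an assumed inventory, not omorfi's authoritative one;
    empty pieces between adjacent markers are dropped.
    """
    pattern = "|".join(re.escape(m) for m in markers)
    return [part for part in re.split(pattern, segmented) if part]
```

For example, a string segmented as "kissa{MB}lle" would split into the stem and the case ending.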
def omorfi.omorfi.Omorfi.tokenise(self, str line)
Perform tokenisation with the loaded tokeniser, if any, or `split()`.
If a tokeniser is available, it is applied to the input line and, if a
result is achieved, the line is split into tokens according to the
tokenisation strategy and returned as a list.
If no tokeniser is present, or none gives results, the line is
tokenised using Python's basic string functions. If an analyser is
present, the tokeniser will try harder to get some analyses for each
token, using a hard-coded list of extra splits.
Args:
line: a string to tokenise; should contain a line of text or a
sentence
Returns:
A list of tokens based on the line. The list may include boundary
non-tokens if e.g. sentence boundaries are recognised. For an empty
line a paragraph-break non-token may be returned.
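The plain-Python fallback path can be approximated as below. This is a rough illustrative stand-in, not omorfi's actual fallback code, and it returns plain strings rather than Token objects:

```python
import re

def naive_tokenise(line):
    """Whitespace-split a line, detaching punctuation as separate
    tokens: an approximation of a basic string-function tokeniser."""
    tokens = []
    for chunk in line.split():
        # runs of word characters stay together; each punctuation
        # mark becomes its own token
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

For instance, `naive_tokenise("Kissa istui, ja katsoi.")` detaches the comma and the full stop into their own tokens.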
def omorfi.omorfi.Omorfi.tokenise_conllu(self, f)
Tokenise a CONLL-U sentence or comment.
Should be used with a file-like iterable that has a CONLL-U sentence,
comment or empty block coming up.
Args:
f: file-like object with iterable strings
Returns:
list of tokens
def omorfi.omorfi.Omorfi.tokenise_plaintext(self, f)
Tokenise a whole text.
Args:
f: file-like object with iterable strings
Returns:
list of tokens
def omorfi.omorfi.Omorfi.tokenise_sentence(self, str sentence)
Tokenise a sentence.
To be used when the text is already sentence-split. If the text is
plain text with sentence boundaries within lines, use tokenise(self,
line) instead.
Args:
sentence: a string containing one sentence
Returns:
list of tokens in the sentence
def omorfi.omorfi.Omorfi.tokenise_vislcg(self, f)
Tokenise a sentence from VISL-CG format data.
Returns a list of tokens when it hits the first non-token block,
including a token representing this non-token block.
Args:
f: file-like object to iterate strings of VISL-CG data
Returns:
list of tokens