omorfi 0.9.9
Open morphology of Finnish
Public Member Functions

def __init__ (self, verbosity=False)
def load_labelsegmenter (self, f)
def load_segmenter (self, f)
def load_analyser (self, f)
def load_generator (self, f)
def load_acceptor (self, f)
def load_tokeniser (self, f)
def load_lemmatiser (self, f)
def load_hyphenator (self, f)
def load_guesser (self, f)
def load_udpipe (self, str filename)
def fsa_tokenise (self, str line)
def python_tokenise (self, str line)
def tokenise (self, str line)
def analyse (self, Token token)
def analyse_sentence (self, str s)
def guess (self, Token token)
def lemmatise (self, Token token)
def segment (self, Token token)
def labelsegment (self, Token token)
def accept (self, Token token)
def generate (self, str omorstring)
def tokenise_sentence (self, str sentence)
def tokenise_plaintext (self, f)
def tokenise_conllu (self, f)
def tokenise_vislcg (self, f)
An object holding omorfi binaries for all the functions of omorfi.

The following functionalities use automata binaries that need to be loaded separately:

* analysis
* tokenisation
* generation
* lemmatisation
* segmentation
* lookup
* guess

There is python code to perform basic string munging, controlled by the following bool attributes:

* try_lowercase: to use `str.lower()`
* try_titlecase: to use `str[0].upper() + str[1:]`
* try_uppercase: to use `str.upper()`
* try_detitlecase: to use `str[0].lower() + str[1:]`

The annotations will be changed when a transformation has been applied.
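A minimal usage sketch based on the description above (the model path is hypothetical; adjust it to your installation):

    from omorfi.omorfi import Omorfi

    omorfi = Omorfi(verbosity=True)
    omorfi.load_analyser("/usr/local/share/omorfi/omorfi.analyse.hfst")  # hypothetical path

    # the string-munging fallbacks described above
    omorfi.try_lowercase = True
    omorfi.try_titlecase = True
    omorfi.try_detitlecase = True
    omorfi.try_uppercase = False

    for token in omorfi.tokenise("Kissa istui matolla."):
        omorfi.analyse(token)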
def omorfi.omorfi.Omorfi.__init__(self, verbosity=False)
Construct Omorfi with given verbosity for printouts.
def omorfi.omorfi.Omorfi.accept(self, Token token)
Check whether the token is in the dictionary or not.

Returns:
    False for OOVs, True otherwise. Note that this is not necessarily more efficient than bool(analyse(token)).
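A sketch of filtering OOVs with accept(), assuming the acceptor (or analyser) model has been loaded as above:

    oovs = [token for token in omorfi.tokenise("Kissa istui matolla.")
            if not omorfi.accept(token)]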
def omorfi.omorfi.Omorfi.analyse(self, Token token)
Perform a simple morphological analysis lookup. The analysis will be performed for re-cased variants based on the state of the member variables. The re-cased analyses will have more penalty weight and additional analyses indicating the changes.

Side-effects:
    The analyses are stored in the token, and only the new analyses are returned.

Args:
    token: token to be analysed.

Returns:
    An HFST structure of raw analyses, or None if there are no matches in the dictionary.
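A single-token lookup sketch, continuing the example above; note the documented side-effect that the analyses are also stored on the token:

    tokens = omorfi.tokenise("taloissa")
    raw = omorfi.analyse(tokens[0])
    if raw is None:
        print("no match in dictionary")  # the token is an OOV
    # otherwise raw holds the HFST structures; the token now carries the analyses too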
def omorfi.omorfi.Omorfi.analyse_sentence(self, str s)
Analyse a full sentence with tokenisation and guessing. For details of tokenisation, see @c tokenise(self, s); for details of analysis, see @c analyse(self, token). If further models such as UDPipe are loaded, they may be used to fill in gaps.
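A whole-sentence sketch covering tokenisation, analysis and guessing in one call:

    analysed = omorfi.analyse_sentence("Kissa istui matolla.")
    # analysed is assumed here to hold the analysed tokens; the exact
    # return shape is not specified in this reference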
def omorfi.omorfi.Omorfi.fsa_tokenise(self, str line)
Tokenise with FSA.

Args:
    line: string to tokenise

Todo:
    Not implemented (needs pmatch python support).
def omorfi.omorfi.Omorfi.generate(self, str omorstring)
Generate surface forms corresponding to a given token description. Currently only supports very direct omor-style analysis string generation.

Args:
    omorstring: omorfi analysis string to generate from

Returns:
    A surface word-form string, or the omorstring argument if generation fails, or None if no generator is loaded.
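A generation sketch; the path and the omor-style analysis string below are assumed examples, not guaranteed to match your installed data:

    omorfi.load_generator("/usr/local/share/omorfi/omorfi.generate.hfst")  # hypothetical path
    surface = omorfi.generate("[WORD_ID=talo][UPOS=NOUN][NUM=SG][CASE=INE]")  # assumed omor string
    # surface is the word-form, the input string if generation failed,
    # or None if no generator was loaded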
def omorfi.omorfi.Omorfi.guess(self, Token token)
Speculate morphological analyses of an OOV token. This method may use multiple information sources, but not the actual analyser; a typical use is therefore after the analyse(token) function has failed. Note that some information sources perform badly when guessing without context; for these, analyse_sentence(sent) is the only option.

Side-effect:
    This operation stores guesses in the token for future use, as well as returning them.

Args:
    token: token to analyse with guessers.

Returns:
    New guesses as a list of Analysis objects.
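A sketch of the analyse-then-guess pattern described above; the model path is hypothetical:

    omorfi.load_guesser("/usr/local/share/omorfi/omorfi.guess.hfst")  # hypothetical path
    for token in omorfi.tokenise("fnörgelöissä"):
        if omorfi.analyse(token) is None:
            guesses = omorfi.guess(token)  # list of Analysis objects, also stored in the token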
def omorfi.omorfi.Omorfi.labelsegment(self, Token token)
Segment token into labelled morphs, words and other string pieces. The segments are suffixed with their morphologically relevant information, e.g. lexical classes for root lexemes and inflectional features for inflectional segments. This functionality is experimental due to the hacky way it was patched together.

Side-effect:
    This operation stores the labelsegments in the token for future use, and only returns raw HFST structures. To get pythonic data, use Token's methods afterwards.

Args:
    token: token to segment with labels

Returns:
    New labelled segmentations in analysis list.
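A labelled-segmentation sketch, assuming the model has been loaded with load_labelsegmenter(); only raw HFST structures come back, with pythonic access left to Token's methods:

    omorfi.load_labelsegmenter("/usr/local/share/omorfi/omorfi.labelsegment.hfst")  # hypothetical path
    for token in omorfi.tokenise("taloissa"):
        omorfi.labelsegment(token)  # labelsegments are also stored in the token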
def omorfi.omorfi.Omorfi.lemmatise(self, Token token)
Lemmatise token, splitting it into valid word ids from the lexical database.

Side-effect:
    This operation stores lemmas in the token for future use and only returns HFST structures. Use Token's methods to retrieve the lemmas in pythonic structures.

Args:
    token: token to lemmatise

Returns:
    New lemmas in analysis list.
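A lemmatisation sketch; load_analyser() also sets up a basic lemmatiser, so a separate lemmatiser model is optional here:

    for token in omorfi.tokenise("taloissa"):
        omorfi.lemmatise(token)  # lemmas are stored in the token as a side-effect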
def omorfi.omorfi.Omorfi.load_acceptor(self, f)
Load acceptor model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_analyser(self, f)
Load analysis model from a file. Also sets up a basic tokeniser and lemmatiser using the analyser.

Args:
    f: file containing a single HFST automaton binary.
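Because load_analyser() also sets up a basic tokeniser and lemmatiser, a minimal pipeline needs only this one model (the path is hypothetical):

    from omorfi.omorfi import Omorfi

    omorfi = Omorfi()
    omorfi.load_analyser("/usr/local/share/omorfi/omorfi.analyse.hfst")  # hypothetical path
    tokens = omorfi.tokenise("Kissa istui matolla.")  # works without load_tokeniser()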
def omorfi.omorfi.Omorfi.load_generator(self, f)
Load generation model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_guesser(self, f)
Load guesser model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_hyphenator(self, f)
Load hyphenator model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_labelsegmenter(self, f)
Load labelled segmentation model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_lemmatiser(self, f)
Load lemmatiser model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_segmenter(self, f)
Load segmentation model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_tokeniser(self, f)
Load tokeniser model from a file.

Args:
    f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_udpipe(self, str filename)
Load UDPipe model for statistical parsing. UDPipe can be used as an extra information source for OOV symbols or for all tokens. It works best with sentence-based analysis; token-based analysis does not keep track of context.

Args:
    filename: path to UDPipe model
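A sketch of adding UDPipe as a fallback source; the model path is hypothetical, and per the note above it is most useful with sentence-based analysis:

    omorfi.load_udpipe("/usr/local/share/omorfi/fi.udpipe")  # hypothetical path
    omorfi.analyse_sentence("Kissa istui fnörgelillä.")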
def omorfi.omorfi.Omorfi.python_tokenise(self, str line)
Tokenise with python's basic string functions.

Args:
    line: string to tokenise
def omorfi.omorfi.Omorfi.segment(self, Token token)
Segment token into morphs, words and other string pieces.

Side-effect:
    This operation stores segments in the token for future use and only returns the HFST structures. To get pythonic data, use Token's methods afterwards.

Args:
    token: token to segment

Returns:
    New segmentations in analysis list.
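A segmentation sketch, assuming the segmenter model has been loaded with load_segmenter():

    omorfi.load_segmenter("/usr/local/share/omorfi/omorfi.segment.hfst")  # hypothetical path
    for token in omorfi.tokenise("taloissa"):
        omorfi.segment(token)  # segments are stored in the token; raw HFST structures are returned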
def omorfi.omorfi.Omorfi.tokenise(self, str line)
Perform tokenisation with the loaded tokeniser, if any, or `split()`. If a tokeniser is available, it is applied to the input line, and if a result is achieved, it is split into tokens according to the tokenisation strategy and returned as a list. If no tokeniser is present, or none gives results, the line will be tokenised using python's basic string functions. If an analyser is present, the tokeniser will try harder to get some analyses for each token, using a hard-coded list of extra splits.

Args:
    line: a string to be tokenised; should contain a line of text or a sentence

Returns:
    A list of tokens based on the line. The list may include boundary non-tokens if e.g. sentence boundaries are recognised. For an empty line, a paragraph-break non-token may be returned.
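A tokenisation sketch; with no tokeniser loaded this falls back to python's basic string functions, as described above:

    tokens = omorfi.tokenise("Kissa istui matolla. Koira haukkui.")
    # the list may include boundary non-tokens, e.g. for recognised sentence boundaries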
def omorfi.omorfi.Omorfi.tokenise_conllu(self, f)
Tokenise a CoNLL-U sentence or comment. Should be used with a file-like iterable that has a CoNLL-U sentence, comment, or empty block coming up.

Args:
    f: filelike object with iterable strings

Returns:
    list of tokens
def omorfi.omorfi.Omorfi.tokenise_plaintext(self, f)
Tokenise a whole text.

Args:
    f: filelike object with iterable strings

Returns:
    list of tokens
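A file-level sketch; the filename is hypothetical, and the same file-like pattern applies to tokenise_conllu() and tokenise_vislcg() for their respective formats:

    with open("input.txt") as f:  # hypothetical file of plain text
        tokens = omorfi.tokenise_plaintext(f)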
def omorfi.omorfi.Omorfi.tokenise_sentence(self, str sentence)
Tokenise a sentence. To be used when the text is already sentence-split. If the text is plain text with sentence boundaries within lines, use tokenise(self, line) instead.

Args:
    sentence: a string containing one sentence

Returns:
    list of tokens in sentence
def omorfi.omorfi.Omorfi.tokenise_vislcg(self, f)
Tokenise a sentence from VISL CG format data. Returns a list of tokens when it hits the first non-token block, including a token representing this non-token block.

Args:
    f: filelike object to iterate strings of vislcg data

Returns:
    list of tokens