omorfi 0.9.9
Open morphology of Finnish
omorfi.omorfi.Omorfi Class Reference

Public Member Functions

def __init__ (self, verbosity=False)
 
def load_labelsegmenter (self, f)
 
def load_segmenter (self, f)
 
def load_analyser (self, f)
 
def load_generator (self, f)
 
def load_acceptor (self, f)
 
def load_tokeniser (self, f)
 
def load_lemmatiser (self, f)
 
def load_hyphenator (self, f)
 
def load_guesser (self, f)
 
def load_udpipe (self, str filename)
 
def fsa_tokenise (self, str line)
 
def python_tokenise (self, str line)
 
def tokenise (self, str line)
 
def analyse (self, Token token)
 
def analyse_sentence (self, str s)
 
def guess (self, Token token)
 
def lemmatise (self, Token token)
 
def segment (self, Token token)
 
def labelsegment (self, Token token)
 
def accept (self, Token token)
 
def generate (self, str omorstring)
 
def tokenise_sentence (self, str sentence)
 
def tokenise_plaintext (self, f)
 
def tokenise_conllu (self, f)
 
def tokenise_vislcg (self, f)
 

Data Fields

 analyser
 analyser model
 
 tokeniser
 tokeniser model
 
 generator
 generator model
 
 lemmatiser
 lemmatising model
 
 hyphenator
 hyphenating model
 
 segmenter
 segmenting model
 
 labelsegmenter
 label-segment model
 
 acceptor
 acceptor model
 
 guesser
 guesser model
 
 udpiper
 UDPipe model.
 
 udpipeline
 UDPipeline object :-(.
 
 uderror
 UDError object :-(.
 
 lexlogprobs
 database of lexical unigram probabilities
 
 taglogprobs
 database of tag unigram probabilities
 
 try_lowercase
 whether to lowercase and re-analyse if needed
 
 try_titlecase
 whether to Titlecase and re-analyse if needed
 
 try_detitlecase
 whether to dEtitlecase and re-analyse if needed
 
 try_detitle_firstinsent
 whether to dEtitlecase the first token of a sentence and re-analyse if needed
 
 try_uppercase
 whether to UPPERCASE and re-analyse if needed
 
 can_accept
 whether acceptor model is loaded
 
 can_analyse
 whether analyser model is loaded
 
 can_tokenise
 whether tokeniser model is loaded
 
 can_generate
 whether generator model is loaded
 
 can_lemmatise
 whether lemmatising model is loaded
 
 can_hyphenate
 whether hyphenation model is loaded
 
 can_segment
 whether segmentation model is loaded
 
 can_labelsegment
 whether label segmentation model is loaded
 
 can_guess
 whether guesser model is loaded
 
 can_udpipe
 whether UDPipe is loaded
 

Detailed Description

An object holding omorfi binaries for all the functions of omorfi.

The following functionalities use automata binaries that need to be loaded
separately:
* analysis
* tokenisation
* generation
* lemmatisation
* segmentation
* lookup
* guess

Python code performs basic string munging, controlled by the following
bool attributes:
    try_lowercase: to use `str.lower()`
    try_titlecase: to use `str[0].upper() + str[1:]`
    try_uppercase: to use `str.upper()`
    try_detitlecase: to use `str[0].lower() + str[1:]`

The annotations will be changed when a transformation has been applied.
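
The re-casing expressions listed above can be sketched in plain Python. This is only an illustration of the string operations, not omorfi's implementation: the function name `recased_variants` and the labels attached to each variant are assumptions made for this example.

```python
def recased_variants(surface, try_lowercase=True, try_titlecase=True,
                     try_uppercase=True, try_detitlecase=True):
    """Return (label, variant) pairs for re-cased forms that differ from surface.

    The labels are hypothetical; they only show where an annotation
    about the applied transformation could be attached.
    """
    candidates = []
    if try_lowercase:
        candidates.append(("lowercased", surface.lower()))
    if try_titlecase and surface:
        candidates.append(("titlecased", surface[0].upper() + surface[1:]))
    if try_uppercase:
        candidates.append(("uppercased", surface.upper()))
    if try_detitlecase and surface:
        candidates.append(("detitlecased", surface[0].lower() + surface[1:]))
    # Drop variants identical to the input or to an earlier variant.
    seen = {surface}
    variants = []
    for label, variant in candidates:
        if variant not in seen:
            seen.add(variant)
            variants.append((label, variant))
    return variants
```

Each flag enables one transformation, and duplicates are dropped so that a variant is looked up at most once.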

Constructor & Destructor Documentation

◆ __init__()

def omorfi.omorfi.Omorfi.__init__ (   self,
  verbosity = False 
)
Construct Omorfi with given verbosity for printouts.

Member Function Documentation

◆ accept()

def omorfi.omorfi.Omorfi.accept (   self,
Token  token 
)
Check if the token is in the dictionary or not.

Returns:
    False for OOVs, True otherwise. Note that this is not
    necessarily more efficient than bool(analyse(token)).

◆ analyse()

def omorfi.omorfi.Omorfi.analyse (   self,
Token  token 
)
Perform a simple morphological analysis lookup.

The analysis will be performed for re-cased variants based on the
state of the member variables. The re-cased analyses will have a higher
penalty weight and additional analyses indicating the changes.

Side-Effects:
    The analyses are stored in the token, and only the new analyses
    are returned.

Args:
    token: token to be analysed.

Returns:
    An HFST structure of raw analyses, or None if there are no matches
    in the dictionary.
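
The contract described above (new analyses are stored in the token and returned, re-cased matches are penalised, and None signals no match) can be illustrated with a toy model. Everything here is an assumption for illustration: `SketchToken`, `sketch_analyse`, the dict lexicon, the reading strings, the penalty value, and the choice to re-case only when the direct lookup fails. The real method returns HFST structures rather than tuples.

```python
RECASE_PENALTY = 1.0  # assumed value; the real weights are not documented here


class SketchToken:
    """Hypothetical stand-in for a token; not omorfi's Token class."""

    def __init__(self, surf):
        self.surf = surf
        self.analyses = []  # accumulated (analysis, weight) pairs


def sketch_analyse(token, lexicon, try_lowercase=True):
    """Look up token.surf in a toy dict lexicon, retrying lowercased if needed."""
    new = [(reading, 0.0) for reading in lexicon.get(token.surf, [])]
    lowered = token.surf.lower()
    if not new and try_lowercase and lowered != token.surf:
        # Re-cased match: add a marker and a penalty weight.
        new = [(reading + " <lowercased>", RECASE_PENALTY)
               for reading in lexicon.get(lowered, [])]
    if not new:
        return None  # no matches in the dictionary
    token.analyses.extend(new)  # side effect: stored in the token
    return new  # only the new analyses are returned


TOY_LEXICON = {"talo": ["talo N Sg Nom"]}
```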

◆ analyse_sentence()

def omorfi.omorfi.Omorfi.analyse_sentence (   self,
str  s 
)
Analyse a full sentence with tokenisation and guessing.

For details of tokenisation, see tokenise(self, line).
For details of analysis, see analyse(self, token).
If further models such as UDPipe are loaded, they may be used to fill in gaps.

◆ fsa_tokenise()

def omorfi.omorfi.Omorfi.fsa_tokenise (   self,
str  line 
)
Tokenise with FSA.

Args:
    line:  string to tokenise

Todo:
    Not implemented (needs pmatch python support)

◆ generate()

def omorfi.omorfi.Omorfi.generate (   self,
str  omorstring 
)
Generate surface forms corresponding to the given token description.

Currently only supports very direct omor style analysis string
generation.

Args:
    omorstring: Omorfi analysis string to generate

Returns:
    A surface string word-form, or the omorstring argument if
    generation fails, or None if the generator is not loaded.
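
Because the return value has three possible meanings, callers may want to tell them apart explicitly. The following sketch mimics the documented contract with a plain dict standing in for the generator model. The names `sketch_generate` and `generation_succeeded` and the toy analysis string are hypothetical.

```python
def sketch_generate(omorstring, generator=None):
    """Mimic the documented return contract, with a dict as a toy generator."""
    if generator is None:
        return None  # generator not loaded
    # On failure, hand back the input string unchanged.
    return generator.get(omorstring, omorstring)


def generation_succeeded(omorstring, result):
    """True only when a generator was loaded and produced a different string."""
    return result is not None and result != omorstring


TOY_GENERATOR = {"talo N Pl Nom": "talot"}  # toy analysis string, not omor format
```

Comparing the result with the input is the only way to detect a failed generation under this contract, so a check like `generation_succeeded` is worth keeping next to every call.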

◆ guess()

def omorfi.omorfi.Omorfi.guess (   self,
Token  token 
)
Speculate morphological analyses of OOV token.

This method may use multiple information sources, but not the actual
analyser. Therefore a typical use of this is after the analyse(token)
function has failed. Note that some information sources perform badly
when guessing without context; for these, analyse_sentence(sent) is
the only option.

Side-effect:
    This operation stores guesses in token for future use as well as
    returning them.

Args:
    token: token to analyse with guessers.

Returns:
    New guesses as a list of Analysis objects.
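
The typical use described above, guessing only after analysis has failed, might look like the following sketch. It relies only on the documented `analyse`, `guess` and `can_guess` members. The helper `analyse_or_guess` and the `StubOmorfi` class are hypothetical and exist so that the example runs without any models loaded.

```python
def analyse_or_guess(omorfi, token):
    """Return analyses for token, falling back to guesses for OOV tokens."""
    analyses = omorfi.analyse(token)
    if analyses:
        return analyses
    if omorfi.can_guess:
        return omorfi.guess(token)
    return []


class StubOmorfi:
    """Hypothetical stand-in; a real Omorfi object works on Token objects."""

    can_guess = True

    def analyse(self, token):
        # Pretend only one word is in the dictionary.
        return ["known"] if token == "talo" else None

    def guess(self, token):
        return ["guess:" + token]
```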

◆ labelsegment()

def omorfi.omorfi.Omorfi.labelsegment (   self,
Token  token 
)
Segment token into labelled morphs, words and other string pieces.

The segments are suffixed with their morphologically relevant
information, e.g. lexical classes for root lexemes and inflectional
features for inflectional segments. This functionality is experimental
due to the hacky way it was patched together.

Side-effect:
    Note that this operation stores the labelsegments in the token for
    future use, and only returns raw HFST structures. To get pythonic
    data, use Token's methods afterwards.

Args:
    token: token to segment with labels

Returns:
    New labelled segmentations in analysis list.

◆ lemmatise()

def omorfi.omorfi.Omorfi.lemmatise (   self,
Token  token 
)
Lemmatise token, splitting it into valid word ids from the lexical db.

Side-effect:
    This operation stores lemmas in the token for future use and only
    returns HFST structures. Use Token's methods to retrieve lemmas
    in pythonic structures.

Args:
    token: token to lemmatise

Returns:
    New lemmas in analysis list

◆ load_acceptor()

def omorfi.omorfi.Omorfi.load_acceptor (   self,
  f 
)
Load acceptor model from a file.

Args:
    f: containing single hfst automaton binary.

◆ load_analyser()

def omorfi.omorfi.Omorfi.load_analyser (   self,
  f 
)
Load analysis model from a file. Also sets up a basic tokeniser and
lemmatiser using the analyser.

Args:
    f: file containing a single HFST automaton binary.

◆ load_generator()

def omorfi.omorfi.Omorfi.load_generator (   self,
  f 
)
Load generation model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_guesser()

def omorfi.omorfi.Omorfi.load_guesser (   self,
  f 
)
Load guesser model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_hyphenator()

def omorfi.omorfi.Omorfi.load_hyphenator (   self,
  f 
)
Load hyphenator model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_labelsegmenter()

def omorfi.omorfi.Omorfi.load_labelsegmenter (   self,
  f 
)
Load labeled segments model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_lemmatiser()

def omorfi.omorfi.Omorfi.load_lemmatiser (   self,
  f 
)
Load lemmatiser model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_segmenter()

def omorfi.omorfi.Omorfi.load_segmenter (   self,
  f 
)
Load segmentation model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_tokeniser()

def omorfi.omorfi.Omorfi.load_tokeniser (   self,
  f 
)
Load tokeniser model from a file.

Args:
    f: file containing a single HFST automaton binary.

◆ load_udpipe()

def omorfi.omorfi.Omorfi.load_udpipe (   self,
str  filename 
)
Load UDPipe model for statistical parsing.

UDPipe can be used as an extra information source for OOV symbols
or for all tokens. It works best with sentence-based analysis, since
token-based analysis does not keep track of context.

Args:
    filename: path to UDPipe model

◆ python_tokenise()

def omorfi.omorfi.Omorfi.python_tokenise (   self,
str  line 
)
Tokenise with python's basic string functions.

Args:
    line:  string to tokenise

◆ segment()

def omorfi.omorfi.Omorfi.segment (   self,
Token  token 
)
Segment token into morphs, words and other string pieces.

Side-effect:
    this operation stores segments in the token for future
use and only returns the HFST structures. To get pythonic data use
Token's methods afterwards.

Args:
    token: token to segment

Returns:
    New segmentations in analysis list

◆ tokenise()

def omorfi.omorfi.Omorfi.tokenise (   self,
str  line 
)
Perform tokenisation with loaded tokeniser if any, or `split()`.

If a tokeniser is available, it is applied to the input line, and if a
result is achieved, it is split into tokens according to the tokenisation
strategy and returned as a list.

If no tokeniser is present, or it gives no results, the line will be
tokenised using python's basic string functions. If an analyser is
present, the tokeniser will try harder to get some analyses for each
token using a hard-coded list of extra splits.

Args:
    line: a string to be tokenised, should contain a line of text or a
          sentence

Returns:
    A list of tokens based on the line. The list may include boundary
    non-tokens if e.g. sentence boundaries are recognised. For an empty
    line, a paragraph break non-token may be returned.
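
The fallback order described above can be sketched as follows. This is a simplified illustration under stated assumptions: `sketch_tokenise` is a hypothetical name, the tokeniser is modelled as any callable returning a list of strings, and the analyser-driven extra splits and boundary non-tokens are left out.

```python
def sketch_tokenise(line, tokeniser=None):
    """Use the tokeniser when it yields a result, otherwise fall back to split()."""
    if tokeniser is not None:
        tokens = tokeniser(line)
        if tokens:
            return tokens
    # No tokeniser, or no result: plain whitespace splitting.
    return line.split()


def punctuation_aware(line):
    """Toy tokeniser (an assumption): detach a sentence-final full stop."""
    tokens = []
    for piece in line.split():
        if len(piece) > 1 and piece.endswith("."):
            tokens.extend([piece[:-1], "."])
        else:
            tokens.append(piece)
    return tokens
```

With no tokeniser the full stop stays attached to the last word, which is the main visible difference between the two paths.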

◆ tokenise_conllu()

def omorfi.omorfi.Omorfi.tokenise_conllu (   self,
  f 
)
Tokenise a CoNLL-U sentence or comment.

Should be used with a file-like iterable that has a CoNLL-U sentence,
comment or empty block coming up.

Args:
    f: filelike object with iterable strings

Returns:
    list of tokens
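
To show what consuming one block at a time from a file-like iterable can look like, here is a minimal sketch that reads a single CoNLL-U sentence block and extracts the comment lines and the FORM column. It is not omorfi's implementation: `read_conllu_block` and `_row` are hypothetical helpers, the sketch returns plain strings instead of Token objects, and it skips leading blank lines instead of reporting an empty block.

```python
import io


def read_conllu_block(f):
    """Read up to the next blank line; return (comments, surface forms)."""
    comments, forms = [], []
    for line in f:
        line = line.rstrip("\n")
        if not line:
            if comments or forms:
                break  # a blank line ends the current block
            continue  # skip leading blank lines
        if line.startswith("#"):
            comments.append(line)
        else:
            fields = line.split("\t")
            forms.append(fields[1])  # FORM is the second CoNLL-U column
    return comments, forms


def _row(index, form):
    """Build one 10-column CoNLL-U token line with unspecified fields."""
    return "\t".join([str(index), form] + ["_"] * 8) + "\n"


SAMPLE = io.StringIO("".join([
    "# text = Talo on.\n", _row(1, "Talo"), _row(2, "on"), _row(3, "."), "\n",
    "# text = Kissa.\n", _row(1, "Kissa"), "\n",
]))
```

Calling `read_conllu_block(SAMPLE)` repeatedly yields one sentence per call, because the file-like object keeps its position between calls.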

◆ tokenise_plaintext()

def omorfi.omorfi.Omorfi.tokenise_plaintext (   self,
  f 
)
Tokenise a whole text.

Args:
    f: filelike object with iterable strings

Returns:
    list of tokens

◆ tokenise_sentence()

def omorfi.omorfi.Omorfi.tokenise_sentence (   self,
str  sentence 
)
Tokenise a sentence.

To be used when text is already split into sentences. If the
text is plain text with sentence boundaries within lines,
use

Args:
    sentence: a string containing one sentence

Returns:
    list of tokens in sentence

◆ tokenise_vislcg()

def omorfi.omorfi.Omorfi.tokenise_vislcg (   self,
  f 
)
Tokenises a sentence from VISL-CG format data.

Returns a list of tokens when it hits first non-token block, including
a token representing this non-token block.

Args:
    f: filelike object to iterate strings of VISL-CG data

Returns:
    list of tokens
