omorfi 0.9.9
Open morphology of Finnish
Public Member Functions
def __init__(self, verbosity=False)
def load_labelsegmenter(self, f)
def load_segmenter(self, f)
def load_analyser(self, f)
def load_generator(self, f)
def load_acceptor(self, f)
def load_tokeniser(self, f)
def load_lemmatiser(self, f)
def load_hyphenator(self, f)
def load_guesser(self, f)
def load_udpipe(self, str filename)
def fsa_tokenise(self, str line)
def python_tokenise(self, str line)
def tokenise(self, str line)
def analyse(self, Token token)
def analyse_sentence(self, str s)
def guess(self, Token token)
def lemmatise(self, Token token)
def segment(self, Token token)
def labelsegment(self, Token token)
def accept(self, Token token)
def generate(self, str omorstring)
def tokenise_sentence(self, str sentence)
def tokenise_plaintext(self, f)
def tokenise_conllu(self, f)
def tokenise_vislcg(self, f)
An object holding omorfi binaries for all the functions of omorfi.
The following functionalities use automata binaries that need to be loaded
separately:
* analysis
* tokenisation
* generation
* lemmatisation
* segmentation
* lookup
* guess
There is Python code to perform basic string munging, controlled by the
following bool attributes:
try_lowercase: to use `str.lower()`
try_titlecase: to use `str[0].upper() + str[1:]`
try_uppercase: to use `str.upper()`
try_detitlecase: to use `str[0].lower() + str[1:]`
The annotations are updated when a transformation has been applied.
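As a standalone illustration (not omorfi's own code), the four transformations these attributes enable can be sketched as follows; the function name is hypothetical:

```python
def recased_variants(surf, try_lowercase=True, try_titlecase=True,
                     try_uppercase=False, try_detitlecase=True):
    """Return the re-cased variants of `surf` that the corresponding
    flags would enable, skipping no-op and duplicate variants."""
    variants = []

    def add(variant):
        if variant != surf and variant not in variants:
            variants.append(variant)

    if try_lowercase:
        add(surf.lower())                    # str.lower()
    if try_titlecase:
        add(surf[:1].upper() + surf[1:])     # str[0].upper() + str[1:]
    if try_uppercase:
        add(surf.upper())                    # str.upper()
    if try_detitlecase:
        add(surf[:1].lower() + surf[1:])     # str[0].lower() + str[1:]
    return variants
```

With the defaults above, `recased_variants("Kissa")` yields only `["kissa"]`, since titlecasing is a no-op and detitlecasing duplicates the lowercased form.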
def omorfi.omorfi.Omorfi.__init__(self, verbosity=False)
Construct Omorfi with given verbosity for printouts.
def omorfi.omorfi.Omorfi.accept(self, Token token)
Check whether the token is in the dictionary.
Returns:
False for OOVs, True otherwise. Note that this is not necessarily
more efficient than bool(analyse(token)).
def omorfi.omorfi.Omorfi.analyse(self, Token token)
Perform a simple morphological analysis lookup.
The analysis will be performed for re-cased variants based on the
state of the member variables. The re-cased analyses will have more
penalty weight and additional analyses indicating the changes.
Side-effect:
The analyses are stored in the token, and only the new analyses
are returned.
Args:
token: token to be analysed.
Returns:
An HFST structure of raw analyses, or None if there are no matches
in the dictionary.
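The documented workflow (analyse every token, fall back to the guesser on OOVs) can be sketched as below. This is not omorfi's own code; the helper name is hypothetical, and `omorfi` is assumed to be an Omorfi instance with an analyser (and, for the fallback, a guesser) loaded:

```python
def analyse_line(omorfi, line):
    """Tokenise a line and analyse each token in place.

    `omorfi` is assumed to behave like an Omorfi instance with
    load_analyser() (and load_guesser()) already called.
    """
    tokens = omorfi.tokenise(line)
    for token in tokens:
        if not omorfi.analyse(token):   # None means no dictionary match
            omorfi.guess(token)         # speculate analyses for the OOV
    return tokens
```

Since analyses are stored in the tokens as a side effect, the returned list carries the results.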
def omorfi.omorfi.Omorfi.analyse_sentence(self, str s)
Analyse a full sentence, with tokenisation and guessing.
For details of tokenisation, see @c tokenise(self, s); for details of
analysis, see @c analyse(self, token). If further models, such as
UDPipe, are loaded, they may be used to fill in gaps.
def omorfi.omorfi.Omorfi.fsa_tokenise(self, str line)
Tokenise with FSA.
Args:
line: string to tokenise
Todo:
Not implemented (needs pmatch python support)
def omorfi.omorfi.Omorfi.generate(self, str omorstring)
Generate surface forms corresponding to a given token description.
Currently only supports very direct omor-style analysis string
generation.
Args:
omorstring: omorfi analysis string to generate from
Returns:
A surface word-form string; the omorstring argument if generation
fails; or None if the generator is not loaded.
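For orientation, omor-style analysis strings are sequences of bracketed feature=value pairs. The parser below and its sample string are illustrative only; the feature names follow omorfi's tag scheme but are an assumption here, not an exhaustive or authoritative inventory:

```python
import re

def parse_omor(omorstring):
    """Split an omor-style string such as
    '[WORD_ID=kissa][UPOS=NOUN][NUM=SG][CASE=NOM]' into a feature dict
    (if a feature repeats, the last value wins)."""
    return dict(re.findall(r"\[([^=\[\]]+)=([^\[\]]*)\]", omorstring))
```

Such a string, passed to generate(), would be expected to yield the corresponding surface form when the generator is loaded.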
def omorfi.omorfi.Omorfi.guess(self, Token token)
Speculate morphological analyses of an OOV token.
This method may use multiple information sources, but not the actual
analyser; a typical use is therefore after the analyse(token) call has
failed. Note that some information sources perform badly when guessing
without context; for these, analyse_sentence(sent) is the only option.
Side-effect:
This operation stores guesses in token for future use as well as
returning them.
Args:
token: token to analyse with guessers.
Returns:
New guesses as a list of Analysis objects.
def omorfi.omorfi.Omorfi.labelsegment(self, Token token)
Segment token into labelled morphs, words and other string pieces.
The segments are suffixed with their morphologically relevant
information, e.g. lexical classes for root lexemes and inflectional
features for inflectional segments. This functionality is experimental
due to the hacky way it was patched together.
Side-effect:
Note that this operation stores the labelsegments in the token for
future use, and only returns raw HFST structures. To get pythonic data
structures, use Token's methods afterwards.
Args:
token: token to segment with labels
Returns:
New labelled segmentations in the analysis list.
def omorfi.omorfi.Omorfi.lemmatise(self, Token token)
Lemmatise token, splitting it into valid word ids from the lexical
database.
Side-effect:
This operation stores lemmas in the token for future use and only
returns HFST structures. Use Token's methods to retrieve lemmas in
pythonic structures.
Args:
token: token to lemmatise
Returns:
New lemmas in the analysis list
def omorfi.omorfi.Omorfi.load_acceptor(self, f)
Load acceptor model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_analyser(self, f)
Load analysis model from a file. Also sets up a basic tokeniser and
lemmatiser using the analyser.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_generator(self, f)
Load generation model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_guesser(self, f)
Load guesser model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_hyphenator(self, f)
Load hyphenator model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_labelsegmenter(self, f)
Load labelled segmentation model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_lemmatiser(self, f)
Load lemmatiser model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_segmenter(self, f)
Load segmentation model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_tokeniser(self, f)
Load tokeniser model from a file.
Args:
f: file containing a single HFST automaton binary.
def omorfi.omorfi.Omorfi.load_udpipe(self, str filename)
Load UDPipe model for statistical parsing.
UDPipe can be used as an extra information source for OOV symbols or
for all tokens. It works best with sentence-based analysis; token-based
analysis does not keep track of context.
Args:
filename: path to UDPipe model
def omorfi.omorfi.Omorfi.python_tokenise(self, str line)
Tokenise with python's basic string functions.
Args:
line: string to tokenise
def omorfi.omorfi.Omorfi.segment(self, Token token)
Segment token into morphs, words and other string pieces.
Side-effect:
This operation stores segments in the token for future use and only
returns the HFST structures. To get pythonic data, use Token's methods
afterwards.
Args:
token: token to segment
Returns:
New segmentations in the analysis list
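A minimal sketch of post-processing segmenter output into a list of morphs. The boundary marker names used here (`{MB}`, `{WB}`, `{hyph?}`) are assumptions for illustration; check actual segmenter output for the real marker inventory:

```python
import re

def split_segments(segmented, markers=("{MB}", "{WB}", "{hyph?}")):
    """Split a segmenter output string on boundary markers.

    `markers` is an assumed inventory, not omorfi's authoritative one;
    empty pieces between adjacent markers are dropped.
    """
    pattern = "|".join(re.escape(m) for m in markers)
    return [part for part in re.split(pattern, segmented) if part]
```

For example, a string segmented as "kissa{MB}lle" would split into the stem and the case ending.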
def omorfi.omorfi.Omorfi.tokenise(self, str line)
Perform tokenisation with the loaded tokeniser, if any, or `split()`.
If a tokeniser is available, it is applied to the input line and, if a
result is achieved, the line is split into tokens according to the
tokenisation strategy and returned as a list.
If no tokeniser is present, or none gives results, the line is
tokenised using Python's basic string functions. If an analyser is
present, the tokeniser will try harder to get some analyses for each
token, using a hard-coded list of extra splits.
Args:
line: a string to tokenise; should contain a line of text or a
sentence
Returns:
A list of tokens based on the line. The list may include boundary
non-tokens if e.g. sentence boundaries are recognised. For an empty
line a paragraph-break non-token may be returned.
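The plain-Python fallback path can be approximated as below. This is a rough illustrative stand-in, not omorfi's actual fallback code, and it returns plain strings rather than Token objects:

```python
import re

def naive_tokenise(line):
    """Whitespace-split a line, detaching punctuation as separate
    tokens: an approximation of a basic string-function tokeniser."""
    tokens = []
    for chunk in line.split():
        # runs of word characters stay together; each punctuation
        # mark becomes its own token
        tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

For instance, `naive_tokenise("Kissa istui, ja katsoi.")` detaches the comma and the full stop into their own tokens.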
def omorfi.omorfi.Omorfi.tokenise_conllu(self, f)
Tokenise a CONLL-U sentence or comment.
Should be used with a file-like iterable that has a CONLL-U sentence,
comment or empty block coming up.
Args:
f: file-like object with iterable strings
Returns:
list of tokens
def omorfi.omorfi.Omorfi.tokenise_plaintext(self, f)
Tokenise a whole text.
Args:
f: file-like object with iterable strings
Returns:
list of tokens
def omorfi.omorfi.Omorfi.tokenise_sentence(self, str sentence)
Tokenise a sentence.
To be used when the text is already sentence-split. If the text is
plain text with sentence boundaries within lines, use tokenise(self,
line) instead.
Args:
sentence: a string containing one sentence
Returns:
list of tokens in the sentence
def omorfi.omorfi.Omorfi.tokenise_vislcg(self, f)
Tokenise a sentence from VISL-CG format data.
Returns a list of tokens when it hits the first non-token block,
including a token representing this non-token block.
Args:
f: file-like object to iterate strings of VISL-CG data
Returns:
list of tokens