Intermediate representation in rule-based machine translation for the Uralic languages

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by-nd/4.0/. Original publication in the proceedings of the second IWCLUL, held in Szeged, 2016.

Francis M. Tyers
HSL-fakultehta
UiT Norgga árktalaš universitehta
francis.tyers@uit.no

Tommi A. Pirinen
ADAPT Centre
School of Computing,
Dublin City University
tommi.pirinen@computing.dcu.ie

Abstract

This paper presents some of the major obstacles and challenges in creating machine translation systems between Uralic languages where the intermediate representation is based on morphology and syntax. The Uralic languages are very alike in many ways: similar case inventories, word order and non-finite clause forms. However, current rule-based grammatical resources take many different approaches to encoding this information. These approaches are sometimes based on legacy or traditional grammatical description, which is important for making the tools comfortable for linguists, but sometimes on arbitrary and incompatible decisions. This paper presents an overview of some of the issues in working with existing tools and representations and provides some guidelines and suggestions to facilitate future work.

1 Introduction

Creating rule-based machine translation (RBMT) systems is a process where one creates a mapping between units of the source language and the target language. The units differ depending on the approach to the problem, on a scale from translating word-forms to word-forms to translating via an intermediate abstract universal language, or interlingua. In this article we study the approach of using just morphological analysis with the Uralic languages. The problem with such a system is that, even when the morphologies of the closely related Uralic languages are expected to match, there are often engineering issues that make the work more tedious and cumbersome than necessary. Minimising the amount of simple engineering work is vital for making rule-based machine translation attractive to linguists and programmers alike.

The rest of the article is structured as follows: first we describe the background of the problem in Section 2, then we introduce the resources we use in Section 3. We discuss strategies for dealing with tagset mismatches in Section 4 and specific linguistic issues in Section 5, and suggest some common best practices in Section 6. In Section 7 we briefly describe universal parts of speech and morphological features, and finally in Section 8 we provide some short concluding remarks.

2 Background

RBMT is a popular way of developing high-quality machine translation between related languages [1]. An RBMT system for related languages can be built rapidly, as has been done for, e.g., Dutch and Afrikaans [5]. Wide-coverage machine translation requires wide-coverage lexical resources for the languages. Developing an analyser to a stage where it is usable by multiple applications, including RBMT, can take years, so it is often a good idea to use readily available resources instead of writing a new analyser from scratch. However, the majority of existing analysers are made with language-dependent annotation systems, which unnecessarily complicate the description of machine translation. It should be clear that if two related languages use the same morphological and syntactic structures to describe a phenomenon, a rule mapping between the two should be entirely trivial. This is not the case with most off-the-shelf analysers for contemporary Uralic morphologies. Table 1 shows an example of the morphological annotation of five Uralic languages for a simple five-word sentence.

Language	Word	Tags
North Sámi	James	+N+Prop+Sem/Mal+Sg+Nom
	ja	+CC
	Mary	+N+Prop+Sem/Fem+Sg+Nom
	leaba	+V+IV+Ind+Prs+Du3
	gárdimis	+N+Sg+Loc
	.	+CLB
Erzya	Джеймс	+N+Prop+Sem/Mal+Sg+Nom+Indef
	марто	марто+Po+COM
	Мария	+N+Prop+Sem/Fem+Pl+Nom+Indef
	садпиресэть	+N+SP+Ine+Indef+Der/Pr+V+Ind+Prs+ScPl3
	.	+CLB
Finnish	James	N Prop Nom Sg
	ja	Part
	Mary	N Prop Nom Sg
	ovat	V Prs Act Pl3
	puutarhassa	N Ine Sg
	.	Punct
Estonian	James	+H+sg+nom
	ja	+J
	Mary	+H+sg+nom
	on	+V+indic+pres+ps3+pl+ps+af
	aias	+S+sg+in
	.	.
Hungarian	James	/NOUN
	és	/CONJ
	Mary	/NOUN
	a	/ART
	kértben	/ADJ<CAS<INE>>
	vannak	/VERB<PLUR>
	.	/PUNCT

Table 1: Translations of the sentence ‘James and Mary are in the garden.’ in several Uralic languages (North Sámi, Erzya, Finnish, Estonian, Hungarian) with the tag strings used in their morphological analysers. There are examples of real morphosyntactic differences (compare the third-person dual in North Sámi with the third-person plural in the other languages) and arbitrary tag differences (compare the tag that the word for ‘and’ receives in the different languages).

2.1 Intermediate representations

Figure 1: The Vauquois triangle which illustrates the amount of transfer needed for different levels of intermediate representation.

In machine translation, an intermediate representation is an abstraction away from the surface forms of the language. Figure 1 shows the Vauquois triangle, a common illustration of different levels of intermediate representation.

At the bottom of the triangle, there is no intermediate representation and translation is performed on a word-for-word basis. At the apex of the triangle is interlingual translation, where the source language is first mapped to a language-independent semantic representation, and this representation is then used to generate the target language.

In the middle is (morpho-)syntactic transfer. Here the source language is analysed to a language-dependent intermediate representation (usually based on a combination of syntactic structure and morphosyntactic features) and then transfer rules are applied to convert the source language intermediate representation to one compatible with the target-language generation component.
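As a rough illustration of transfer at this level, the following Python sketch applies one lexical substitution and one structural rule to a single analysis. It is not taken from any of the systems discussed here: the `Reading` type, the one-entry bilingual lexicon (North Sámi leat to Finnish olla) and the dual-to-plural rule are assumptions made for the example, and the remaining tags are left in the source tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    lemma: str
    tags: tuple


def parse(analysis):
    """Split an analysis string such as 'leat+V+IV+Ind+Prs+Du3' into lemma and tags."""
    lemma, *tags = analysis.split("+")
    return Reading(lemma, tuple(tags))


# Structural transfer rule (illustrative): the target language is assumed
# to have no dual, so dual agreement tags are rewritten as plural.
DUAL_TO_PLURAL = {"Du1": "Pl1", "Du2": "Pl2", "Du3": "Pl3"}


def transfer(reading, bilingual_lexicon):
    """Apply lexical transfer to the lemma and the dual rule to the tags."""
    lemma = bilingual_lexicon.get(reading.lemma, reading.lemma)
    tags = tuple(DUAL_TO_PLURAL.get(tag, tag) for tag in reading.tags)
    return Reading(lemma, tags)


source = parse("leat+V+IV+Ind+Prs+Du3")
target = transfer(source, {"leat": "olla"})  # hypothetical one-entry lexicon
print("+".join((target.lemma,) + target.tags))  # olla+V+IV+Ind+Prs+Pl3
```

The output still carries tags such as IV that the target-language generator may not know, which is exactly the kind of mismatch discussed in the rest of the paper.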

3 Resources

In this paper we make use of five sets of linguistic data for five different Uralic languages: Finnish, North Sámi, Erzya, Estonian and Hungarian. We take the North Sámi and Erzya data from the Giellatekno language technology repository (http://giellatekno.uit.no). The North Sámi data has primarily been developed by the Divvun and Giellatekno groups at UiT Norgga árktalaš universitehta, and the Erzya data has been developed by Jack Rueter at Helsingin yliopisto [9]. For the Estonian data, we use the plamk analyser (https://github.com/jjpp/plamk) written by Jaak Pruulmann-Vengerfeldt; for Finnish, omorfi [6] (https://github.com/flammie/omorfi); and for Hungarian, hunmorph [10] (http://mokk.bme.hu/resources/hunmorph/).

4 Strategies

There are different ways to fix systematic mismatches. We evaluate the following:

4.1 Relabelling

An obvious approach to getting around the problem of divergent tagsets is to simply perform relabelling. This is where you replace the canonical tags in one language with their equivalents in the other language, or with a common equivalent in both languages.

+CC ⇄ +J+Coord

However, this solution has its disadvantages. Even though +J and +CC are both used for conjunctions, the plamk tag is also used with subordinating and other conjunctions, while the Giellatekno tag excludes those. Relabelling +J+Coord to +CC and any other +J to +CS might work in the analyser, but will not work in a disambiguation rule saying “select the noun reading if the word to the right is tagged +J”; here we need to relabel +J to (+CS or +CC). In the opposite direction, +CS would need to be relabelled to (+J but not +Coord). The distinction between these may be irrelevant for the translation process (in all cases, ja in North Sámi will be translated to ja in Estonian), but for the intervening grammatical tools, making the distinction (or not) may be vital.
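The asymmetry described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: tags are plain strings in a list, the function names are invented for the example, and every non-coordinating +J is mapped to +CS as in the text, which is itself a simplification.

```python
def plamk_to_giellatekno(tags):
    """Relabel analyser output: +J+Coord becomes +CC, any other +J becomes +CS."""
    if "J" not in tags:
        return list(tags)
    kept = [t for t in tags if t not in ("J", "Coord")]
    return kept + ["CC" if "Coord" in tags else "CS"]


def giellatekno_to_plamk(tags):
    """Relabel in the opposite direction: +CC becomes +J+Coord, +CS becomes +J."""
    out = []
    for t in tags:
        if t == "CC":
            out.extend(["J", "Coord"])
        elif t == "CS":
            out.append("J")
        else:
            out.append(t)
    return out


# Rule conditions cannot be relabelled tag by tag. A plamk condition
# "tagged +J" corresponds to a disjunction on the Giellatekno side ...
def is_plamk_j(giellatekno_tags):
    return "CC" in giellatekno_tags or "CS" in giellatekno_tags


# ... and a Giellatekno condition "tagged +CS" corresponds to a
# conjunction with a negation on the plamk side.
def is_giellatekno_cs(plamk_tags):
    return "J" in plamk_tags and "Coord" not in plamk_tags


print(plamk_to_giellatekno(["J", "Coord"]))  # ['CC']
print(giellatekno_to_plamk(["CS"]))          # ['J']
print(is_plamk_j(["CC"]), is_giellatekno_cs(["J", "Coord"]))  # True False
```

Relabelling the analyser output is a simple substitution, whereas each rule condition has to be rewritten as a small logical expression, so the two cannot share a single mapping table.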

4.2 Interlingua

Another potential solution is to use a semantic interlingua (see the description in Section 2.1). This is the approach adopted by the machine translation system based on Grammatical Framework [8] (http://grammaticalframework.org). In this framework there is no direct transfer of morphological features.

5 Specific linguistic issues

There are a number of linguistic issues in RBMT. We cover the following in detail:

5.1 Copula

There are two main copula constructions in the Uralic languages. The first functions more or less as in the Germanic languages: the copula is a normal verb that agrees with the subject. The second works as in the Turkic languages: the copula does not typically surface in the third-person singular present tense. In our examples, North Sámi, Finnish and Estonian are of the Germanic type, while Hungarian and Erzya are of the Turkic type.

	‘She is a student.’	‘She was a student.’
North Sámi	Son lea studeanta.	Son lei studeanta.
Erzya	Сон студент.	Сон студентель.
Finnish	Hän on opiskelija.	Hän oli opiskelija.
Estonian	Ta on üliõpilane.	Ta oli üliõpilane.
Hungarian	Ő hallgató.	Ő hallgató volt.

In North Sámi, Finnish and Estonian, the treatment of lea and on is similar: each is a verb which inflects and agrees like other verbs.

There are divergences when we look at the Erzya and Hungarian examples. Although the two languages have the same structure (zero copula in the present tense and an overt copula in the past tense), their analysers treat it differently. The morphological analyser for Erzya treats the copula as a derivation:

студент+N+Sg+Nom+Indef+Der/Pr+V+Ind+Prs+ScSg3

In Hungarian, by contrast, the copula is simply omitted in the present (if it surfaced it would be van), and in the past it is considered a verb form.
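To make the consequence for transfer concrete, here is a minimal Python sketch of a rule for translating from a language with a Germanic-type copula into one with a zero copula. The reading format, the tag names Pron, Pst and Sg3, and the function name are assumptions for illustration only; the sketch does not check that the predicate is nominal and ignores word order.

```python
def drop_zero_copula(readings, copula_lemma="olla"):
    """Remove third-person singular present copula readings.

    `readings` is a list of (lemma, tags) pairs for one sentence.
    """
    kept = []
    for lemma, tags in readings:
        zero = lemma == copula_lemma and {"V", "Prs", "Sg3"} <= set(tags)
        if not zero:
            kept.append((lemma, tags))
    return kept


# 'She is a student.' and 'She was a student.' as simplified Finnish analyses.
present = [
    ("hän", ["Pron", "Sg", "Nom"]),
    ("olla", ["V", "Prs", "Act", "Sg3"]),
    ("opiskelija", ["N", "Nom", "Sg"]),
]
past = [
    ("hän", ["Pron", "Sg", "Nom"]),
    ("olla", ["V", "Pst", "Act", "Sg3"]),
    ("opiskelija", ["N", "Nom", "Sg"]),
]

print([lemma for lemma, _ in drop_zero_copula(present)])  # ['hän', 'opiskelija']
print([lemma for lemma, _ in drop_zero_copula(past)])     # ['hän', 'olla', 'opiskelija']
```

For a target analyser that encodes the copula as a derivation on the predicate noun, as the Erzya one does, deleting the verb would not be enough; the agreement tags would have to be moved onto the noun reading instead.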

5.2 Non-finite verb forms

Non-finite verb forms are infinitives and participles on the one hand and derivations on the other. Their number differs between languages, and their functions vary from being syntactic arguments of constructions to derived words; a wide range of analyses is used to accommodate that. Table 2 shows some of the differences.

Language	Sentence	Non-finite tag
	‘I see the man who is running’
North SÃ¡mi	Oidnen dievddu viehkame	Actio+Ess
Erzya	Неян цёранть, конась чийни.	Der/Ы+ActPrcShort+A
Finnish	Näen miehen juoksemassa.	InfMA+Ine
Estonian	Näen meest, kes jookseb.	—
Hungarian	Látom a futó embert.	/VERB[IMPERF_PART]/ADJ
	‘While running I saw the man’
North SÃ¡mi	Oidnen dievddu viegadettiinan.	Ger+Px1Sg
Erzya	Неян чийниця цёранть.	Der/Ыця+ActDemPrc+A
Finnish	Näin miehen juostessani.	InfE+Ine+PxSg1
Estonian	Jooksmise ajal nägin ma meest.	Der/mine+Gen
Hungarian	Futás közben láttam az embert.	/VERB[GERUND]/NOUN
	‘I see the running man.’
North Sámi	Oainnán viehkki dievddu.	PrsPrc
Erzya	Ð§Ð¸Ð¹Ð½ÐµÐ¼Ð°ÑÑ ÑÐµÐ´ÐµÐ½Ñ ÐºÐµÑÑÐ²ÑÑ.	Der/ÐÐ¼Ð+Nom
Finnish	Näen juoksevan miehen.	PrsPrc
Estonian	Näen jooksvat meest.	Der/v+A+Nom
Hungarian	Látom a futó embert.	/VERB[IMPERF_PART]/ADJ
	‘Running is fun.’
North Sámi	Viehkan lea suohtas.	Actio+Nom
Erzya	ÐÐµÐ»ÐµÐ·ÑÐ½Ñ ÑÑÐºÑÐ½Ñ ÑÐ¸Ð¹Ð½ÐµÐ¼Ð°ÑÑ.	Der/ÐÐ¼Ð+Nom
Finnish	Juokseminen on kivaa.	Der/minen+Nom
Estonian	Jooksmine on lahe.	Der/mine+Nom
Hungarian	A futás jó dolog.	/VERB[GERUND]/NOUN
	‘I like running.’
North Sámi	Liikon viehkat.	Inf
Erzya	Ð§Ð¸Ð¹Ð½ÐµÐ¼ÑÑÑ Ð½ÐµÐ¸Ñ ÑÑÑÐ°Ð½ÑÑ.	Inf+Ela
Finnish	Pidän juoksemisesta.	Der/minen+Ela
Estonian	Mulle meeldib joosta.	Inf
Hungarian	Szeretem futni.	/VERB<INF>

Table 2: Examples of the use and tagging of non-finite verb forms in the languages in our sample. It is not to be expected that the tags are completely equivalent, but for example, given the similarity in structure, should there be a difference in annotation between Finnish PrsPrc and Estonian Der/v+A?

5.3 Derivation, compounding and lexicalisation

A classical problem in computational morphologies lies in the question of lexicalisation and productivity of certain processes: is a morphologically created word-form a new word, or a form of a possibly distant root? Morphologies take widely different and opposing approaches to this, ranging from lexicalise-everything to collect-everything. See the examples below:

	‘to drink’	‘a drink’	‘drinker’	‘brewery’
North Sámi	juhkat	juhkamuš	—	vuolla·buvttadeaddji
Erzya	симемс	симема-пель	симиця	пиянь завод
Finnish	juoda	juo-ma	juo—ja	olut·tehdas
Estonian	jooma	joo—gi	joo—	õlle·tehas
Hungarian	iszik	ital	iv—ó	sör·főzde

The symbols ‘·’, ‘-’ and ‘—’ stand for compounding, inflection and derivation, respectively.

5.4 Pronouns and determiners

The distinction between pronoun and determiner is not widely made in traditional grammars of most Uralic languages. Words which may be considered both pronouns and determiners are lumped into a single morphosyntactic class (usually pronoun). Consider the following examples involving the word ‘this’:

	‘I see this house.’	‘I see this.’
North Sámi	Oainnán dán viesu.	Oainnán dán.
Erzya	Неян те_det кудонть.	Неян тень_pron.
Finnish	Mä näen tämän_pron talon.	Mä näen tämän_pron.
Estonian	Ma näen selle_pron maja.	Ma näen selle_pron.
Hungarian	Nézem ezt_det/noun a_art házat.	Nézem azt_det/noun

In traditional grammars of North Sámi, Finnish and Estonian both the pronominal and the modifier analyses of ‘this’ are classified as pronouns. In Hungarian and Erzya, a distinction is made, with Hungarian making a pronoun/determiner distinction and Erzya making a distinction between quantifier (determiner) and nominalised quantifier.

If we consider a standard definition of pronoun to be ‘that which stands in place (pro-) of a noun phrase (-noun)’ then we can see that in the above, only the tools for Erzya follow this. The other languages leave the distinction to tools later in the pipeline.

5.5 Non-inflecting words

All languages in the Uralic family have a wide variety of non-inflecting word forms. Depending on the grammatical tradition followed by the language resource, these may be simply lumped into a single class, or they may have extensive syntactic or semantic subcategorisation. Table 3 gives a number of examples of non-inflecting words and the equivalent morphological analyses they receive in each of the languages we are studying. To a machine translation practitioner, these distinctions are largely superfluous: ja in North Sámi will be translated as ja in Finnish and ja in Estonian. However, the distinctions may be vital for the intervening disambiguation tools, and as such need to be taken into account.

	North Sámi	Erzya	Finnish	Estonian	Hungarian
and	ja+CC	марто+Po+COM	ja Part	ja+J	és /CONJ
very	hui+Adv	пек+Adv+AdA	tosi Part	väga+Adv	nagyon /ADV
under	vuolde+Po	алов+Po+Lat	alle Part	alla+K	alatt /POSTP
now	dál+Adv	ней+Adv+Temp	nyt Part	praegu+Adv	most /ADV
hello	bures+Interj	шумбрачи+Interj+Formulaic	moi Part	tere+I	szia /UTT-INT

Table 3: Some examples of non-inflecting words with divergent morphological and syntactic annotation. In terms of morphology, the transfer of these tags may be a simple one-to-one substitution. However the syntactic environments may vary substantially.
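The one-to-one substitution mentioned in the caption, and the point where it breaks down, can be sketched as a lookup table in Python. The language codes, the common category names and the function name are assumptions made for this example; Erzya is left out, and treating every Estonian +J as a conjunction of the same kind is a simplification.

```python
# (language code, tag) -> common category. The codes and the category
# names are assumptions made for this sketch, not part of any analyser.
TAG_TO_COMMON = {
    ("sme", "CC"): "CONJ", ("sme", "Adv"): "ADV",
    ("sme", "Po"): "ADP", ("sme", "Interj"): "INTJ",
    ("est", "J"): "CONJ", ("est", "Adv"): "ADV",
    ("est", "K"): "ADP", ("est", "I"): "INTJ",
    ("hun", "CONJ"): "CONJ", ("hun", "ADV"): "ADV",
    ("hun", "POSTP"): "ADP", ("hun", "UTT-INT"): "INTJ",
}

# In Table 3 the Finnish analyser tags all five words as Part, so the tag
# alone is not enough and the category has to come from a word list.
FINNISH_PARTICLES = {
    "ja": "CONJ", "tosi": "ADV", "alle": "ADP", "nyt": "ADV", "moi": "INTJ",
}


def common_category(lang, lemma, tag):
    """Return a common category for a non-inflecting word, or None if unknown."""
    if lang == "fin" and tag == "Part":
        return FINNISH_PARTICLES.get(lemma)
    return TAG_TO_COMMON.get((lang, tag))


print(common_category("sme", "vuolde", "Po"))  # ADP
print(common_category("est", "alla", "K"))     # ADP
print(common_category("fin", "alle", "Part"))  # ADP
```

For three of the languages a tag-level table suffices, while the Finnish entries need lexical information, which is the kind of extra engineering work the paper argues could be avoided with consistent annotation.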

6 Guidelines

6.1 Separation of lexicon and morphotactics

One of the main components of any rule-based system for morphologically-complex languages is a lexicon consisting of stems and inflectional/derivation categories. In some cases, such as for Finnish, these are partly provided by a state institution, such as a language board. In other cases they are the product of many years of work.

Although categorising stems for inclusion in a morphological lexicon (many contain over 100,000 entries) can take a substantial amount of work, even if done semi-automatically, implementing the morphotactics (that is, the rules covering inflection, derivation and compounding) may take substantially less time.

6.2 Maximise parallelism

In line with the Universal Dependencies project (see 7), we propose the adoption of a principle of maximum parallelism. In short “things that are the same should be tagged the same”. We do not propose that this should mean that all distinctions should be made in all languages. For example, those Uralic languages without object conjugation should not be required to adopt the agreement tags of those that have it. But it should be possible to come up with principled and consistent guidelines for closed categories.

7 Universal dependencies

Universal Dependencies is a large multi-language project [3] aiming at a common tagset for parts of speech, morphosyntactic features and dependency relations. We do not propose adopting the exact tagset of the Universal Dependencies project. Most projects working on Uralic languages have been ongoing for many years and the tools that they create are used for more than just machine translation. What we find more important is to adopt, or make available, tools based on a consistent theoretical background and consistent morphosyntactic description. This could form the basis of a kind of universal morphosyntactic interlingua for the Uralic languages. These tools do not have to replace the current tools, and may be automatically generated from them, but they must be consistent. A systematic mapping needs to be considered during development. The national Uralic languages have specifications for Universal Dependencies [7, 4, 11], but these specifications differ in unnecessary ways. For example, consider the annotation of ‘that house’ in the two treebanks for Finnish, Turku Dependency Treebank (TDT) and FinnTreeBank (FTB), and in Hungarian:

	this		house
Finnish (TDT)	tämä_PRON		talo_NOUN
Finnish (FTB)	tämä_DET		talo_NOUN
Hungarian	az_PRON	a_ART	ház_NOUN

8 Concluding remarks

Rule-based machine translation provides a fascinating basis for exploring real linguistic differences between the Uralic languages. However, as we have shown, in current state-of-the-art tools, real linguistic differences are hidden behind a combination of incompatible tagsets and idiosyncratic traditional grammatical norms. We do not propose that North Sámi adopt the Finnish norms or Hungarian the Erzya norms. Instead, we propose developing a common morphological annotation scheme for the Uralic languages based on the guidelines of the Universal Dependencies project. It is not our aim for this to supersede national standards, but to provide a common bridge between them to facilitate cross-linguistic study and functional rule-based machine translation.

Acknowledgements

Heiki-Jaan Kaalep, Jack Rueter and László Tihany, as well as the anonymous reviewers, have all contributed to the language examples; the remaining mistakes are ours.

Appendix A Example of Universal dependencies for Uralic languages

An example is shown in Table 4.

Language	Word	POS	Features
North Sámi	James	PROPN	Number=Sing|Case=Nom
	ja	CONJ	
	Mary	PROPN	Number=Sing|Case=Nom
	leaba	VERB	Mood=Ind|Tense=Pres|Person=3|Number=Dual
	gárdimis	NOUN	Number=Sing|Case=Loc
	.	PUNCT	
Erzya	Джеймс	PROPN	Number=Sing|Case=Nom|Definite=Ind
	марто	CONJ	
	Мария	PROPN	Number=Plur|Case=Nom|Definite=Ind
	садпиресэ-	NOUN	Case=Ine|Definite=Ind
	-ть	VERB	Mood=Ind|Tense=Pres|Pers[subj]=3|Number[subj]=Plur
	.	PUNCT	
Finnish	James	PROPN	Number=Sing|Case=Nom
	ja	CONJ	
	Mary	PROPN	Number=Sing|Case=Nom
	ovat	VERB	Mood=Ind|Tense=Pres|Person=3|Number=Plur
	puutarhassa	NOUN	Number=Sing|Case=Ine
	.	PUNCT	
Estonian	James	PROPN	Number=Sing|Case=Nom
	ja	CONJ	
	Mary	PROPN	Number=Sing|Case=Nom
	on	VERB	Mood=Ind|Tense=Pres|Person=3|Number=Plur
	aias	NOUN	Number=Sing|Case=Ine
	.	PUNCT	
Hungarian	James	PROPN	Number=Sing|Case=Nom
	és	CONJ	
	Mary	PROPN	Number=Sing|Case=Nom
	kértben	NOUN	Number=Sing|Case=Ine
	.	PUNCT	

Table 4: An example of applying universal part-of-speech tags and morphological features to the Uralic languages. Note how the massive differences in annotation are reduced to only the linguistically relevant compared to Table 1.
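Section 7 notes that consistent tools may be generated automatically from the existing ones. As a minimal sketch of what such a generated layer could look like, the Python code below converts the North Sámi tag strings of Table 1 into the universal tags and features of Table 4. The mapping tables cover only the tags in that one sentence and are assumptions for illustration; tags with no entry, such as IV or Sem/Mal, are silently dropped, and features are joined with the vertical bar used by Universal Dependencies.

```python
# Illustrative mappings covering only the tags of the example sentence.
POS = {"N": "NOUN", "V": "VERB", "CC": "CONJ", "CLB": "PUNCT"}
FEATS = {
    "Sg": ["Number=Sing"],
    "Nom": ["Case=Nom"],
    "Loc": ["Case=Loc"],
    "Ind": ["Mood=Ind"],
    "Prs": ["Tense=Pres"],
    "Du3": ["Person=3", "Number=Dual"],
}


def to_universal(tag_string):
    """Convert a tag string such as '+N+Sg+Loc' to (POS, features)."""
    tags = tag_string.strip("+").split("+")
    pos = "PROPN" if tags[:2] == ["N", "Prop"] else POS[tags[0]]
    feats = [f for tag in tags[1:] for f in FEATS.get(tag, [])]
    return pos, "|".join(feats)


sentence = [
    ("James", "+N+Prop+Sem/Mal+Sg+Nom"),
    ("ja", "+CC"),
    ("Mary", "+N+Prop+Sem/Fem+Sg+Nom"),
    ("leaba", "+V+IV+Ind+Prs+Du3"),
    ("gárdimis", "+N+Sg+Loc"),
    (".", "+CLB"),
]
for word, tag_string in sentence:
    pos, feats = to_universal(tag_string)
    print(word, pos, feats, sep="\t")
```

A real conversion would need one such pair of tables per analyser, plus rules for cases where a single source tag corresponds to different universal categories depending on the word, as with the Finnish Part tag.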

References

[1] M. L. Forcada, M. G. Rosell, J. Nordfalk, J. O’Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, G. Ramírez-Sánchez, F. Sánchez-Martínez and F. M. Tyers (2010) Apertium: a free/open-source platform for rule-based machine translation. Machine Translation.
[2] M. L. Forcada, M. G. Rosell, J. Nordfalk, J. O’Regan, S. Ortiz-Rojas, J. A. Pérez-Ortiz, G. Ramírez-Sánchez, F. Sánchez-Martínez and F. M. Tyers (2010) Apertium: a free/open-source platform for rule-based machine translation. Machine Translation.
[3] R. T. McDonald, J. Nivre, Y. Quirmbach-Brundage, Y. Goldberg, D. Das, K. Ganchev, K. B. Hall, Petrov, H. Zhang and O. Täckström (2013) Universal dependency annotation for multilingual parsing. In ACL (2), pp. 92–97.
[4] K. Muischnek, K. Müürisep, T. Puolakainen, E. Aedmaa, R. Kirt and D. Särg (2014) Estonian dependency treebank and its annotation scheme. In Proceedings of 13th Workshop on Treebanks and Linguistic Theories (TLT13), pp. 285–291.
[5] P. Otte and F. M. Tyers (2011) Rapid rule-based machine translation between Dutch and Afrikaans. In Proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium, pp. 153–160.
[6] T. A. Pirinen (2015) Omorfi: free and open source morphological lexical database for Finnish. In Nordic Conference of Computational Linguistics NODALIDA 2015, pp. 313.
[7] S. Pyysalo, J. Kanerva, A. Missilä, V. Laippala and F. Ginter (2015) Universal dependencies for Finnish. In Nordic Conference of Computational Linguistics NODALIDA 2015, pp. 163.
[8] A. Ranta (2011) Grammatical framework: programming with multilingual grammars. CSLI Publications, Center for the Study of Language and Information.
[9] J. Rueter (2010) Adnominal person in the morphological system of Erzya. Ph.D. Thesis, Helsingin yliopisto.
[10] V. Trón, A. Kornai, G. Gyepesi, L. Németh, P. Halácsy and D. (2005) Hunmorph: open source word analysis. In Proceedings of the Workshop on Software, pp. 77–85.
[11] V. Vincze, D. Szauter, A. Almási, G. Móra, Z. Alexin and J. Csirik (2010) Hungarian dependency treebank. In LREC.