Intermediate representation in rule-based machine translation for the Uralic languages
This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by-nd/4.0/. Original publication in the proceedings of the second IWCLUL, held in Szeged, 2016.
Abstract
This paper presents some of the major obstacles and challenges in creating machine translation systems between Uralic languages where the intermediate representation is based on morphology and syntax. The Uralic languages are very alike in many ways: they have similar case inventories, word order and non-finite clause forms. However, current rule-based grammatical resources take many different approaches to encoding this information. These approaches are sometimes based on legacy or traditional grammatical description, which is important for making the tools comfortable for linguists, but sometimes on arbitrary and incompatible decisions. This paper presents an overview of some of the issues in working with existing tools and representations and provides some guidelines and suggestions to facilitate future work.
1 Introduction
Creating rule-based machine translation (RBMT) systems is a process in which one creates a mapping between units of the source language and the target language. The units differ depending on the approach, on a scale from translating word-forms to word-forms up to translating via an intermediate abstract universal language, or interlingua. In this article we study the approach of using just morphological analysis with the Uralic languages. The problem with such a system is that, even when the morphologies of the closely related Uralic languages are expected to match, there are often engineering issues that make the work more tedious and cumbersome than necessary. Minimising the amount of simple engineering work is vital for making rule-based machine translation attractive to linguists and programmers alike.
The rest of the article is structured as follows: Section 2 describes the background of the problem, Section 3 introduces the resources we use, Section 4 evaluates strategies for fixing systematic mismatches, Section 5 covers specific linguistic issues, Section 6 suggests some common best practices, Section 7 briefly describes universal parts of speech and morphological features, and Section 8 provides some short concluding remarks.
2 Background
RBMT is a popular way of developing high-quality machine translation between related languages [1]. Building an RBMT system rapidly for related languages is possible, as has been done for, e.g., Dutch and Afrikaans [5]. Wide-coverage machine translation requires wide-coverage lexical resources for the languages. Developing an analyser to a stage where it is usable by multiple applications, including RBMT, can take years, so it is often a good idea to use readily available resources instead of writing a new analyser from scratch. However, the majority of existing analysers are made with language-dependent annotation systems, which unnecessarily complicate the description of machine translation. It should be clear that if two related languages use the same morphological and syntactic structures to describe a phenomenon, a rule mapping between the two should be entirely trivial. This is not the case with most off-the-shelf analysers for contemporary Uralic morphologies. Table 1 shows an example of the morphological annotation of five Uralic languages for a simple five-word sentence.
2.1 Intermediate representations
In machine translation, an intermediate representation is an abstraction away from the surface forms of the language. Figure 1 shows the Vauquois triangle, a common illustration of different levels of intermediate representation.
At the bottom of the triangle, there is no intermediate representation and translation is performed on a word-for-word basis. At the apex of the triangle is interlingual translation, where the source language is first mapped to a language-independent semantic representation, and this representation is then used to generate the target language.
In the middle is (morpho-)syntactic transfer. Here the source language is analysed to a language-dependent intermediate representation (usually based on a combination of syntactic structure and morphosyntactic features) and then transfer rules are applied to convert the source language intermediate representation to one compatible with the target-language generation component.
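As a minimal sketch of the middle level of the triangle, the three stages (analysis, transfer, generation) can be written as three lookups. The lexicon entries, the tag names `N`, `Sg`, `Ine` and the function name `translate` are illustrative assumptions, not the output of any of the analysers discussed in this paper.

```python
# Toy morpho-syntactic transfer: analyse -> transfer -> generate.
# Every table below is a hand-written assumption for illustration only.

# Source-language analysis: surface form -> (lemma, tags). Finnish 'in the house'.
ANALYSES = {"talossa": ("talo", ("N", "Sg", "Ine"))}

# Lexical transfer: Finnish lemma -> Estonian lemma.
BILINGUAL = {"talo": "maja"}

# Structural transfer: trivial when both sides share a tagset.
TAG_TRANSFER = {"N": "N", "Sg": "Sg", "Ine": "Ine"}

# Target-language generation: (lemma, tags) -> surface form.
GENERATOR = {("maja", ("N", "Sg", "Ine")): "majas"}


def translate(word):
    """Translate one word-form through the intermediate representation."""
    lemma, tags = ANALYSES[word]
    target_lemma = BILINGUAL[lemma]
    target_tags = tuple(TAG_TRANSFER[tag] for tag in tags)
    return GENERATOR[(target_lemma, target_tags)]


print(translate("talossa"))
```

When the two tagsets agree, `TAG_TRANSFER` is the identity mapping; the rest of this paper is concerned with the cases where it is not.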
3 Resources
In this paper we make use of five sets of linguistic data for five different Uralic languages: Finnish, North Sámi, Erzya, Estonian and Hungarian. We take the North Sámi and Erzya data from the Giellatekno language technology repository (http://giellatekno.uit.no). The North Sámi data has primarily been developed by the Divvun and Giellatekno groups at UiT Norgga árktalaš universitehta, and the Erzya data has been developed by Jack Rueter at Helsingin yliopisto [9]. For the Estonian data we use the plamk analyser (https://github.com/jjpp/plamk) written by Jaak Pruulmann-Vengerfeldt; for Finnish, omorfi [6] (https://github.com/flammie/omorfi); and for Hungarian, hunmorph [10] (http://mokk.bme.hu/resources/hunmorph/).
4 Strategies
There are different ways to fix systematic mismatches. We evaluate the following:
4.1 Relabelling
An obvious approach to getting around the problem of divergent tagsets is to simply perform relabelling. This is where you replace the canonical tags in one language with their equivalents in the other language, or with a common equivalent in both languages.
+CC | <cnjcoo> | +J+Coord |
However, this solution has its disadvantages. Even though +J and +CC are both used for conjunctions, the plamk tag is also used with subordinating and other conjunctions, while the Giellatekno tag excludes those. Relabelling +J+Coord to +CC and any other +J to +CS might work in the analyser, but it will not work in a disambiguation rule saying “select the noun reading if the word to the right is tagged +J”; there we need to relabel +J to (+CS or +CC). In the opposite direction, +CS would need to be relabelled to (+J but not +Coord). The distinction between these may be irrelevant for the translation process (in all cases, ja in North Sámi will be translated to ja in Estonian), but for the intervening grammatical tools it may be vital to make (or not make) the distinction.
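The asymmetry described above can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, analyses are represented as plain lists of tag names without the leading `+`, and only the tags +J, +Coord, +CC and +CS come from the text.

```python
# Relabelling in an analysis is one-to-one; relabelling in a rule context
# may turn one tag into a set of tags. Function names are hypothetical.

def plamk_to_giellatekno(tags):
    """Relabel one analysis: J+Coord becomes CC, any other J becomes CS."""
    if "J" not in tags:
        return list(tags)
    target = "CC" if "Coord" in tags else "CS"
    return [target if tag == "J" else tag for tag in tags if tag != "Coord"]


def giellatekno_pattern_for(plamk_tag):
    """Relabel a tag that appears in a rule context.

    A context 'tagged J' has to match either CC or CS after relabelling.
    """
    return {"J": {"CC", "CS"}}.get(plamk_tag, {plamk_tag})


def matches(reading_tags, pattern):
    """True if any tag of the reading is allowed by the pattern."""
    return any(tag in pattern for tag in reading_tags)


print(plamk_to_giellatekno(["J", "Coord"]))
print(giellatekno_pattern_for("J"))
```

The point of the sketch is that a single relabelling table is not enough: the analyser output and the rule contexts need different mappings.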
4.2 Interlingua
Another potential solution is to use a semantic interlingua (see description in section 2.1). This is the approach adopted by the machine translation system based on Grammatical Framework [8] (http://grammaticalframework.org). In this framework there is no direct transfer of morphological features.
5 Specific linguistic issues
There are a number of linguistic issues in RBMT. We cover the following in detail:
5.1 Copula
There are two main copula constructions in the Uralic languages. The first functions more or less as in the Germanic languages: the copula is a normal verb that agrees with the subject. The second works as in the Turkic languages: the copula does not typically surface in the third-person singular present tense. In our examples, North Sámi, Finnish and Estonian are of the Germanic type, while Hungarian and Erzya are of the Turkic type.
Language | ‘She is a student.’ | ‘She was a student.’ |
---|---|---|
North Sámi | Son lea studeanta. | Son lei studeanta. |
Erzya | Сон студент. | Сон студентель. |
Finnish | Hän on opiskelija. | Hän oli opiskelija. |
Estonian | Ta on üliõpilane. | Ta oli üliõpilane. |
Hungarian | Ő hallgató. | Ő hallgató volt. |
In North Sámi, Finnish and Estonian, the treatment of lea and on is similar: each is a verb which inflects and agrees like other verbs.
The Erzya and Hungarian examples diverge. Although they have the same structure, with a zero copula in the present tense and an overt copula in the past tense, the analysers treat them differently. The morphological analyser for Erzya treats the copula as a derivation:
студент+N+Sg+Nom+Indef+Der/Pr+V+Ind+Prs+ScSg3
In Hungarian, by contrast, the copula is simply omitted in the present (if it surfaced it would be van), and in the past it is considered a verb form.
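One way to make the Erzya analysis comparable with the Germanic-type languages is to split it at the derivation tag into a nominal part and a copular part. The sketch below assumes that `Der/Pr` marks that boundary, as it does in the example above; the function name and the returned structure are hypothetical.

```python
# Split a derivational copula analysis into nominal and copular parts.
# Assumption: 'Der/Pr' marks the boundary, as in the Erzya example.

def split_predicative(analysis, marker="Der/Pr"):
    """Return the lemma, the nominal tags and the copula tags (or None)."""
    lemma, *tags = analysis.split("+")
    if marker not in tags:
        return {"lemma": lemma, "nominal": tags, "copula": None}
    boundary = tags.index(marker)
    return {
        "lemma": lemma,
        "nominal": tags[:boundary],
        "copula": tags[boundary + 1:],
    }


reading = split_predicative("студент+N+Sg+Nom+Indef+Der/Pr+V+Ind+Prs+ScSg3")
print(reading["nominal"], reading["copula"])
```

A transfer rule could then treat the `copula` part like the separate verb found in North Sámi, Finnish and Estonian, and a Hungarian generator could drop it in the third-person singular present.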
5.2 Non-finite verb forms
Non-finite verb forms are infinitives and participles on the one hand and derivations on the other. Their number differs between languages, and their functions range from syntactic arguments of constructions to derived words; a wide range of analyses is used to accommodate this. Table 2 shows some of the differences.
Language | Sentence | Non-finite tag |
---|---|---|
‘I see the man who is running’ | ||
North Sámi | Oidnen dievddu viehkame | Actio+Ess |
Erzya | ÐеÑн ÑÑÑанÑÑ, конаÑÑ Ñийни. | Der/Ы+ActPrcShort+A |
Finnish | Näen miehen juoksemassa. | InfMA+Ine |
Estonian | Näen meest, kes jookseb. | — |
Hungarian | Látom a futó embert. | /VERB[IMPERF_PART]/ADJ |
‘While running I saw the man’ | ||
North Sámi | Oidnen dievddu viegadettiinan. | Ger+Px1Sg |
Erzya | ÐеÑн ÑийниÑÑ ÑÑÑанÑÑ. | Der/ЫÑÑ+ActDemPrc+A |
Finnish | Näin miehen juostessani. | InfE+Ine+PxSg1 |
Estonian | Jooksmise ajal nägin ma meest. | Der/mine+Gen |
Hungarian | Futás közben láttam az embert. | /VERB[GERUND]/NOUN |
‘I see the running man.’ | ||
North Sámi | Oainnán viehkki dievddu. | PrsPrc |
Erzya | ЧийнемаÑÑ ÑÐµÐ´ÐµÐ½Ñ ÐºÐµÑÑвÑÑ. | Der/ÐмÐ+Nom |
Finnish | Näen juoksevan miehen. | PrsPrc |
Estonian | Näen jooksvat meest. | Der/v+A+Nom |
Hungarian | Látom a futó embert. | /VERB[IMPERF_PART]/ADJ |
‘Running is fun.’ | ||
North Sámi | Viehkan lea suohtas. | Actio+Nom |
Erzya | ÐелезÑÐ½Ñ ÑÑкÑÐ½Ñ ÑийнемаÑÑ. | Der/ÐмÐ+Nom |
Finnish | Juokseminen on kivaa. | Der/minen+Nom |
Estonian | Jooksmine on lahe. | Der/mine+Nom |
Hungarian | A futás jó dolog. | /VERB[GERUND]/NOUN |
‘I like running.’ | ||
North Sámi | Liikon viehkat. | Inf |
Erzya | ЧийнемÑÑÑ Ð½ÐµÐ¸Ñ ÑÑÑанÑÑ. | Inf+Ela |
Finnish | Pidän juoksemisesta. | Der/minen+Ela |
Estonian | Mulle meeldib joosta. | Inf |
Hungarian | Szeretem futni. | /VERB<INF> |
5.3 Derivation, compounding and lexicalisation
A classical problem in computational morphologies lies in the question of lexicalisation and the productivity of certain processes: is a morphologically created word-form a new word or a form of a possibly distant root? Morphologies take widely different and opposing approaches to this, ranging from lexicalise-everything to collect-everything. See the examples below:
Language | ‘to drink’ | ‘a drink’ | ‘drinker’ | ‘brewery’ |
---|---|---|---|---|
North Sámi | juhkat | juhkamuš | — | vuolla·buvttadeaddji |
Erzya | симемс | симема-пель | симиця | пиянь завод |
Finnish | juoda | juo-ma | juo—ja | olut·tehdas |
Estonian | jooma | joo—gi | joo— | õlle·tehas |
Hungarian | iszik | ital | iv—ó | sör·főzde |
The symbols ‘·’, ‘-’ and ‘—’ stand for compounding, inflection and derivation, respectively.
5.4 Pronouns and determiners
The distinction between pronoun and determiner is not widely made in traditional grammars of most Uralic languages. Words which may be considered both pronouns and determiners are lumped into a single morphosyntactic class (usually pronoun). Consider the following examples involving the word ‘this’:
Language | ‘I see this house.’ | ‘I see this.’ |
---|---|---|
North Sámi | Oainnán dán viesu. | Oainnán dán. |
Erzya | Неян те_det кудонть. | Неян тень_pron. |
Finnish | Mä näen tämän_pron talon. | Mä näen tämän_pron. |
Estonian | Ma näen selle_pron maja. | Ma näen selle_pron. |
Hungarian | Nézem ezt_det/noun a_art házat. | Nézem azt_det/noun |
In traditional grammars of North Sámi, Finnish and Estonian both the pronominal and the modifier analyses of ‘this’ are classified as pronouns. In Hungarian and Erzya, a distinction is made, with Hungarian making a pronoun/determiner distinction and Erzya making a distinction between quantifier (determiner) and nominalised quantifier.
If we consider a standard definition of pronoun to be ‘that which stands in place (pro-) of a noun phrase (-noun)’ then we can see that in the above, only the tools for Erzya follow this. The other languages leave the distinction to tools later in the pipeline.
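To illustrate what leaving the distinction to a later tool can mean, the following sketch relabels a pronoun as a determiner when the next token is a noun. This is a deliberately crude heuristic for illustration; the tag names `Pron`, `Det`, `N`, `V` and the function name are assumptions, and none of the tools discussed here is claimed to work this way.

```python
# Crude post-analysis refinement: Pron directly before a noun becomes Det.
# Tag names and the heuristic itself are assumptions for illustration.

def refine_pronouns(tagged):
    """tagged is a list of (form, pos) pairs; returns a relabelled copy."""
    refined = []
    for i, (form, pos) in enumerate(tagged):
        next_pos = tagged[i + 1][1] if i + 1 < len(tagged) else None
        if pos == "Pron" and next_pos == "N":
            pos = "Det"
        refined.append((form, pos))
    return refined


with_noun = [("Mä", "Pron"), ("näen", "V"), ("tämän", "Pron"), ("talon", "N")]
without_noun = [("Mä", "Pron"), ("näen", "V"), ("tämän", "Pron")]
print(refine_pronouns(with_noun))
print(refine_pronouns(without_noun))
```

In the first Finnish sentence tämän is relabelled as a determiner, while in the second it keeps the pronoun label.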
5.5 Non-inflecting words
All languages in the Uralic family have a wide variety of non-inflecting word forms. Depending on the grammatical tradition followed by the language resource, these may be simply lumped into a single class, or they may have extensive syntactic or semantic subcategorisation. Table 3 gives a number of examples of non-inflecting words and the equivalent morphological analyses they receive in each of the languages we are studying. To a machine translation practitioner, these distinctions are largely superfluous: ja in North Sámi will be translated as ja in Finnish and ja in Estonian. However, the distinctions may be vital for the intervening disambiguation tools, and as such need to be taken into account.
Gloss | North Sámi | Erzya | Finnish | Estonian | Hungarian |
---|---|---|---|---|---|
and | ja+CC | марто+Po+COM | ja Part | ja+J | és /CONJ |
very | hui+Adv | пек+Adv+AdA | tosi Part | väga+Adv | nagyon /ADV |
under | vuolde+Po | алов+Po+Lat | alle Part | alla+K | alatt /POSTP |
now | dál+Adv | ней+Adv+Temp | nyt Part | praegu+Adv | most /ADV |
hello | bures+Interj | шумбрачи+Interj+Formulaic | moi Part | tere+I | szia /UTT-INT |
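The analyses in the table differ both in notation and in granularity. As a rough sketch, they can be parsed into a lemma and a tag list and then mapped to a coarse class for transfer, while the original tags are kept for disambiguation. The coarse labels and the `COARSE` table are assumptions made for this example; the fine-grained tags are those shown in the table.

```python
import re

# Hypothetical mapping from language-specific tags to a coarse class.
COARSE = {
    "CC": "CONJ", "J": "CONJ", "CONJ": "CONJ",
    "Adv": "ADV", "ADV": "ADV",
    "Po": "ADP", "K": "ADP", "POSTP": "ADP",
    "Interj": "INTJ", "I": "INTJ", "UTT-INT": "INTJ",
}


def parse_analysis(text):
    """Split 'ja+CC', 'ja Part' or 'és /CONJ' into (lemma, [tags])."""
    lemma, *tags = [part for part in re.split(r"[+/\s]+", text.strip()) if part]
    return lemma, tags


def coarse(text):
    """Return (lemma, coarse class), or None as class when underspecified."""
    lemma, tags = parse_analysis(text)
    for tag in tags:
        if tag in COARSE:
            return lemma, COARSE[tag]
    return lemma, None


for analysis in ["ja+CC", "ja Part", "ja+J", "és /CONJ"]:
    print(analysis, "->", coarse(analysis))
```

The Finnish `Part` tag maps to no coarse class here, which reflects the table: the class is too broad to decide between conjunction, adverb, adposition and interjection without further information.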
6 Guidelines
6.1 Separation of lexicon and morphotactics
One of the main components of any rule-based system for morphologically complex languages is a lexicon consisting of stems and inflectional and derivational categories. In some cases, such as for Finnish, these are partly provided by a state institution, such as a language board. In other cases they are the product of many years of work.
Although categorising stems for inclusion in a morphological lexicon (many contain over 100,000 entries) can take a substantial amount of work, even if done semi-automatically, implementing the morphotactics (that is, the rules covering inflection, derivation and compounding) may take substantially less time.
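The separation can be pictured as two independent tables: a lexicon that assigns each stem a paradigm class, and morphotactics that describe each class once. The sketch below is a toy under stated assumptions: the class name `N_TALO`, the feature tuples and the two Finnish stems are chosen for illustration and ignore phenomena such as consonant gradation and vowel harmony.

```python
# Lexicon: stems paired with a paradigm class. This is the expensive,
# reusable part. The class name is hypothetical.
LEXICON = {"talo": "N_TALO", "auto": "N_TALO"}

# Morphotactics: endings per paradigm class, written once per class.
MORPHOTACTICS = {
    "N_TALO": {
        ("Sg", "Nom"): "",
        ("Sg", "Gen"): "n",
        ("Sg", "Ine"): "ssa",
    }
}


def generate(stem, features):
    """Generate a word-form from a stem and a feature tuple."""
    paradigm = LEXICON[stem]
    return stem + MORPHOTACTICS[paradigm][features]


print(generate("talo", ("Sg", "Ine")))
print(generate("auto", ("Sg", "Gen")))
```

Because the tags live only in `MORPHOTACTICS`, the tagset can be revised or replaced without touching the stem lexicon.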
6.2 Maximise parallelism
In line with the Universal Dependencies project (see Section 7), we propose the adoption of a principle of maximum parallelism. In short, “things that are the same should be tagged the same”. We do not propose that this should mean that all distinctions should be made in all languages. For example, those Uralic languages without object conjugation should not be required to adopt the agreement tags of those that have it. But it should be possible to come up with principled and consistent guidelines for closed categories.
7 Universal dependencies
Universal Dependencies is a large multi-language project [3] aiming at a common tagset for parts of speech, morphosyntactic features and dependency relations. We do not propose adopting the exact tagset of the Universal Dependencies project. Most projects working on Uralic languages have been ongoing for many years, and the tools that they create are used for more than just machine translation. What we find more important is to adopt, or make available, tools based on a consistent theoretical background and a consistent morphosyntactic description. This could form the basis of a kind of universal morphosyntactic interlingua for the Uralic languages. These tools do not have to replace the current tools, and may be automatically generated from them, but they must be consistent, so a systematic mapping needs to be considered during development. The national Uralic languages have specifications for Universal Dependencies [7, 4, 11], but these specifications differ in unnecessary ways. For example, consider the annotation of ‘that house’ in the two treebanks for Finnish, Turku Dependency Treebank (TDT) and FinnTreeBank (FTB), and in Hungarian:
Language | this | house | |
---|---|---|---|
Finnish (TDT) | tämä_PRON | talo_NOUN | |
Finnish (FTB) | tämä_DET | talo_NOUN | |
Hungarian | az_PRON | a_ART | ház_NOUN |
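Differences of this kind can be found mechanically. The following sketch compares `form_TAG` tokens from several sources and reports the forms whose tags disagree. The function name and input format are assumptions for illustration; the two Finnish annotations are the ones shown in the table.

```python
from collections import defaultdict

# Report word-forms that receive different tags in different sources.
# Input format ('form_TAG' strings per source) is an assumption.

def tag_disagreements(annotations):
    """annotations maps a source name to a list of 'form_TAG' tokens."""
    seen = defaultdict(dict)
    for source, tokens in annotations.items():
        for token in tokens:
            form, _, tag = token.rpartition("_")
            seen[form][source] = tag
    return {
        form: tags
        for form, tags in seen.items()
        if len(set(tags.values())) > 1
    }


finnish = {
    "TDT": ["tämä_PRON", "talo_NOUN"],
    "FTB": ["tämä_DET", "talo_NOUN"],
}
print(tag_disagreements(finnish))
```

Run over aligned samples, a check like this gives a quick list of the closed-class items where the specifications need to be reconciled.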
8 Concluding remarks
Rule-based machine translation provides a fascinating basis for exploring real linguistic differences between the Uralic languages. However, as we have shown, in current state-of-the-art tools, real linguistic differences are hidden behind a combination of incompatible tagsets and idiosyncratic traditional grammatical norms. We do not propose that the North Sámi adopt the Finnish norms or the Hungarians the Erzya norms; instead we propose developing a common morphological annotation scheme for the Uralic languages based on the guidelines of the Universal Dependencies project. It is not our aim for this to supersede national standards, but to provide a common bridge between them to facilitate cross-linguistic study and functional rule-based machine translation.
Acknowledgements
Heiki-Jaan Kaalep, Jack Rueter, László Tihany as well as the anonymous reviewers have all contributed to the language examples; the remaining mistakes are ours.
Appendix A Example of Universal dependencies for Uralic languages
An example is shown in Table 4.
References
- [1] (2010) Apertium: a free/open-source platform for rule-based machine translation platform. Machine Translation.
- [2] (2010) Apertium: a free/open-source platform for rule-based machine translation platform. Machine Translation.
- [3] (2013) Universal dependency annotation for multilingual parsing. In ACL (2), pp. 92–97.
- [4] (2014) Estonian dependency treebank and its annotation scheme. In Proceedings of 13th Workshop on Treebanks and Linguistic Theories (TLT13), pp. 285–291.
- [5] (2011) Rapid rule-based machine translation between Dutch and Afrikaans. In Proceedings of the 15th conference of the European Association for Machine Translation, 30-31 May 2011, Leuven, Belgium, pp. 153–160.
- [6] (2015) Omorfi: free and open source morphological lexical database for Finnish. In Nordic Conference of Computational Linguistics NODALIDA 2015, pp. 313.
- [7] (2015) Universal dependencies for Finnish. In Nordic Conference of Computational Linguistics NODALIDA 2015, pp. 163.
- [8] (2011) Grammatical framework: programming with multilingual grammars. CSLI Publications, Center for the Study of Language and Information.
- [9] (2010) Adnominal person in the morphological system of Erzya. Ph.D. Thesis, Helsingin yliopisto.
- [10] (2005) Hunmorph: open source word analysis. In Proceedings of the Workshop on Software, pp. 77–85.
- [11] (2010) Hungarian dependency treebank. In LREC.