[New publication] Multiword Units in Machine Translation and Translation Technology

Editors: Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan

About this book:

The correct interpretation of Multiword Units (MWUs) is crucial to many applications in Natural Language Processing but is a challenging and complex task. In recent years, the computational treatment of MWUs has received considerable attention but there is much more to be done before we can claim that NLP and Machine Translation (MT) systems process MWUs successfully.

This volume provides a general overview of the field with particular reference to Machine Translation and Translation Technology and focuses on languages such as English, Basque, French, Romanian, German, Dutch and Croatian, among others. The chapters of the volume illustrate a variety of topics that address this challenge, such as the use of rule-based approaches, compound splitting techniques, MWU identification methodologies in multilingual applications, and MWU alignment issues.

Abstracts of Chapters:

Analysing linguistic information about word combinations for a Spanish-Basque rule-based machine translation system

Uxoa Iñurrieta | IXA NLP group, University of the Basque Country

Itziar Aduriz | Department of Linguistics, University of Barcelona

Arantza Díaz de Ilarraza | IXA NLP group, University of the Basque Country

Gorka Labaka | IXA NLP group, University of the Basque Country

Kepa Sarasola | IXA NLP group, University of the Basque Country

This paper describes an in-depth analysis of noun + verb combinations in Spanish-Basque translations. Firstly, we examined noun + verb constructions in the dictionary, and confirmed that this kind of MWU varies considerably from language to language, which justifies the need for their specific treatment in MT systems. Then, we searched for those combinations in a parallel corpus, and we selected the most frequently occurring ones to analyse them further and classify them according to their level of syntactic fixedness and semantic compositionality. We tested whether adding linguistic data relevant to MWUs improved the detection of Spanish combinations, and we found that, indeed, the number of MWUs identified increased by 30.30% with a precision of 97.61%. Finally, we also evaluated how an RBMT system translated the MWUs we analysed, and concluded that at least 44.44% needed to be corrected or improved.

Keywords: Basque, Multiword Units, Spanish, Rule-Based Machine Translation, semantic compositionality, morphosyntactic fixedness. https://doi.org/10.1075/cilt.341.02inu

How do students cope with machine translation output of multiword units? An exploratory study

Joke Daems | Ghent University, Department of Translation, Interpreting and Communication

Michael Carl | Renmin University of China, Beijing & Copenhagen Business School, Department of Management, Society and Communication

Sonia Vandepitte | Ghent University, Department of Translation, Interpreting and Communication

Robert J. Hartsuiker | Ghent University, Department of Experimental Psychology

Lieve Macken | Ghent University, Department of Translation, Interpreting and Communication

In this chapter, we take a closer look at students’ post-editing of multiword units (MWUs) from English into Dutch. The data consists of newspaper articles post-edited by translation students as collected by means of advanced keystroke logging tools.

We discuss the quality of the machine translation (MT) output for various types of MWUs, and compare this with the final post-edited quality. In addition, we examine the external resources consulted for each type of MWU. Results indicate that contrastive MWUs are harder for the MT system to translate, and harder for the student post-editors to correct, than non-contrastive MWUs. We further find that consulting a variety of external resources helps student post-editors solve MT problems.

Keywords: post-editing, search strategies, machine translation, translation process, translation quality, external resources, multiword units. https://doi.org/10.1075/cilt.341.03dae

Aligning verb + noun collocations to improve a French-Romanian FSMT system

Amalia Todiraşcu | FDT (Fonctionnements Discursifs et Traduction), LiLPa (Linguistique, Langues, Parole), Université de Strasbourg

Mirabela Navlea | FDT (Fonctionnements Discursifs et Traduction), LiLPa (Linguistique, Langues, Parole), Université de Strasbourg

We present several Verb + Noun collocation integration methods using linguistic information, aiming to improve the results of a French-Romanian factored statistical machine translation system (FSMT). The system uses lemmatised, tagged and sentence-aligned legal parallel corpora. Verb + Noun collocations are frequent word associations, sometimes discontinuous, related by syntactic links and with non-compositional sense (Gledhill, 2007). Our first strategy extracts collocations from monolingual corpora, using a hybrid method which combines morphosyntactic properties and frequency criteria. The second method applies a bilingual collocation dictionary to identify collocations. Both methods transform collocations into single tokens before alignment. The third method applies a specific alignment algorithm for collocations. We evaluate the influence of these collocation alignment methods on the results of the lexical alignment and of the FSMT system.

Keywords: hybrid collocation identification, lexical alignment, MWE, FSMT, MWE-aware MT systems, collocation dictionary. https://doi.org/10.1075/cilt.341.04tod
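
As an illustration of the "single token" step mentioned in the abstract above, the following is a minimal sketch, not the authors' implementation. It assumes lemmatised input and a pre-built list of collocations, and it only handles contiguous occurrences (the abstract notes that collocations can also be discontinuous). The function name `merge_collocations`, the `joiner` parameter and the example data are hypothetical.

```python
def merge_collocations(tokens, collocations, joiner="_"):
    """Replace each contiguous occurrence of a known collocation with one token.

    tokens:       list of lemmatised tokens for one sentence
    collocations: iterable of tuples of lemmas (hypothetical dictionary entries)
    """
    merged, i = [], 0
    # Try longer collocations first so overlapping candidates resolve predictably.
    ordered = sorted((tuple(c) for c in collocations), key=len, reverse=True)
    while i < len(tokens):
        for colloc in ordered:
            n = len(colloc)
            if tuple(tokens[i:i + n]) == colloc:
                merged.append(joiner.join(colloc))  # e.g. "prendre_décision"
                i += n
                break
        else:
            # No collocation starts at this position: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


sentence = ["la", "commission", "prendre", "décision", "rapidement"]
print(merge_collocations(sentence, [("prendre", "décision")]))
```

The merged sentence can then be passed to a word aligner, so that the collocation is aligned as one unit rather than word by word.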

Multiword expressions in multilingual information extraction

Gregor Thurmair | Linguatec

Multilingual Information Extraction requires significant Multiword Expressions (MWE) processing as many such items are multiwords. The lexical representation of MWEs supports large bilingual lexicons (for Persian, Pashto, Turkish, Arabic); multiwords are represented like single words, extended by two annotations: MWE head, and lemma plus part of speech for the MWE parts. In text analysis, MWEs are recognised as part of the parsing process, not as pre- or post-processing components. The analysis design extends the X-bar scheme by a level for multiword rules. In transfer, MWEs are translated as elementary nodes like single word lemmata, to present key concepts for relevance judgement in Information Extraction. Evaluation shows that 90% of the MWE patterns in the lexicon can be analysed with about 150 MWE-specific rules, and that more than 90% of text document tokens are covered by the proposed integrated single and multiword processing.

Keywords: Multilingual Information Extraction, Machine Translation, Multiword expressions, Multilingual Indexing, Persian, Pashto, morphological analyser, lexical analysis, Arabic, lexical representation. https://doi.org/10.1075/cilt.341.05thu

A multilingual gold standard for translation spotting of German compounds and their corresponding multiword units in English, French, Italian and Spanish

Simon Clematide | Institute of Computational Linguistics, University of Zurich

Stéphanie Lehner | Institute of Computational Linguistics, University of Zurich

Johannes Graën | Institute of Computational Linguistics, University of Zurich

Martin Volk | Institute of Computational Linguistics, University of Zurich

This article describes a new word alignment gold standard for German nominal compounds and their multiword translation equivalents in English, French, Italian, and Spanish. The gold standard contains alignments for each of the ten language pairs, resulting in a total of 8,229 bidirectional alignments. It covers 362 occurrences of 137 different German compounds randomly selected from the corpus of European Parliament plenary sessions, sampled according to the criteria of frequency and morphological complexity. The standard serves for the evaluation and optimisation of automatic word alignments in the context of spotting translations of German compounds. The study also shows that in this text genre, around 80% of German noun types are morphological compounds indicating potential multiword units in their parallel equivalents.

Keywords: gold standard, word alignment, German, English, compounding, multilinguality, Spanish, Italian, French. https://doi.org/10.1075/cilt.341.06cle

Dutch compound splitting for bilingual terminology extraction

Lieve Macken | Ghent University, Department of Translation, Interpreting and Communication

Arda Tezcan | Ghent University, Department of Translation, Interpreting and Communication

As compounds pose a problem for applications that rely on precise word alignments, we developed a state-of-the-art compound splitter for Dutch that makes use of corpus frequency information and linguistic knowledge. Domain-adaptation techniques are used to combine large out-of-domain and dynamically compiled in-domain frequency lists. As compounds are not always translated compositionally, we developed a novel methodology for word alignment. We train the word alignment models twice: a first time on the original data set and a second time on the data set in which the compounds are split into their component parts. Experiments show that the compound splitter combined with the novel word alignment technique considerably improves bilingual terminology extraction results.

Keywords: compound splitting, bilingual terminology extraction, word alignment, Dutch, multiword units, translation. https://doi.org/10.1075/cilt.341.07mac
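
To make the idea of frequency-based compound splitting concrete, here is a minimal sketch under stated assumptions. It is not the splitter described in the chapter: it scores two-part splits by the geometric mean of the parts' corpus frequencies, ignores Dutch linking elements and any linguistic knowledge, and uses an invented frequency list. The name `split_compound` and the `min_len` parameter are hypothetical.

```python
def split_compound(word, freq, min_len=3):
    """Return the best two-part split of `word`, or [word] if no split scores higher.

    freq:    dict mapping known words to corpus frequencies (made-up numbers here)
    min_len: minimum length of each part, to avoid spurious short parts
    """
    # Baseline: keep the word whole, scored by its own frequency.
    best_parts, best_score = [word], freq.get(word, 0)
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in freq and right in freq:
            # Geometric mean of the part frequencies.
            score = (freq[left] * freq[right]) ** 0.5
            if score > best_score:
                best_parts, best_score = [left, right], score
    return best_parts


toy_freq = {"fiets": 500, "pad": 300, "fietspad": 20}  # illustrative counts only
print(split_compound("fietspad", toy_freq))
```

In the two-pass setup the abstract describes, word alignment would be trained once on the original text and once on text in which compounds have been replaced by the parts returned by such a splitter.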

A flexible framework for collocation retrieval and translation from parallel and comparable corpora

Oscar Mendoza Rivera | Research Group in Computational Linguistics, University of Wolverhampton

Ruslan Mitkov | Research Group in Computational Linguistics, University of Wolverhampton

Gloria Corpas Pastor | Research Group in Computational Linguistics, University of Wolverhampton

This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable corpora, developed with translators and language learners in mind. It is based on a phraseology framework, applies statistical techniques, and employs open source tools and online resources. The collocation retrieval and translation has proved successful for English and Spanish and can be easily adapted to other languages. The evaluation results are promising and future goals are proposed. Furthermore, conclusions are drawn on the nature of comparable corpora and how they can be better exploited to suit particular needs of target users.

Keywords: parallel corpora, comparable corpora, collocation retrieval, collocation translation, phraseology. https://doi.org/10.1075/cilt.341.08riv

On identification of bilingual lexical bundles for translation purposes – The case of an English-Polish comparable corpus of patient information leaflets

Łukasz Grabowski | University of Opole (Poland)

Grounded in phraseology and corpus linguistics, this paper aims to explore the use of bilingual lexical bundles to improve the degree of naturalness and textual fit of translated texts. More specifically, this study attempts to identify lexical bundles, that is, recurrent sequences of 3–7 words with similar discursive functions in a purpose-designed comparable corpus of English and Polish patient information leaflets, with 100 text samples in each language. Because of cross-linguistic differences, we additionally apply a number of formal criteria in order to filter out the bundles in each subcorpus. The results show that bilingual lexical bundles with overlapping discourse functions in texts and extracted from comparable corpora hold unexplored potential for machine translation, computer-assisted translation and bilingual lexicography.

Keywords: lexical bundles, translation universals, patient information leaflets, comparable corpora, translation quality. https://doi.org/10.1075/cilt.341.09gra
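
The notion of a lexical bundle as a recurrent sequence of 3–7 words can be sketched as follows. This is an illustrative example only, not the procedure used in the chapter: the whitespace tokenisation, the thresholds `min_freq` and `min_texts`, and the sample sentences are all assumptions, and the chapter's additional formal filtering criteria are not modelled.

```python
from collections import Counter, defaultdict


def lexical_bundles(texts, n_min=3, n_max=7, min_freq=2, min_texts=2):
    """Return word n-grams (n_min..n_max) that recur often enough across texts.

    min_freq:  minimum total number of occurrences (assumed threshold)
    min_texts: minimum number of different texts containing the n-gram (assumed)
    """
    freq = Counter()            # total occurrences per n-gram
    spread = defaultdict(set)   # ids of the texts in which each n-gram occurs
    for doc_id, text in enumerate(texts):
        tokens = text.lower().split()  # naive whitespace tokenisation
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                spread[gram].add(doc_id)
    return {
        gram: count
        for gram, count in freq.items()
        if count >= min_freq and len(spread[gram]) >= min_texts
    }


leaflets = [
    "do not take this medicine if you are allergic",
    "do not take this medicine with alcohol",
]
for gram, count in sorted(lexical_bundles(leaflets).items()):
    print(" ".join(gram), count)
```

Requiring a bundle to appear in several texts, not just several times, keeps phrases that are typical of the genre and drops those repeated within a single leaflet.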

The quest for Croatian idioms as multiword units

Kristina Kocijan | Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

Sara Librenjak | Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb

Idiomatic expressions are types of MWUs in which the meaning of the unit does not equal the cumulative meaning of its parts. They are culturally dependent, so the translation cannot be inferred from the expression itself. The Croatian language has a very rich idiomatic structure. A few such expressions can be understood in direct translation, but most differ from their literal translations. As idioms are rooted in the tradition of the language and society from which they hail, they need special treatment in computational linguistics. Using NooJ as an NLP tool, we describe different types of Croatian idioms that will help us recognize them in texts. Idiom recognition should be given special treatment, being the major task in translation.

Keywords: idiomatic expressions, translation, Multiword units, Croatian, NooJ. https://doi.org/10.1075/cilt.341.10koc

Corpus analysis of Croatian constructions with the verb doći ‘to come’

Goranka Blagus Bartolec | Institut za hrvatski jezik i jezikoslovlje, Zagreb

Ivana Matas Ivanković | Institut za hrvatski jezik i jezikoslovlje, Zagreb

This paper presents a corpus-based analysis of constructions consisting of the verb doći ‘to come’ followed by the prepositions do ‘to’ or na ‘onto’ and a noun. Croatian lexicography mainly describes the verb doći as an intransitive verb of motion. Other uses of this verb are listed in the idioms section. This paper will address the association between the verb-preposition constructions and nouns that follow them. The paper will take frequency data into consideration and attempt to distinguish between primary and additional meanings of constructions doći do ‘to come to’ and doći na ‘to come onto’. The research will be based on hrWaC 2.0 (Croatian Web Corpus, Version 2), the Croatian National Corpus and the Croatian Language Repository corpora.

Keywords: Croatian language, verb doći, preposition, corpus-based analysis, construction, noun. https://doi.org/10.1075/cilt.341.11bar

Anaphora resolution, collocations and translation

Eric Wehrli | LATL-CUI, University of Geneva

Luka Nerima | LATL-CUI, University of Geneva

Collocation identification and anaphora resolution are widely recognised as major issues for natural language processing, and particularly for machine translation. This paper focuses on their intersection, that is, verb-object collocations in which the object has been pronominalised. To handle such cases, an anaphora resolution procedure must link the direct object pronoun to its antecedent. The identification of a collocation can then be made on the basis of the verb and its object or its antecedent. Results obtained from the translation of a large corpus will be discussed, as well as an evaluation of the precision of the anaphora resolution procedure for this specific task.

Keywords: translation, pronominalized collocations, syntax based translation, Collocations, anaphora resolution, syntax based collocation detection. https://doi.org/10.1075/cilt.341.12weh

About the editors:

Ruslan Mitkov: I completed my university degree at Humboldt University, Berlin, Germany, where I studied from 1974 to 1979. I was privileged to have as my doctoral supervisor one of the pioneers of German computer science, Prof Dr Nikolaus Joachim Lehmann, at the Technical University of Dresden, Germany, where I received my PhD on 9 January 1987. I joined the University of Wolverhampton on 2 October 1995. Since 1997, I have been Professor of Computational Linguistics and Language Engineering, and in the same year I founded the Research Group in Computational Linguistics, which I have led ever since. I am Director of the Research Institute for Information and Language Processing, which consists of two internationally renowned research groups: the Statistical Cybermetrics Research Group led by Prof. Mike Thelwall and the Research Group in Computational Linguistics. Prior to coming to Wolverhampton, I was Research Professor at the Institute of Mathematics, Bulgarian Academy of Sciences. I was also a Fellow of the Alexander von Humboldt Foundation in 1993 and 1994 at Saarland University and the University of Hamburg. I was awarded Doctor Honoris Causa from Plovdiv University in 2011 and Professor Honoris Causa from Veliko Tarnovo University in 2014. [https://www.wlv.ac.uk/research/institutes-and-centres/riilp—research-institute-in-information-and-lan/research-group-of-computational-linguistics/staff-at-rgcl/professor-mitkov/]

Johanna Monti is Associate Professor of Modern Languages Teaching at the “L’Orientale” University of Naples. She was the Computational Linguistics Research Manager of the Thamus Consortium (Italy). She received her PhD in Computational Linguistics from the University of Salerno, Italy. Her research activities are in the field of hybrid approaches to Machine Translation and NLP applications. (https://www.researchgate.net/profile/Johanna_Monti)

Gloria Corpas Pastor is Professor of Translation and Interpreting at the University of Malaga (Spain), Visiting Professor in Translation Technologies at the Research Institute of Information and Language Processing (University of Wolverhampton) and Head of the Research Group in Translation and Lexicography (Lexytrad). She has been actively involved in the development of EN 15038:2006 as an AEN/CTN 174 and CEN/BTTF 138 Spanish delegate. She is a Spanish expert for the future ISO Standard (ISO TC37/SC2-WG6 “Translation and Interpreting”). She acts as a Ministry advisor on the Bologna Process via the Spanish Agency ANECA. Prof. Corpas’s research fields range from technical translation and ICTs to phraseology (member of the Advisory Council of Europhras), corpus linguistics and corpus-based translation (2007 Translation Technologies Research Award), and lexicography (1995 Euralex Verbatim Award). She has published a wide collection of papers in both national and international journals, and is author and editor of several books. Her Manual de fraseología española (Madrid, Gredos) has been a reference in the field since 1996. [http://www.europhras2015.eu/shortbiogloria]

Violeta Seretan is a senior researcher at the Department of Translation Technology (referred to by its French acronym TIM) of the University of Geneva. Her current research is in the area of machine translation, with a specific focus on evaluating pre-editing and post-editing strategies for improving the translation of user-generated content. She co-directs the ACCEPT European project, the goal of which is to bring machine translation closer to user communities. She is more generally interested in computational linguistics, information extraction, computational lexicography and translation aids. Previously, she conducted research on text simplification and text summarization at the Institute for Language, Cognition and Computation of the University of Edinburgh, thanks to a Swiss National Science Foundation grant for advanced researchers. Before that, she worked on rule-based machine translation and multilingual parsing as a senior researcher at the Language Technology Laboratory of the University of Geneva. She received her PhD in Computational Linguistics from the University of Geneva in 2008. Her doctoral research concerned the interrelation between phraseology, parsing and translation, a topic on which she has presented and published extensively. Her doctoral work was awarded the University of Geneva Latsis Prize in 2010. [https://www.unige.ch/fti/en/faculte/departements/dtim/membrestim/seretan/]