Machine translation between related languages by Serge Sharoff
The general principle of Statistical MT is to learn the probabilities of translations of individual words and constructions from aligned corpora. The traditional approach makes no distinction concerning the linguistic ‘distances’ between the languages: the English-Chinese pair is built using the same methods as the Spanish-Portuguese one. However, the latter pair is likely to benefit from the similarities in words and constructions between the two languages, since they have a common Latin origin. This issue is especially relevant to the European translation infrastructure, given that a large number of the EU languages are related to each other. In our presentation, we will discuss (1) the technical and linguistic issues involved in building Statistical MT for related languages, as well as (2) the issues concerning evaluation of MT output quality for a large number of language pairs in indirect translation. The first task involves automatic detection of cognate words, i.e., words with similar spelling and meaning in two languages, e.g., maladie (French for ‘disease’) versus malattia (its Italian equivalent). Lists of such cognates can be generated from large monolingual resources in order to improve out-of-vocabulary coverage beyond the lexicon available from (smaller) aligned corpora. In addition to orthographic similarity, the process of adding cognates to the translation lexicon of an SMT engine also needs to take into account the link between linguistic forms and their functions, e.g., whether both forms are singular or plural, past indicative or conditional mood, etc. The second task involves dealing with the large amount of variation possible in indirect translation via a pivot language.
For example, an acceptable MT output from Spanish into Portuguese is not necessarily close to a human translation produced directly from English into Portuguese, because the translator of this sentence from English into Spanish could have made lexical choices different from those made in the English-Portuguese direction. We approach this problem via customised Quality Estimation (QE) models, while paying attention to the fact that the number of language pairs covered by existing QE models is quite small and does not include pairs such as Spanish-Portuguese. Our solution again takes into account the similarities between the languages, so that existing machine learning models are transferred using domain adaptation techniques, such as Self-Taught Learning. In this case, the in-domain data are represented by the feature space of a language with available QE resources (e.g., Spanish), which is transferred to the out-of-domain feature space of a related language with fewer resources (e.g., Portuguese).
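The orthographic side of the cognate detection step described above can be illustrated with a minimal sketch. It is not the method used in the talk: it assumes that a simple character-level similarity ratio from Python's standard library is an adequate stand-in for a cognate score, and the function names and the 0.6 threshold are hypothetical choices made for this illustration. It does not check similarity of meaning or the match of grammatical function, both of which the abstract treats as necessary.

```python
from difflib import SequenceMatcher
import unicodedata


def normalise(word):
    """Lowercase and strip diacritics so accents do not count as mismatches."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalised words."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def cognate_candidates(source_words, target_words, threshold=0.6):
    """Pair each source word with its most similar target word.

    The threshold is an assumed value for illustration, not a tuned one.
    Only pairs scoring at or above it are returned.
    """
    pairs = []
    for s in source_words:
        best = max(target_words, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:
            pairs.append((s, best, round(score, 2)))
    return pairs


# Word lists would normally come from large monolingual corpora.
french = ["maladie"]
italian = ["casa", "malattia"]
print(cognate_candidates(french, italian))
```

With these toy lists, maladie is paired with malattia rather than casa. Candidates found this way would still need filtering by meaning and by grammatical function before being added to a translation lexicon.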
10/24/2016 2:02:59 PM