Machine translation between related languages by Serge Sharoff
Statistical MT learns the probabilities of translations of individual
words and constructions from aligned corpora. The traditional approach
makes no distinction
concerning the linguistic ‘distances’ between the languages: the
English-Chinese pair is built using the same methods as the
Spanish-Portuguese one. However, the latter pair is likely to benefit
from the similarities in words and constructions between these two
languages since they have a common Latin origin.
This issue is especially relevant to the European translation
infrastructure, given that a large number of the EU languages are
related to each other. In our presentation, we will discuss (1) the
technical and linguistic issues involved in building Statistical MT
for related languages, as well as (2) the issues concerning evaluation
of MT output quality for a large number of language pairs in indirect
translation.
The first task involves automatic detection of cognate words, i.e.,
words having similarities in their spelling and meaning in two
languages, e.g., maladie (French for ‘disease’) versus malattia (its
Italian equivalent). Such lists can be generated from large
monolingual resources in order to improve the out-of-vocabulary
coverage beyond the lexicon available from (smaller) aligned
corpora. Adding these pairs to the translation lexicon of an SMT
engine requires more than orthographic similarity: the process also
needs to take into account the link between linguistic forms and their
functions, e.g., whether both forms are singular or plural, past
indicative or conditional mood, etc.
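The orthographic side of cognate detection can be illustrated with a minimal sketch. It assumes a normalised Levenshtein similarity and a threshold of 0.6; neither is stated in the abstract, and the function names are hypothetical. It only proposes candidates by spelling: checking that the meanings and grammatical functions match is a separate step that this sketch does not attempt.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance over the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def cognate_candidates(src_words, tgt_words, threshold=0.6):
    """Return (source, target, score) pairs at or above the assumed threshold."""
    pairs = []
    for s in src_words:
        for t in tgt_words:
            score = similarity(s.lower(), t.lower())
            if score >= threshold:
                pairs.append((s, t, score))
    return sorted(pairs, key=lambda p: -p[2])


# The example pair from the text: French 'maladie' vs Italian 'malattia'.
print(cognate_candidates(["maladie"], ["malattia", "cane"]))
```

With these assumptions, maladie and malattia differ by three edits over eight characters and so pass the threshold, while an unrelated word such as cane does not. A quadratic comparison over two full vocabularies would be too slow for large monolingual resources, so a practical system would need some form of candidate pruning.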
The second task involves dealing with a large amount of variation
possible in indirect translation via a pivot language. For example, an
acceptable MT output from Spanish into Portuguese is not necessarily
close to a human translation produced directly from English into
Portuguese, because the translator of this sentence from English into
Spanish could have used lexical choices different from those in the
English-Portuguese direction. We approach this problem via customised
Quality Estimation (QE) models, bearing in mind that existing QE
models cover only a small number of language pairs, which do not
include pairs such as Spanish-Portuguese. Our solution again takes
into account the similarities between the languages, so that existing
machine learning models are transferred using domain adaptation
techniques, such as Self-Taught Learning. In this case, the in-domain
feature space is that of a language with available QE resources (e.g.,
Spanish), and it is transferred to the out-of-domain feature space of
a related language with fewer resources (e.g., Portuguese).
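The motivation for reference-free Quality Estimation in the pivot setting can be seen in a small sketch. It assumes a simple clipped unigram overlap as the reference-based measure, which is not a metric named in the abstract, and the two Portuguese sentences are invented for illustration. Both are plausible renderings of the same meaning, yet they share only half of their words.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference (clipped counts)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    remaining = Counter(reference.lower().split())
    matched = 0
    for tok in cand:
        if remaining[tok] > 0:
            matched += 1
            remaining[tok] -= 1
    return matched / len(cand)


# Invented example: a reference translated directly from English, and an
# MT output reached via Spanish that made different lexical choices.
reference = "o carro é rápido"
mt_output = "o automóvel é veloz"

print(unigram_overlap(mt_output, reference))
```

A reference-based score penalises the second sentence for its lexical choices rather than for any translation error, which is the variation a QE model is meant to look past.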
10/24/2016 2:02:59 PM