MODULE 5
Machine Translation:
Translating in Low-Resource Situations, MT Evaluation, Bias and Ethical Issues.
Translating in Low-Resource Situations
Machine Translation (MT) relies heavily on large, high-quality parallel corpora—collections of
sentence pairs in two languages. However, many languages lack such resources, making them "low-
resource languages" in the context of MT.
1. Limited Parallel Corpora: Many languages, especially less widely used ones or those spoken mainly in low-income regions, do not have extensive parallel corpora available for training MT systems. This scarcity makes it difficult to build
effective translation models because machine learning approaches generally require large datasets
to perform well.
2. Data Sparsity: Even high-resource languages can face challenges when translating into low-
resource domains, where very little data may be available. For instance, a particular genre or field
may not have a substantial amount of text to train on, leading to similar data sparsity issues.
3. Quality of Available Data: Quality concerns may arise not just from the quantity of data but also
from the nature of the data available. Many parallel corpora can contain incorrect translations,
boilerplate phrases, or repetitive sentences, especially if not enough native speakers were involved
in the content creation or quality checks.
Strategies for Addressing Low-Resource Situations
To tackle these issues, two primary approaches are commonly employed in low-resource MT contexts:
1. Backtranslation:
Mechanism:
• Start with a small parallel corpus (bitext) between the source and target languages.
• Train a target-to-source MT model using this bitext.
• Use the trained model to translate monolingual data available in the target language back to the
source language. This creates a synthetic bitext where natural sentences in the target language are
aligned to sentences generated by the MT model.
• This additional synthetic data is then combined with the existing parallel corpus and used to retrain
the original source-to-target MT model.
- Example: If there is only a small bitext for translating Navajo to English, but plenty of monolingual English sentences are available, one could use the English-to-Navajo (target-to-source) model to translate those English sentences into Navajo, producing a synthetic Navajo-English bitext that is added to the training data (see the sketch below).
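The backtranslation loop can be summarized in a few lines of code. The sketch below is a minimal illustration rather than any particular toolkit's API: train_fn and translate_fn are hypothetical placeholders for whatever training and decoding routines the chosen MT framework provides.

```python
def backtranslate(small_bitext, monolingual_target, train_fn, translate_fn):
    """Augment a small source->target bitext via backtranslation.

    small_bitext: list of (source, target) sentence pairs (e.g., Navajo, English).
    monolingual_target: list of target-language (English) sentences.
    train_fn(pairs): hypothetical routine that trains an MT model on sentence pairs.
    translate_fn(model, sentences): hypothetical routine that decodes with that model.
    """
    # 1. Train a target-to-source (English -> Navajo) model on the reversed bitext.
    reversed_pairs = [(tgt, src) for (src, tgt) in small_bitext]
    target_to_source = train_fn(reversed_pairs)

    # 2. Translate monolingual target sentences back into the source language.
    #    This yields a synthetic bitext whose target side is natural text.
    synthetic_sources = translate_fn(target_to_source, monolingual_target)
    synthetic_bitext = list(zip(synthetic_sources, monolingual_target))

    # 3. Combine real and synthetic pairs and retrain the source-to-target model.
    return train_fn(small_bitext + synthetic_bitext)
```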
2. Multilingual Models:
- Using multilingual models can help MT systems become more robust in low-resource settings. These
models can learn from multiple languages simultaneously, allowing for shared learning across related
languages, which can mitigate the issues of low data availability for any single language.
- By leveraging knowledge from high-resource languages, multilingual models can provide better translation capabilities even for the low-resource languages they include (see the sketch below).
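As a concrete illustration of the multilingual approach, the sketch below loads a publicly available many-to-many translation model (M2M100) through the Hugging Face transformers library and translates between an arbitrary language pair. The checkpoint name, language codes, and method names follow that library's documented interface and may differ across versions.

```python
# Requires: pip install transformers sentencepiece torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# One shared model covering ~100 languages, including many low-resource ones.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "hi"  # source language: Hindi
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")

# Force the decoder to start generating in the requested target language (English).
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```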
Sociotechnical Issues
In the context of Translating in Low-Resource Situations, sociotechnical concerns highlight how
human, cultural, and organizational factors influence the quality and equity of MT systems.
2. Evaluation Methods
A. Human Evaluation
Human evaluation is considered the gold standard for MT assessment due to its higher accuracy
compared to automatic methods. Human raters evaluate translations based on fluency and adequacy,
typically using a scoring scale (e.g., 1 to 5 or 1 to 100) to rate various aspects:
- Fluency Rater Scale: Raters may score how intelligible, clear, readable, or natural the output is, using
a numerical scale where low scores denote poor quality and high scores denote high quality.
- Adequacy Rater Scale: Bilingual raters may be given both the source sentence and the proposed
target translation to score how much information from the source is preserved in the target translation.
Ranking Method: Alternatively, raters might be asked to rank candidate translations to determine
preferences between two or more outputs.
B. Statistical Methodology for Human Evaluation
Training human raters is crucial, as those without translation expertise may struggle to distinguish
between fluency and adequacy. Common practices include:
- Removing outlier raters whose scores vary significantly from the group.
- Normalizing scores to ensure consistency across evaluations. Specifically, this involves subtracting the rater's mean from each of that rater's scores and dividing by the standard deviation, yielding a standardized (z-score) evaluation; see the sketch below.
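A minimal sketch of this normalization step (a standard per-rater z-score using the sample standard deviation; individual evaluation campaigns may use slightly different conventions):

```python
from statistics import mean, stdev

def z_normalize(scores):
    """Standardize one rater's scores: subtract that rater's mean and divide by
    that rater's standard deviation, so harsh and lenient raters become comparable."""
    mu = mean(scores)
    sigma = stdev(scores) or 1.0  # guard against a rater who gives identical scores
    return [(s - mu) / sigma for s in scores]

# A harsh rater and a lenient rater scoring the same five translations (1-5 scale):
harsh = [1, 2, 2, 3, 2]
lenient = [3, 4, 4, 5, 4]
print(z_normalize(harsh))    # after normalization both profiles are identical
print(z_normalize(lenient))
```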
3. Automatic Evaluation Metrics
- BLEU scores a candidate translation by its word n-gram overlap with one or more reference translations.
- chrF focuses on character n-gram overlap instead, addressing some limitations of BLEU by giving credit for partial matches and coping better with morphological complexity (see the toy example below).
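The character n-gram idea can be illustrated with a toy function. This is not the full chrF definition, which averages n-gram orders 1-6 and uses a recall-weighted F-beta score, but it shows why morphological variants still earn partial credit.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string (whitespace removed, as chrF typically does)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_ngram_f1(candidate, reference, n=2):
    """Toy character-bigram F1 score between a candidate and a reference."""
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())                # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

# "cats"/"cat" and "are"/"is" differ as words, but most character bigrams still match.
print(char_ngram_f1("the cats are running", "the cat is running"))
```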
BERTScore is a machine translation evaluation metric that leverages pre-trained contextual embeddings
from BERT (or similar transformer models) to compare the semantic similarity between the candidate
and reference translations. Unlike BLEU and chrF, which rely on exact n-gram matches, BERTScore captures similarity in meaning even when the surface wording differs.
Cosine Similarity: BERTScore matches each token in the candidate to its most similar token in the reference (and vice versa) using the cosine similarity of their contextual embeddings, i.e., the dot product of the two vectors divided by the product of their norms, and aggregates these best matches into precision, recall, and F1, as sketched below.
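The sketch below shows the core computation with random stand-in vectors; in real BERTScore the token vectors come from a pretrained transformer, and precision is computed symmetrically by matching candidate tokens against the reference.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bertscore_like_recall(ref_vecs, cand_vecs):
    """Match each reference-token embedding to its best candidate-token embedding
    by cosine similarity, then average the matches (the recall half of BERTScore)."""
    return float(np.mean([max(cosine(r, c) for c in cand_vecs) for r in ref_vecs]))

# Random stand-ins for contextual embeddings (real BERTScore uses BERT-style vectors).
rng = np.random.default_rng(0)
reference_tokens = rng.normal(size=(5, 8))   # 5 reference tokens, 8-dim embeddings
candidate_tokens = rng.normal(size=(6, 8))   # 6 candidate tokens
print(bertscore_like_recall(reference_tokens, candidate_tokens))
```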
4. Statistical Significance Testing
When comparing the performance of two MT systems (e.g., system A and system B), it's vital to
determine if observed differences in their scores are statistically significant.
• A paired bootstrap test can be applied to assess whether the difference in scores is statistically
significant. This involves:
• Creating thousands of pseudo-test sets by randomly sampling with replacement from the original
test set.
• Computing the metric scores (e.g., BLEU, chrF, BERTScore) for each pseudo-test set and determining how frequently one system scores higher than the other.
• This test helps evaluate whether the difference in metrics reflects a true performance improvement or just random variation in the test set (see the sketch below).
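A compact sketch of the paired bootstrap, under the simplifying assumption that the corpus score is the mean of per-sentence metric scores (for a corpus-level metric such as BLEU one would recompute the metric on each pseudo-test set instead):

```python
import random

def paired_bootstrap(scores_a, scores_b, num_samples=10_000, seed=0):
    """scores_a, scores_b: per-sentence metric scores for systems A and B on the
    same test set, paired by sentence. Returns the fraction of bootstrap
    pseudo-test sets on which A's mean score beats B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(num_samples):
        # Sample sentence indices with replacement to form one pseudo-test set.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins_a += mean_a > mean_b
    return wins_a / num_samples

# If A beats B on, say, 97% of pseudo-test sets, the observed improvement is
# unlikely to be explained by random variation in the test set alone.
```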
5. Limitations of Automatic Metrics
• Sensitivity to Word Tokenization:
• BLEU's performance may deteriorate based on how words are tokenized, especially in morphologically rich languages.
• Local Evaluation:
• chrF and BLEU focus on n-gram overlap and are sensitive to local word or character
sequences but fail to capture sentence-level semantics, coherence, or global logical flow.
• Embedding Dependency (BERTScore):
• BERTScore addresses semantic similarity better by using contextual embeddings, but:
• It is computationally more expensive.
• Its results depend on the pretrained model (e.g., BERT-base vs. RoBERTa-large).
• It may overestimate performance when both systems make similar semantic errors.
• Lack of Interpretability:
• Unlike BLEU and chrF, where n-gram matches can be directly observed, BERTScore relies on dense embedding similarities that are difficult to inspect, making its scores harder to interpret.
Bias in Machine Translation
1. Gender Bias
• Machine Translation (MT) systems can perpetuate gender biases present in the training data.
• Example:
• When translating from gender-neutral languages (e.g., Hungarian) into English, MT often
assigns gendered pronouns based on stereotypes.
• A gender-neutral subject like "ő" (Hungarian) may be translated into “he” or “she” depending
on the profession mentioned in the sentence.
• For instance:
• "ő egy ápoló" → "She is a nurse"
• "ő egy vezérigazgató" → "He is a CEO"
• These outputs reflect societal stereotypes and bias in profession-gender associations.
• The issue becomes significant in domains like job applications or media translations, where the
wrong gender assumption can propagate discriminatory ideas.
2. Cultural Stereotypes
• MT systems may reflect and amplify cultural stereotypes because:
• The training data often over-represents dominant cultures.
• Low-diversity datasets fail to capture minority voices or culturally nuanced expressions.
• Bias may appear in:
• Translation of religious, ethnic, or geopolitical content.
• Interpretation of idioms, metaphors, or emotionally charged expressions in a way that
misrepresents the source culture.
Ethical Issues in Machine Translation