MODULE 5
Machine Translation:
Translating in Low-Resource Situations, MT Evaluation, Bias and Ethical Issues.
Translating in Low-Resource Situations
Machine Translation (MT) relies heavily on large, high-quality parallel corpora—collections of
sentence pairs in two languages. However, many languages lack such resources, making them "low-
resource languages" in the context of MT.
1. Limited Parallel Corpora: Many languages, especially less widely used ones or those spoken mainly in low-income regions, do not have extensive parallel corpora available for training MT systems. This scarcity makes it difficult to build
effective translation models because machine learning approaches generally require large datasets
to perform well.
2. Data Sparsity: Even high-resource languages can face challenges when translating into low-
resource domains, where very little data may be available. For instance, a particular genre or field
may not have a substantial amount of text to train on, leading to similar data sparsity issues.
3. Quality of Available Data: Quality concerns may arise not just from the quantity of data but also
from the nature of the data available. Many parallel corpora can contain incorrect translations,
boilerplate phrases, or repetitive sentences, especially if not enough native speakers were involved
in the content creation or quality checks.
Strategies for Addressing Low-Resource Situations
To tackle these issues, two primary approaches are commonly employed in low-resource MT contexts:
1. Backtranslation:
Mechanism:
• Start with a small parallel corpus (bitext) between the source and target languages.
• Train a target-to-source MT model using this bitext.
• Use the trained model to translate monolingual data available in the target language back to the
source language. This creates a synthetic bitext where natural sentences in the target language are
aligned to sentences generated by the MT model.
• This additional synthetic data is then combined with the existing parallel corpus and used to retrain
the original source-to-target MT model.
- Example: If there is only a small bitext for translating Navajo to English, but plenty of monolingual English sentences are available, one could use the English-to-Navajo (target-to-source) model to translate those English sentences into Navajo, producing a synthetic Navajo-English bitext that is added to the training data (see the sketch below).
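The backtranslation loop can be summarized in a few lines of code. The sketch below is a minimal illustration rather than any particular toolkit's API: train_fn and translate_fn are hypothetical placeholders for whatever training and decoding routines the chosen MT framework provides.

```python
def backtranslate(small_bitext, monolingual_target, train_fn, translate_fn):
    """Augment a small source->target bitext via backtranslation.

    small_bitext: list of (source, target) sentence pairs (e.g., Navajo, English).
    monolingual_target: list of target-language (English) sentences.
    train_fn(pairs): hypothetical routine that trains an MT model on sentence pairs.
    translate_fn(model, sentences): hypothetical routine that decodes with that model.
    """
    # 1. Train a target-to-source (English -> Navajo) model on the reversed bitext.
    reversed_pairs = [(tgt, src) for (src, tgt) in small_bitext]
    target_to_source = train_fn(reversed_pairs)

    # 2. Translate monolingual target sentences back into the source language.
    #    This yields a synthetic bitext whose target side is natural text.
    synthetic_sources = translate_fn(target_to_source, monolingual_target)
    synthetic_bitext = list(zip(synthetic_sources, monolingual_target))

    # 3. Combine real and synthetic pairs and retrain the source-to-target model.
    return train_fn(small_bitext + synthetic_bitext)
```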
2. Multilingual Models:
- Using multilingual models can help MT systems become more robust in low-resource settings. These
models can learn from multiple languages simultaneously, allowing for shared learning across related
languages, which can mitigate the issues of low data availability for any single language.
- By leveraging knowledge from high-resource languages, multilingual models can provide better translation capabilities even for the low-resource languages they include (see the sketch below).
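As a concrete illustration of the multilingual approach, the sketch below loads a publicly available many-to-many translation model (M2M100) through the Hugging Face transformers library and translates between an arbitrary language pair. The checkpoint name, language codes, and method names follow that library's documented interface and may differ across versions.

```python
# Requires: pip install transformers sentencepiece torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

# One shared model covering ~100 languages, including many low-resource ones.
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "hi"  # source language: Hindi
encoded = tokenizer("जीवन एक चॉकलेट बॉक्स की तरह है।", return_tensors="pt")

# Force the decoder to start generating in the requested target language (English).
generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```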
Sociotechnical Issues
In the context of Translating in Low-Resource Situations, sociotechnical concerns highlight how
human, cultural, and organizational factors influence the quality and equity of MT systems.
2. Evaluation Methods
A. Human Evaluation
Human evaluation is considered the gold standard for MT assessment due to its higher accuracy
compared to automatic methods. Human raters evaluate translations based on fluency and adequacy,
typically using a scoring scale (e.g., 1 to 5 or 1 to 100) to rate various aspects:
- Fluency Rater Scale: Raters may score how intelligible, clear, readable, or natural the output is, using
a numerical scale where low scores denote poor quality and high scores denote high quality.
- Adequacy Rater Scale: Bilingual raters may be given both the source sentence and the proposed
target translation to score how much information from the source is preserved in the target translation.
Ranking Method: Alternatively, raters might be asked to rank candidate translations to determine
preferences between two or more outputs.
B. Statistical Methodology for Human Evaluation
Training human raters is crucial, as those without translation expertise may struggle to distinguish
between fluency and adequacy. Common practices include:
- Removing outlier raters whose scores vary significantly from the group.
- Normalizing scores to ensure consistency across evaluations. Specifically, this involves subtracting the rater's mean from each of that rater's scores and dividing by the standard deviation, yielding a standardized (z-score) evaluation; see the sketch below.
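A minimal sketch of this normalization step (a standard per-rater z-score using the sample standard deviation; individual evaluation campaigns may use slightly different conventions):

```python
from statistics import mean, stdev

def z_normalize(scores):
    """Standardize one rater's scores: subtract that rater's mean and divide by
    that rater's standard deviation, so harsh and lenient raters become comparable."""
    mu = mean(scores)
    sigma = stdev(scores) or 1.0  # guard against a rater who gives identical scores
    return [(s - mu) / sigma for s in scores]

# A harsh rater and a lenient rater scoring the same five translations (1-5 scale):
harsh = [1, 2, 2, 3, 2]
lenient = [3, 4, 4, 5, 4]
print(z_normalize(harsh))    # after normalization both profiles are identical
print(z_normalize(lenient))
```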
3. Automatic Evaluation Metrics
- BLEU scores a candidate translation by its word n-gram overlap with one or more reference translations.
- chrF focuses on character n-gram overlap instead, addressing some limitations of BLEU by giving credit for partial matches and coping better with morphological complexity (see the toy example below).
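The character n-gram idea can be illustrated with a toy function. This is not the full chrF definition, which averages n-gram orders 1-6 and uses a recall-weighted F-beta score, but it shows why morphological variants still earn partial credit.

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string (whitespace removed, as chrF typically does)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def char_ngram_f1(candidate, reference, n=2):
    """Toy character-bigram F1 score between a candidate and a reference."""
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())                # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)

# "cats"/"cat" and "are"/"is" differ as words, but most character bigrams still match.
print(char_ngram_f1("the cats are running", "the cat is running"))
```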
BERTScore is a machine translation evaluation metric that leverages pre-trained contextual embeddings
from BERT (or similar transformer models) to compare the semantic similarity between the candidate
and reference translations. Unlike BLEU and chrF, which rely on exact n-gram matches, BERTScore captures similarity in meaning even when the surface wording differs.
Cosine Similarity: BERTScore matches each token in the candidate to its most similar token in the reference (and vice versa) using the cosine similarity of their contextual embeddings, i.e., the dot product of the two vectors divided by the product of their norms, and aggregates these best matches into precision, recall, and F1, as sketched below.
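The sketch below shows the core computation with random stand-in vectors; in real BERTScore the token vectors come from a pretrained transformer, and precision is computed symmetrically by matching candidate tokens against the reference.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def bertscore_like_recall(ref_vecs, cand_vecs):
    """Match each reference-token embedding to its best candidate-token embedding
    by cosine similarity, then average the matches (the recall half of BERTScore)."""
    return float(np.mean([max(cosine(r, c) for c in cand_vecs) for r in ref_vecs]))

# Random stand-ins for contextual embeddings (real BERTScore uses BERT-style vectors).
rng = np.random.default_rng(0)
reference_tokens = rng.normal(size=(5, 8))   # 5 reference tokens, 8-dim embeddings
candidate_tokens = rng.normal(size=(6, 8))   # 6 candidate tokens
print(bertscore_like_recall(reference_tokens, candidate_tokens))
```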
4. Statistical Significance Testing
When comparing the performance of two MT systems (e.g., system A and system B), it's vital to
determine if observed differences in their scores are statistically significant.
• A paired bootstrap test can be applied to assess whether the difference in scores is statistically
significant. This involves:
• Creating thousands of pseudo-test sets by randomly sampling with replacement from the original
test set.
• Computing the metric scores (e.g., BLEU, chrF, BERTScore) for each pseudo-test set and determining how frequently one system scores higher than the other.
• This test helps evaluate whether the difference in metrics reflects a true performance improvement or just random variation in the test set (see the sketch below).
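A compact sketch of the paired bootstrap, under the simplifying assumption that the corpus score is the mean of per-sentence metric scores (for a corpus-level metric such as BLEU one would recompute the metric on each pseudo-test set instead):

```python
import random

def paired_bootstrap(scores_a, scores_b, num_samples=10_000, seed=0):
    """scores_a, scores_b: per-sentence metric scores for systems A and B on the
    same test set, paired by sentence. Returns the fraction of bootstrap
    pseudo-test sets on which A's mean score beats B's."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = 0
    for _ in range(num_samples):
        # Sample sentence indices with replacement to form one pseudo-test set.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins_a += mean_a > mean_b
    return wins_a / num_samples

# If A beats B on, say, 97% of pseudo-test sets, the observed improvement is
# unlikely to be explained by random variation in the test set alone.
```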
5. Limitations of Automatic Metrics
• Sensitivity to Word Tokenization:
• BLEU's performance may deteriorate based on how words are tokenized, especially in morphologically rich languages.
• Local Evaluation:
• chrF and BLEU focus on n-gram overlap and are sensitive to local word or character
sequences but fail to capture sentence-level semantics, coherence, or global logical flow.
• Embedding Dependency (BERTScore):
• BERTScore addresses semantic similarity better by using contextual embeddings, but:
• It is computationally more expensive.
• Its results depend on the pretrained model (e.g., BERT-base vs. RoBERTa-large).
• It may overestimate performance when both systems make similar semantic errors.
• Lack of Interpretability:
• Unlike BLEU and chrF, where n-gram matches can be directly observed, BERTScore relies on dense embedding similarities that are difficult to inspect, making its scores harder to interpret.
Bias in Machine Translation
1. Gender Bias
• Machine Translation (MT) systems can perpetuate gender biases present in the training data.
• Example:
• When translating from gender-neutral languages (e.g., Hungarian) into English, MT often
assigns gendered pronouns based on stereotypes.
• A gender-neutral subject like "ő" (Hungarian) may be translated into “he” or “she” depending
on the profession mentioned in the sentence.
• For instance:
• "ő egy ápoló" → "She is a nurse"
• "ő egy vezérigazgató" → "He is a CEO"
• These outputs reflect societal stereotypes and bias in profession-gender associations.
• The issue becomes significant in domains like job applications or media translations, where the
wrong gender assumption can propagate discriminatory ideas.
2. Cultural Stereotypes
• MT systems may reflect and amplify cultural stereotypes because:
• The training data often over-represents dominant cultures.
• Low-diversity datasets fail to capture minority voices or culturally nuanced expressions.
• Bias may appear in:
• Translation of religious, ethnic, or geopolitical content.
• Interpretation of idioms, metaphors, or emotionally charged expressions in a way that
misrepresents the source culture.
Ethical Issues in Machine Translation