Mining the Correlation between Human and Automatic Evaluation at Sentence Level
Yanli Sun
School of Applied Language and Intercultural Studies, Dublin City University
[email protected]
Abstract
Automatic evaluation metrics are fast and cost-effective measures of the quality of a Machine Translation (MT) system. However, as humans are the end users of MT output, human judgement is the benchmark against which the usefulness of automatic evaluation metrics is assessed. While most studies report the correlation between human evaluation and automatic evaluation at corpus level, our study examines their correlation at sentence level. In addition to statistical correlation scores, such as Spearman's rank-order correlation coefficient, this study also reports a finer-grained and more detailed examination of the sensitivity of automatic metrics compared with human evaluation. The results show that the threshold at which human evaluators agree with the judgements of automatic metrics varies from metric to metric at sentence level. Even when the automatic scores for two translations differ greatly, human evaluators may consider the translations qualitatively similar, and vice versa. This detailed analysis of the correlation between automatic and human evaluation allows us to determine with increased confidence whether an increase in the automatic scores will be endorsed by human evaluators or not.
Altogether 570 sentences were randomly selected as the test sample. The Chinese reference of the test sample was extracted from the company's Translation Memory. Four MT systems (one rule-based system and three statistical systems) were employed to translate the test sample into Chinese for comparison. Both human and automatic evaluation were applied in order to rank the quality of the output from the four systems. Four professional translators were employed to rank the outputs from 1 to 4 (1 being the best, 4 being the worst), sentence by sentence.

BLEU, TER and GTM (General Text Matcher, an implementation of precision and recall) were used to obtain automatic scores for each translation at both corpus level and sentence level. The reasons for using these three metrics are: first, they can be used (and have been used) to evaluate output in Asian languages (in this paper, Chinese); second, they are among the most widely used metrics in the field; third, they are relatively easy and cost-effective to use. There are also many other automatic metrics, such as Meteor (Banerjee & Lavie, 2005) and TERp (Snover et al., 2009). However, additional resources are needed to get the best advantage from these metrics. For example, Meteor functions better with a database of synonyms, such as WordNet for English, while TERp requires paraphrases, which also function as "synonyms" of phrases. Since such resources were not available for Chinese in our pilot project, these metrics were not employed in this paper. The next section compares the scores from the automatic metrics with the rankings from the human evaluators to check how consistent the two evaluation methods are at sentence level, with a more detailed analysis following in Section 4.

3. Correlation Check
The correlation between automatic evaluation and human evaluation at sentence level was obtained following the practice of Callison-Burch et al. (2008). As mentioned earlier, we have 570 source English sentences translated by four MT systems into Chinese. Therefore, for each source English sentence, four translations are produced, which are ranked by the four professional translators and scored by the three automatic evaluation metrics. In other words, there are 570 groups (with four items per group), each of which contains four columns of rankings from the four human evaluators and three columns of scores from the three automatic metrics. Figure 1 below shows a sample of the final results sheet; L1, L2, L3 and L4 in Figure 1 refer to the four human evaluators respectively.

Figure 1: Sample of the Final Results Sheet
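For concreteness, the results sheet can be thought of as a simple nested structure. The sketch below is a minimal illustration in Python; the class and field names are hypothetical and not taken from the paper.

    from dataclasses import dataclass
    from typing import Dict, List

    # One row of the results sheet: a single system's translation of one source
    # sentence, with its four human ranks (1 = best, 4 = worst) and its three
    # sentence-level automatic scores.  Names and values are illustrative only.
    @dataclass
    class Item:
        system: str                      # e.g. "RBMT", "SMT-1", ...
        human_ranks: Dict[str, int]      # e.g. {"L1": 2, "L2": 1, "L3": 2, "L4": 3}
        auto_scores: Dict[str, float]    # e.g. {"GTM": 0.41, "TER": 0.55, "BLEU": 0.18}

    # One group = the four competing translations of one source sentence;
    # the full sheet = 570 such groups.
    Group = List[Item]
    results: List[Group] = []

The code sketches that follow assume this layout.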
One approach to computing the correlation is Spearman's rank correlation coefficient (ρ). The procedure is as follows: first, the scores assigned by the automatic metrics are converted into rankings as well; second, for each of the 570 groups, the ρ value between each automatic metric and each human evaluator is calculated over the four items; third, all the ρ values are averaged to obtain the mean ρ value between each metric and each human evaluator. Table 1 below reports the correlation values obtained with this method.

        L1    L2    L3    L4    Average
GTM    0.32  0.50  0.14  0.26    0.30
TER    0.33  0.48  0.12  0.24    0.29
BLEU   0.34  0.44  0.13  0.26    0.29

Table 1: Spearman's Correlation between Automatic and Human Evaluation
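As an illustration of the group-wise averaging just described, the following Python sketch uses scipy.stats.spearmanr over the hypothetical data layout shown earlier; it is not the author's actual implementation, and the handling of TER's lower-is-better orientation is an assumption.

    import math
    from statistics import mean
    from scipy.stats import spearmanr

    def mean_spearman(results, metric, evaluator):
        """Average Spearman's rho between one automatic metric and one human
        evaluator over all groups (four translations per group)."""
        rhos = []
        for group in results:
            human_ranks = [item.human_ranks[evaluator] for item in group]
            scores = [item.auto_scores[metric] for item in group]
            # Human ranks use 1 = best.  Orient the scores so that smaller also
            # means better: TER is already an error rate; GTM and BLEU are negated.
            oriented = scores if metric == "TER" else [-s for s in scores]
            rho, _ = spearmanr(human_ranks, oriented)
            if not math.isnan(rho):   # rho is undefined if a metric ties all four items
                rhos.append(rho)
        return mean(rhos)

    # e.g. mean_spearman(results, "GTM", "L1")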
However, the validity of this approach was questioned by Callison-Burch et al. (2008), who argued that deriving a general correlation value by averaging ρ values computed over such a limited number of items (here only four) is not appropriate. Instead, in their study they conducted pair-wise comparisons of any two outputs, examining whether the automatic scores were consistent with the human rankings for each pair (that is, whether the higher-ranked output received the higher score). Following this approach, the 570 groups were expanded into 3420 pairs (each of the 570 groups can be expanded into 6 pairs). For each automatic metric, the total number of consistent evaluations was divided by the total number of comparisons to obtain a percentage. Table 2 reports the consistency.

        L1    L2    L3    L4    Average
GTM    0.61  0.68  0.71  0.66    0.66
TER    0.58  0.64  0.70  0.64    0.64
BLEU   0.51  0.55  0.65  0.59    0.56

Table 2: Consistency of Automatic Evaluation with Human Evaluation

Table 2 indicates that these automatic metrics can correctly predict the human ranking of a pair of translations more than half the time. GTM correlates better with human evaluation than BLEU and TER at sentence level in Chinese output evaluation. Similar findings have been reported by Cahill (2009) for German, in a study that compared six metrics including the three used in this paper. In addition, Agarwal and Lavie (2008) also noted that GTM and TER can produce more reliable sentence-level scores than BLEU.
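A sketch of the pair-wise consistency computation, again over the hypothetical data layout above. How human ties and tied metric scores are treated is not specified in the paper, so the conventions marked in the comments are assumptions.

    from itertools import combinations

    def pairwise_consistency(results, metric, evaluator):
        """Fraction of translation pairs for which the automatic metric and one
        human evaluator prefer the same output (in the style of Callison-Burch
        et al., 2008)."""
        consistent, total = 0, 0
        for group in results:
            for a, b in combinations(group, 2):        # 6 pairs per group of four
                rank_a, rank_b = a.human_ranks[evaluator], b.human_ranks[evaluator]
                if rank_a == rank_b:                   # human tie: skipped in this sketch
                    continue
                score_a, score_b = a.auto_scores[metric], b.auto_scores[metric]
                if metric == "TER":                    # lower TER = better translation
                    score_a, score_b = -score_a, -score_b
                human_prefers_a = rank_a < rank_b      # rank 1 = best
                metric_prefers_a = score_a > score_b
                # Pairs with tied metric scores are counted as inconsistent here.
                if score_a != score_b and human_prefers_a == metric_prefers_a:
                    consistent += 1
                total += 1
        return consistent / total if total else 0.0

    # e.g. pairwise_consistency(results, "BLEU", "L2")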
4. Further Analysis
As shown in Table 2, even for the best-correlated metric, GTM, there is only 66% consistency, indicating a large amount of discrepancy between humans and automatic evaluation metrics in ranking the quality of different translations. In order to further investigate this consistency and inconsistency at sentence level, we conducted a micro-analysis of the cases where humans and automatic metrics agree or disagree on the rankings of two translations.

Given two translations of a source sentence, each of which is associated with an automatic score, the two scores can suggest a difference in the quality of the two translations. However, humans may or may not agree with the difference registered by the automatic metrics. Nevertheless, intuitively, the greater the difference between the two scores, the more likely it is that humans will agree with the automatic metrics. To examine this, for any two translations of a source sentence, the difference between the two corresponding automatic evaluation scores can be assigned to one of a set of difference scales, as shown in Table 3; for example, a pair whose GTM scores differ by 0.15 falls into the 0.1-0.2 scale.

Scale      GTM    TER   BLEU
0.9-1.0      /      /     18
0.8-0.9      /      /      7
0.7-0.8      /      /     28
0.6-0.7      /      4     35
0.5-0.6      4     11     58
0.4-0.5     12     52    137
0.3-0.4     73    127    201
0.2-0.3    232    278    261
0.1-0.2    627    659    364
0.0-0.1   1484   1026    776

Table 3: Number of Pairs Distributed in each Difference Scale of each Automatic Metric ("/" indicates that no pairs fall into that scale)
Table 3 shows that the difference between the automatic scores of two different translations is mostly quite small. For example, 61.02% of the pairs for which GTM assigns different scores have a difference below 0.1; the corresponding figures are 47.57% for TER and 41.17% for BLEU.

It is worth pointing out that the scales refer to the difference between the two scores of a pair of outputs, not to the scale of the scores themselves. The purpose of setting up these difference scales is to see whether the greater the difference between two scores, the more likely humans are to agree with the automatic metrics. For each of the three automatic evaluation metrics, we consider the following three scenarios: 1) the number of pairs for which the human rankings are consistent with the scores assigned to the translations by the automatic metric ("Humans Agree"); 2) the number of pairs for which the human rankings are contrary to the scores assigned by the automatic metric ("Humans Disagree"); 3) the number of pairs for which, although the two translations are different and received two different automatic scores, humans do not consider them qualitatively different and rank the pair as a tie ("Humans Assign Ties") (see Figures 2, 3 and 4).
Figure 2: Distribution of Human Evaluation within GTM Difference Scales
Figure 3: Distribution of Human Evaluation within TER Difference Scales
Figure 4: Distribution of Human Evaluation within BLEU Difference Scales

The height of the solid grey bars in Figures 2 to 4 shows that for GTM (Figure 2) it is indeed the case that the greater the difference between two automatic scores, the more often humans agree with the judgements of GTM, while the smaller the difference, the more often humans disagree with them. By contrast, even with very large TER or BLEU score differences, humans may still disagree with the judgement of TER (Figure 3) or BLEU (Figure 4). In this experiment, when the difference between two GTM scores is greater than 0.11, the majority of the human evaluators agree with the judgement of the GTM score about which translation is better. On average, the difference between two TER scores or two BLEU scores has to be greater than 0.18 and 0.29 respectively before the majority of the human evaluators agree with the judgement of these automatic metrics.
Figures 2 to 4 also show that different evaluators apply different criteria when judging the quality of different translations. As can be seen from the figures, L3 assigned many more ties in the pair-wise comparison than the other evaluators. The inter-evaluator agreement among the four human evaluators was measured using the Kappa coefficient (K), a measurement of agreement on categorical data (Boslaugh & Watters, 2008). One widely accepted interpretation of Kappa was proposed by Landis and Koch (1977): 0-.2 is slight agreement, .2-.4 is fair, .4-.6 is moderate, .6-.8 is substantial and .8-1 is almost perfect agreement. Using the Microsoft Kappa Calculator template (King, 2004), the inter-evaluator agreement among the four human evaluators is K = .273. Excluding evaluator L3, the K value increases to .381.
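The paper obtained its Kappa values with the Microsoft Kappa Calculator template (King, 2004). As an illustration only, a common multi-rater formulation, Fleiss' kappa, is sketched below, assuming each translation pair is an item and each evaluator assigns it one of three categories (first output better, second better, tie); this is not necessarily the formula implemented by that template.

    def fleiss_kappa(items):
        """Fleiss' kappa for multi-rater agreement on categorical judgements.

        `items` is a list of dicts, one per rated item (here: one per translation
        pair), mapping a category label (e.g. "A better", "B better", "tie") to
        the number of evaluators who chose it.  Every item must be rated by the
        same number of evaluators.
        """
        n_items = len(items)
        n_raters = sum(items[0].values())
        categories = {c for item in items for c in item}

        # Mean per-item observed agreement.
        p_bar = sum(
            (sum(v * v for v in item.values()) - n_raters)
            / (n_raters * (n_raters - 1))
            for item in items
        ) / n_items

        # Chance agreement from the overall category proportions.
        p_e = sum(
            (sum(item.get(c, 0) for item in items) / (n_items * n_raters)) ** 2
            for c in categories
        )
        return (p_bar - p_e) / (1 - p_e)

    # e.g. fleiss_kappa([{"A better": 3, "tie": 1}, {"B better": 4}, ...])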
Generally speaking, even when there are only slight differences between two translations, the automatic metrics can generate different scores for them. However, there are also cases where the automatic scores are the same for two different translations. In this experiment, we found that for some pairs of different translations to which the automatic metrics assigned identical scores, humans did not consider the translations qualitatively different either. On the other hand, there are other pairs that were judged qualitatively different by humans but not by the automatic metrics. For each automatic metric, we counted the pairs that received the same automatic scores but different rankings from the human evaluators. As there are four human evaluators, only those pairs that were differentiated by the majority of the human evaluators (i.e. three or more evaluators assigned different rankings to the two translations in a pair) were taken into consideration. Table 4 contains the total number of pairs for which the automatic metrics made no differentiation but humans did.

         GTM   TER   BLEU
# pairs  141   209   331

Table 4: Number of Pairs of Translations Differentiated by Humans but not by the Automatic Metrics
GTM appears to have the smallest number of such undifferentiated pairs, demonstrating a stronger ability to differentiate translations at sentence level, more in line with the human evaluation, while BLEU left a large number of pairs undifferentiated, showing its weakness at sentence-level evaluation relative to the human evaluation. This finding shows that in some cases automatic evaluation cannot reflect differences between two translations that are apparent in the human assessments. Hence, if two scores show no difference, it does not always mean that there is no qualitative difference between the two translations.
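The counts in Table 4 could be reproduced with a sketch like the following (hypothetical data layout as above; "differentiated by humans" means three or more of the four evaluators ranked the two translations differently, following the paper).

    from itertools import combinations

    def undifferentiated_by_metric(results, metric,
                                   evaluators=("L1", "L2", "L3", "L4")):
        """Count pairs that receive identical scores from `metric` but are
        ranked differently by a majority (three or more) of the evaluators."""
        count = 0
        for group in results:
            for a, b in combinations(group, 2):
                if a.auto_scores[metric] != b.auto_scores[metric]:
                    continue                    # the metric does differentiate this pair
                differentiating = sum(
                    a.human_ranks[e] != b.human_ranks[e] for e in evaluators
                )
                if differentiating >= 3:        # majority of the four evaluators
                    count += 1
        return count

    # e.g. {m: undifferentiated_by_metric(results, m) for m in ("GTM", "TER", "BLEU")}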
5. Conclusion and Future Work
It is well known that accurate automatic evaluation metrics at sentence level can help MT developers determine which sentence structures their MT system can or cannot deal with appropriately. This study examines the correlation between automatic evaluation and human evaluation at sentence level for Chinese translation evaluation. Several conclusions can be drawn: first, for the evaluation of Chinese translations of English technical documents, GTM correlates better with human evaluation at sentence level than TER and BLEU do; second, only when the difference between two scores is greater than a certain value will the majority of human evaluators agree with the judgement of the automatic metrics; third, when the automatic scores of two translations are the same, it does not always mean that there is no qualitative difference between the translations. Some questions remain unanswered: first, the statistical significance of the correlation and consistency figures has not been examined; second, we are aware that the correlation between human and automatic evaluation may vary depending on the MT system involved, but no such distinction was made in this study. There is therefore considerable further work to be done. In addition, we have shown that for a considerable number of pairs, human judgements are inconsistent with the automatic metrics. In future work we plan to analyse the causes of such discrepancies in an attempt to identify linguistically motivated patterns that may benefit the design of automatic metrics. Finally, although human evaluation has been regarded as the gold standard in MT evaluation, the results in this paper also reveal some problems with human evaluation. How to standardize human evaluation is another question worth exploring in the future.

Acknowledgement
This work was financed by Enterprise Ireland and Symantec Corporation (Ireland). The author would like to thank Dr. Fred Hollowood for his inspiring ideas and suggestions, and Dr. Sharon O'Brien, Dr. Minako O'Hagan and Dr. Johann Roturier for their valuable corrections and comments. Thanks also to the anonymous reviewers for their insightful comments. The author is responsible for any remaining errors.

References
Agarwal, A. & Lavie, A. (2008). 'Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 115-118.
Banerjee, S. & Lavie, A. (2005). 'METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments'. In Proceedings of the ACL-2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan, pp. 65-72.
Boslaugh, S. & Watters, P.A. (2008). 'Statistics in a Nutshell'. O'Reilly Media, Inc., United States of America.
Cahill, A. (2009). 'Correlating Human and Automatic Evaluation of a German Surface Realiser'. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Suntec, Singapore, August, pp. 97-100.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C. & Schroeder, J. (2008). 'Further Meta-evaluation of Machine Translation'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 70-106.
Coughlin, D. (2001). 'Correlating Automated and Human Assessments of Machine Translation Quality'. In Proceedings of MT Summit IX, Santiago de Compostela, Spain, September, pp. 63-70.
Duh, K. (2008). 'Ranking vs. Regression in Machine Translation Evaluation'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 191-194.
King, J.E. (2004). 'Software Solutions for Obtaining a Kappa-type Statistic for Use with Multiple Raters'. Presented at the Annual Meeting of the Southwest Educational Research Association, Dallas, TX.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. & Herbst, E. (2007). 'Moses: Open Source Toolkit for Statistical Machine Translation'. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, June, pp. 177-180.
Landis, J.R. & Koch, G.G. (1977). 'The Measurement of Observer Agreement for Categorical Data'. Biometrics, 33:159-174.
LDC (2005). Linguistic Data Annotation Specification: Assessment of fluency and adequacy in translations. http://projects.ldc.upenn.edu/TIDES/tidesmt.html.
Lin, C. & Och, F.J. (2004). 'ORANGE: A Method for Evaluating Automatic Evaluation Metrics for Machine Translation'. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, August, pp. 501-508.
Papineni, K., Roukos, S., Ward, T. & Zhu, W. (2001). 'BLEU: A Method for Automatic Evaluation of Machine Translation'. Research Report RC22176 (W0109-022), IBM T.J. Watson Research Center, September.
Russo-Lassner, G., Lin, J. & Resnik, P. (2005). 'A Paraphrase-Based Approach to Machine Translation Evaluation'. Technical report, University of Maryland, College Park.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Weischedel, R. (2006). 'A Study of Translation Edit Rate with Targeted Human Annotation'. In Proceedings of AMTA, Cambridge, MA, August, pp. 223-231.
Snover, M., Madnani, N., Dorr, B.J. & Schwartz, R. (2009). 'Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric'. In Proceedings of the EACL-2009 Workshop on Statistical Machine Translation (WMT09), Athens, pp. 259-268.
Turian, J.P., Shen, L. & Melamed, I.D. (2003). 'Evaluation of Machine Translation and its Evaluation'. In Proceedings of MT Summit IX, New Orleans, LA, September, pp. 386-393.
Vilar, D., Leusch, G., Ney, H. & Bachs, R. (2007). 'Human Evaluation of Machine Translation Through Binary System Comparisons'. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, June, pp. 96-103.