Mining the Correlation between Human and Automatic Evaluation at Sentence Level
Yanli Sun
School of Applied Language and Intercultural Studies, Dublin City University
[email protected]
Abstract
Automatic evaluation metrics are fast and cost-effective measures of the quality of a Machine Translation (MT) system. However, as humans are the end users of MT output, human judgement is the benchmark against which the usefulness of automatic evaluation metrics is assessed. While most studies report the correlation between human evaluation and automatic evaluation at corpus level, our study examines their correlation at sentence level. In addition to statistical correlation scores, such as Spearman's rank-order correlation coefficient, this study also reports a finer-grained and more detailed examination of the sensitivity of automatic metrics compared with human evaluation. The results show that the threshold at which human evaluators agree with the judgements of automatic metrics varies from metric to metric at sentence level. Even when the automatic scores for two translations differ greatly, human evaluators may consider the translations qualitatively similar, and vice versa. This detailed analysis of the correlation between automatic and human evaluation allows us to determine with increased confidence whether an increase in the automatic scores will be endorsed by human evaluators or not.
Altogether 570 sentences were randomly selected as the test sample. The Chinese reference of the test sample was extracted from the company's Translation Memory. Four MT systems (one rule-based system and three statistical systems) were employed to translate the test sample into Chinese for comparison. Both human and automatic evaluation were applied in order to rank the quality of the output from the four systems. Four professional translators were employed to rank the outputs from 1 to 4 (1 being the best, 4 being the worst), sentence by sentence.

BLEU, TER and GTM (General Text Matcher, an implementation of precision and recall) were used to obtain automatic scores for each translation at both corpus level and sentence level. The reasons for using these three metrics are: first, they can be used (and have been used) to evaluate output in Asian languages (in this paper, Chinese); second, they are among the most widely used metrics in the field; third, they are relatively easy and cost-effective to use. There are also many other automatic metrics, such as Meteor (Banerjee & Lavie, 2005) and TERp (Snover et al., 2009). However, additional resources are needed to get the best advantage from these metrics. For example, Meteor functions better with a database of synonyms, such as WordNet for English, while TERp requires paraphrases, which also function as "synonyms" of phrases. Since such resources were not available for Chinese in our pilot project, these metrics were not employed in this paper. The next section compares the scores from the automatic metrics with the rankings from the human evaluators to check how consistent the two evaluation methods are at sentence level, with a more detailed analysis following in Section 4.

3. Correlation Check
The correlation between automatic evaluation and human evaluation at sentence level was obtained following the practice of Callison-Burch et al. (2008). As mentioned earlier, we have 570 source English sentences translated by four MT systems into Chinese. Therefore, for each source English sentence, four translations are produced, which are ranked by the four professional translators and scored by the three automatic evaluation metrics. In other words, there are 570 groups (with four items per group), each of which contains four columns of rankings from the four human evaluators and three columns of scores from the three automatic metrics. Figure 1 below shows a sample of the final results sheet; L1, L2, L3 and L4 in Figure 1 refer to the four human evaluators respectively.

Figure 1: Sample of the Final Results Sheet
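For concreteness, the results sheet can be thought of as a simple nested structure. The sketch below is a minimal illustration in Python; the class and field names are hypothetical and not taken from the paper.

    from dataclasses import dataclass
    from typing import Dict, List

    # One row of the results sheet: a single system's translation of one source
    # sentence, with its four human ranks (1 = best, 4 = worst) and its three
    # sentence-level automatic scores.  Names and values are illustrative only.
    @dataclass
    class Item:
        system: str                      # e.g. "RBMT", "SMT-1", ...
        human_ranks: Dict[str, int]      # e.g. {"L1": 2, "L2": 1, "L3": 2, "L4": 3}
        auto_scores: Dict[str, float]    # e.g. {"GTM": 0.41, "TER": 0.55, "BLEU": 0.18}

    # One group = the four competing translations of one source sentence;
    # the full sheet = 570 such groups.
    Group = List[Item]
    results: List[Group] = []

The code sketches that follow assume this layout.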
One approach to computing the correlation is Spearman's rank correlation coefficient (ρ). The procedure is as follows: first, the scores assigned by the automatic metrics are converted into rankings as well; second, for each of the 570 groups, the ρ value between each automatic metric and each human evaluator is calculated over the four items; third, all the ρ values are averaged to obtain the mean ρ value between each metric and each human evaluator. Table 1 below reports the correlation values obtained with this method.

        L1    L2    L3    L4    Average
GTM    0.32  0.50  0.14  0.26    0.30
TER    0.33  0.48  0.12  0.24    0.29
BLEU   0.34  0.44  0.13  0.26    0.29

Table 1: Spearman's Correlation between Automatic and Human Evaluation
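As an illustration of the group-wise averaging just described, the following Python sketch uses scipy.stats.spearmanr over the hypothetical data layout shown earlier; it is not the author's actual implementation, and the handling of TER's lower-is-better orientation is an assumption.

    import math
    from statistics import mean
    from scipy.stats import spearmanr

    def mean_spearman(results, metric, evaluator):
        """Average Spearman's rho between one automatic metric and one human
        evaluator over all groups (four translations per group)."""
        rhos = []
        for group in results:
            human_ranks = [item.human_ranks[evaluator] for item in group]
            scores = [item.auto_scores[metric] for item in group]
            # Human ranks use 1 = best.  Orient the scores so that smaller also
            # means better: TER is already an error rate; GTM and BLEU are negated.
            oriented = scores if metric == "TER" else [-s for s in scores]
            rho, _ = spearmanr(human_ranks, oriented)
            if not math.isnan(rho):   # rho is undefined if a metric ties all four items
                rhos.append(rho)
        return mean(rhos)

    # e.g. mean_spearman(results, "GTM", "L1")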
However, the validity of this approach was questioned by Callison-Burch et al. (2008), who argued that deriving a general correlation value by averaging ρ values computed over such a limited number of items (here only four) is not appropriate. Instead, in their study they conducted pair-wise comparisons of any two outputs, examining whether the automatic scores were consistent with the human rankings for each pair (that is, whether the higher-ranked output received the higher score). Following this approach, the 570 groups were expanded into 3420 pairs (each of the 570 groups can be expanded into 6 pairs). For each automatic metric, the total number of consistent evaluations was divided by the total number of comparisons to obtain a percentage. Table 2 reports the consistency.

        L1    L2    L3    L4    Average
GTM    0.61  0.68  0.71  0.66    0.66
TER    0.58  0.64  0.70  0.64    0.64
BLEU   0.51  0.55  0.65  0.59    0.56

Table 2: Consistency of Automatic Evaluation with Human Evaluation

Table 2 indicates that these automatic metrics can correctly predict the human ranking of a pair of translations more than half the time. GTM correlates better with human evaluation than BLEU and TER at sentence level in Chinese output evaluation. Similar findings have been reported by Cahill (2009) for German, in a study that compared six metrics including the three used in this paper. In addition, Agarwal and Lavie (2008) also noted that GTM and TER can produce more reliable sentence-level scores than BLEU.
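A sketch of the pair-wise consistency computation, again over the hypothetical data layout above. How human ties and tied metric scores are treated is not specified in the paper, so the conventions marked in the comments are assumptions.

    from itertools import combinations

    def pairwise_consistency(results, metric, evaluator):
        """Fraction of translation pairs for which the automatic metric and one
        human evaluator prefer the same output (in the style of Callison-Burch
        et al., 2008)."""
        consistent, total = 0, 0
        for group in results:
            for a, b in combinations(group, 2):        # 6 pairs per group of four
                rank_a, rank_b = a.human_ranks[evaluator], b.human_ranks[evaluator]
                if rank_a == rank_b:                   # human tie: skipped in this sketch
                    continue
                score_a, score_b = a.auto_scores[metric], b.auto_scores[metric]
                if metric == "TER":                    # lower TER = better translation
                    score_a, score_b = -score_a, -score_b
                human_prefers_a = rank_a < rank_b      # rank 1 = best
                metric_prefers_a = score_a > score_b
                # Pairs with tied metric scores are counted as inconsistent here.
                if score_a != score_b and human_prefers_a == metric_prefers_a:
                    consistent += 1
                total += 1
        return consistent / total if total else 0.0

    # e.g. pairwise_consistency(results, "BLEU", "L2")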
4. Further Analysis
As shown in Table 2, even for the best-correlated metric, GTM, there is only 66% consistency, indicating a large amount of discrepancy between humans and automatic evaluation metrics in ranking the quality of different translations. In order to further investigate this consistency and inconsistency at sentence level, we conducted a micro-analysis of the cases where humans and automatic metrics agree or disagree on the rankings of two translations.

Given two translations of a source sentence, each of which is associated with an automatic score, the two scores can suggest a difference in the quality of the two translations. However, humans may or may not agree with the difference registered by the automatic metrics. Nevertheless, intuitively, the greater the difference between the two scores, the more likely it is that humans will agree with the automatic metrics. To examine this, for any two translations of a source sentence, the difference between the two corresponding automatic evaluation scores can be assigned to one of a set of difference scales, as shown in Table 3; for example, a pair whose GTM scores differ by 0.15 falls into the 0.1-0.2 scale.

Scale      GTM    TER   BLEU
0.9-1.0      /      /     18
0.8-0.9      /      /      7
0.7-0.8      /      /     28
0.6-0.7      /      4     35
0.5-0.6      4     11     58
0.4-0.5     12     52    137
0.3-0.4     73    127    201
0.2-0.3    232    278    261
0.1-0.2    627    659    364
0.0-0.1   1484   1026    776

Table 3: Number of Pairs Distributed in each Difference Scale of each Automatic Metric ("/" indicates that no pairs fall into that scale)
Table 3 shows that the difference between the automatic scores of two different translations is mostly quite small. For example, 61.02% of the pairs for which GTM assigns different scores have a difference below 0.1; the corresponding figures are 47.57% for TER and 41.17% for BLEU.

It is worth pointing out that the scales refer to the difference between the two scores of a pair of outputs, not to the scale of the scores themselves. The purpose of setting up these difference scales is to see whether the greater the difference between two scores, the more likely humans are to agree with the automatic metrics. For each of the three automatic evaluation metrics, we consider the following three scenarios: 1) the number of pairs for which the human rankings are consistent with the scores assigned to the translations by the automatic metric ("Humans Agree"); 2) the number of pairs for which the human rankings are contrary to the scores assigned by the automatic metric ("Humans Disagree"); 3) the number of pairs for which, although the two translations are different and received two different automatic scores, humans do not consider them qualitatively different and rank the pair as a tie ("Humans Assign Ties") (see Figures 2, 3 and 4).
Figure 2: Distribution of Human Evaluation within GTM Difference Scales
Figure 3: Distribution of Human Evaluation within TER Difference Scales
Figure 4: Distribution of Human Evaluation within BLEU Difference Scales

The height of the solid grey bars in Figures 2 to 4 shows that for GTM (Figure 2) it is indeed the case that the greater the difference between two automatic scores, the more often humans agree with the judgements of GTM, while the smaller the difference, the more often humans disagree with them. By contrast, even with very large TER or BLEU score differences, humans may still disagree with the judgement of TER (Figure 3) or BLEU (Figure 4). In this experiment, when the difference between two GTM scores is greater than 0.11, the majority of the human evaluators agree with the judgement of the GTM score about which translation is better. On average, the difference between two TER scores or two BLEU scores has to be greater than 0.18 and 0.29 respectively before the majority of the human evaluators agree with the judgement of these automatic metrics.
Figures 2 to 4 also show that different evaluators apply different criteria when judging the quality of different translations. As can be seen from the figures, L3 assigned many more ties in the pair-wise comparison than the other evaluators. The inter-evaluator agreement among the four human evaluators was measured using the Kappa coefficient (K), a measurement of agreement on categorical data (Boslaugh & Watters, 2008). One widely accepted interpretation of Kappa was proposed by Landis and Koch (1977): 0-.2 is slight agreement, .2-.4 is fair, .4-.6 is moderate, .6-.8 is substantial and .8-1 is almost perfect agreement. Using the Microsoft Kappa Calculator template (King, 2004), the inter-evaluator agreement among the four human evaluators is K = .273. Excluding evaluator L3, the K value increases to .381.
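The paper obtained its Kappa values with the Microsoft Kappa Calculator template (King, 2004). As an illustration only, a common multi-rater formulation, Fleiss' kappa, is sketched below, assuming each translation pair is an item and each evaluator assigns it one of three categories (first output better, second better, tie); this is not necessarily the formula implemented by that template.

    def fleiss_kappa(items):
        """Fleiss' kappa for multi-rater agreement on categorical judgements.

        `items` is a list of dicts, one per rated item (here: one per translation
        pair), mapping a category label (e.g. "A better", "B better", "tie") to
        the number of evaluators who chose it.  Every item must be rated by the
        same number of evaluators.
        """
        n_items = len(items)
        n_raters = sum(items[0].values())
        categories = {c for item in items for c in item}

        # Mean per-item observed agreement.
        p_bar = sum(
            (sum(v * v for v in item.values()) - n_raters)
            / (n_raters * (n_raters - 1))
            for item in items
        ) / n_items

        # Chance agreement from the overall category proportions.
        p_e = sum(
            (sum(item.get(c, 0) for item in items) / (n_items * n_raters)) ** 2
            for c in categories
        )
        return (p_bar - p_e) / (1 - p_e)

    # e.g. fleiss_kappa([{"A better": 3, "tie": 1}, {"B better": 4}, ...])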
Generally speaking, even when there are only slight differences between two translations, the automatic metrics can generate different scores for them. However, there are also cases where the automatic scores are the same for two different translations. In this experiment, we found that for some pairs of different translations to which the automatic metrics assigned identical scores, humans did not consider the translations qualitatively different either. On the other hand, there are other pairs that were judged qualitatively different by humans but not by the automatic metrics. For each automatic metric, we counted the pairs that received the same automatic scores but different rankings from the human evaluators. As there are four human evaluators, only those pairs that were differentiated by the majority of the human evaluators (i.e. three or more evaluators assigned different rankings to the two translations in a pair) were taken into consideration. Table 4 contains the total number of pairs for which the automatic metrics made no differentiation but humans did.

         GTM   TER   BLEU
# pairs  141   209   331

Table 4: Number of Pairs of Translations Differentiated by Humans but not by the Automatic Metrics
GTM appears to have the smallest number of such undifferentiated pairs, demonstrating a stronger ability to differentiate translations at sentence level, more in line with the human evaluation, while BLEU left a large number of pairs undifferentiated, showing its weakness at sentence-level evaluation relative to the human evaluation. This finding shows that in some cases automatic evaluation cannot reflect differences between two translations that are apparent in the human assessments. Hence, if two scores show no difference, it does not always mean that there is no qualitative difference between the two translations.
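The counts in Table 4 could be reproduced with a sketch like the following (hypothetical data layout as above; "differentiated by humans" means three or more of the four evaluators ranked the two translations differently, following the paper).

    from itertools import combinations

    def undifferentiated_by_metric(results, metric,
                                   evaluators=("L1", "L2", "L3", "L4")):
        """Count pairs that receive identical scores from `metric` but are
        ranked differently by a majority (three or more) of the evaluators."""
        count = 0
        for group in results:
            for a, b in combinations(group, 2):
                if a.auto_scores[metric] != b.auto_scores[metric]:
                    continue                    # the metric does differentiate this pair
                differentiating = sum(
                    a.human_ranks[e] != b.human_ranks[e] for e in evaluators
                )
                if differentiating >= 3:        # majority of the four evaluators
                    count += 1
        return count

    # e.g. {m: undifferentiated_by_metric(results, m) for m in ("GTM", "TER", "BLEU")}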
5. Conclusion and Future Work
It is well known that accurate automatic evaluation metrics at sentence level can help MT developers determine which sentence structures their MT system can or cannot deal with appropriately. This study examines the correlation between automatic evaluation and human evaluation at sentence level for Chinese translation evaluation. Several conclusions can be drawn: first, for the evaluation of Chinese translations of English technical documents, GTM correlates better with human evaluation at sentence level than TER and BLEU do; second, only when the difference between two scores is greater than a certain value will the majority of human evaluators agree with the judgement of the automatic metrics; third, when the automatic scores of two translations are the same, it does not always mean that there is no qualitative difference between the translations. Some questions remain unanswered: first, the statistical significance of the correlation and consistency figures has not been examined; second, we are aware that the correlation between human and automatic evaluation may vary depending on the MT system involved, but no such distinction was made in this study. There is therefore considerable further work to be done. In addition, we have shown that for a considerable number of pairs, human judgements are inconsistent with the automatic metrics. In future work we plan to analyse the causes of such discrepancies in an attempt to identify linguistically motivated patterns that may benefit the design of automatic metrics. Finally, although human evaluation has been regarded as the gold standard in MT evaluation, the results in this paper also reveal some problems with human evaluation. How to standardize human evaluation is another question worth exploring in the future.

Acknowledgement
This work was financed by Enterprise Ireland and Symantec Corporation (Ireland). The author would like to thank Dr. Fred Hollowood for his inspiring ideas and suggestions, and Dr. Sharon O'Brien, Dr. Minako O'Hagan and Dr. Johann Roturier for their valuable corrections and comments. Thanks also to the anonymous reviewers for their insightful comments. The author is responsible for any remaining errors.

References
Agarwal, A. & Lavie, A. (2008). 'Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 115-118.
Banerjee, S. & Lavie, A. (2005). 'METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments'. In Proceedings of the ACL-2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan, pp. 65-72.
Boslaugh, S. & Watters, P.A. (2008). 'Statistics in a Nutshell'. O'Reilly Media, Inc., United States of America.
Cahill, A. (2009). 'Correlating Human and Automatic Evaluation of a German Surface Realiser'. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Suntec, Singapore, August, pp. 97-100.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C. & Schroeder, J. (2008). 'Further Meta-evaluation of Machine Translation'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 70-106.
Coughlin, D. (2001). 'Correlating Automated and Human Assessments of Machine Translation Quality'. In Proceedings of MT Summit IX, Santiago de Compostela, Spain, September, pp. 63-70.
Duh, K. (2008). 'Ranking vs. Regression in Machine Translation Evaluation'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 191-194.
King, J.E. (2004). 'Software Solutions for Obtaining a Kappa-type Statistic for Use with Multiple Raters'. Presented at the Annual Meeting of the Southwest Educational Research Association, Dallas, TX.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. & Herbst, E. (2007). 'Moses: Open Source Toolkit for Statistical Machine Translation'. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, June, pp. 177-180.
Landis, J.R. & Koch, G.G. (1977). 'The Measurement of Observer Agreement for Categorical Data'. Biometrics, 33:159-174.
LDC (2005). Linguistic Data Annotation Specification: Assessment of fluency and adequacy in translations. http://projects.ldc.upenn.edu/TIDES/tidesmt.html.
Lin, C. & Och, F.J. (2004). 'ORANGE: A Method for Evaluating Automatic Evaluation Metrics for Machine Translation'. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, August, pp. 501-508.
Papineni, K., Roukos, S., Ward, T. & Zhu, W. (2001). 'BLEU: A Method for Automatic Evaluation of Machine Translation'. Research Report RC22176 (W0109-022), IBM T.J. Watson Research Center, September.
Russo-Lassner, G., Lin, J. & Resnik, P. (2005). 'A Paraphrase-Based Approach to Machine Translation Evaluation'. Technical report, University of Maryland, College Park.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. & Weischedel, R. (2006). 'A Study of Translation Edit Rate with Targeted Human Annotation'. In Proceedings of AMTA, Cambridge, MA, August, pp. 223-231.
Snover, M., Madnani, N., Dorr, B.J. & Schwartz, R. (2009). 'Fluency, Adequacy, or HTER? Exploring Different Human Judgments with a Tunable MT Metric'. In Proceedings of the EACL-2009 Workshop on Statistical Machine Translation (WMT09), Athens, pp. 259-268.
Turian, J.P., Shen, L. & Melamed, I.D. (2003). 'Evaluation of Machine Translation and its Evaluation'. In Proceedings of MT Summit IX, New Orleans, LA, September, pp. 386-393.
Vilar, D., Leusch, G., Ney, H. & Bachs, R. (2007). 'Human Evaluation of Machine Translation Through Binary System Comparisons'. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, June, pp. 96-103.