
Mining the Correlation between Human and Automatic Evaluation at Sentence Level
Yanli Sun
School of Applied Language and Intercultural Studies, Dublin City University
[email protected]

Abstract
Automatic evaluation metrics are fast and cost-effective measurements of the quality of a Machine Translation (MT) system. However,
as humans are the end-users of MT output, human judgement is the benchmark for assessing the usefulness of automatic evaluation metrics.
While most studies report the correlation between human evaluation and automatic evaluation at corpus level, our study examines their
correlation at sentence level. In addition to statistical correlation scores, such as Spearman's rank-order correlation coefficient, a
finer-grained and more detailed examination of the sensitivity of the automatic metrics compared to human evaluation is also reported in this
study. The results show that the threshold at which human evaluators agree with the judgements of an automatic metric varies from metric to
metric at sentence level. Even when the automatic scores for two translations are greatly different, human evaluators may
consider the translations to be qualitatively similar, and vice versa. The detailed analysis of the correlation between automatic and
human evaluation allows us to determine with increased confidence whether an increase in an automatic score will be endorsed by human
evaluators or not.

1. Introduction

It is widely recognized that evaluation plays an important role in the development of language technologies. In the area of Machine Translation (MT), there are two types of commonly used evaluation methods. While human evaluation is still the most important means of providing valuable feedback on the further development of an MT system, its costly, labour-intensive and highly subjective nature has led to the popularity of automatic evaluation metrics, such as BLEU (Bilingual Evaluation Understudy) (Papineni et al., 2001), Precision and Recall (Turian et al., 2003), and TER (Translation Error Rate) (Snover et al., 2006). According to Coughlin (2001), automatic metrics have the advantages of high speed, convenience and comparatively lower cost. However, as humans are the end-users of MT, human judgement is ultimately the benchmark for assessing the usefulness of automatic metrics. How good an automatic metric is depends on its correlation with human evaluation. The two major forms of human evaluation in the area of MT are: scoring, which requires human evaluators to assign two scores (usually 1 to 5) representing the fluency and accuracy of a translation (LDC, 2005); and ranking, which asks human evaluators to compare the translations from different MT systems and assign rankings to them. The problem with scoring is that even with a clear guideline at hand, human evaluators still find it hard to assign appropriate scores to a translation. Ranking, on the other hand, is found to be quite intuitive and reliable (Vilar et al., 2007). Callison-Burch et al. (2008) concluded from their study that ranking was more reliable than scoring. Duh (2008) also pointed out that ranking could simplify the decision procedure for human evaluators compared to assigning scores.

Depending on the type of human evaluation used, the correlation between automatic and human evaluation is measured either by Pearson's correlation coefficient or by Spearman's correlation coefficient. The correlation value ranges from -1 to 1, representing negative correlation to perfect positive correlation.

As automatic metrics are more effective at corpus level, more effort has been devoted to finding out which automatic metric correlates better with human evaluation at corpus level. Nevertheless, increasing attention is being paid to correlation at sentence level. According to Lin and Och (2004), high sentence-level correlation between automatic and human evaluation is crucial for machine translation researchers. Russo-Lassner et al. (2005) also pointed out that automatic metrics with high sentence-level correlation could "provide a finer-grained assessment of translation quality" and could also "guide MT system development by offering feedback on sentences that are particularly challenging" (p. 3).

This paper extends the research on correlation at sentence level, aiming to find out which automatic metric correlates better with human evaluation for Chinese translation from English; our second aim is to investigate how big the difference between two automatic scores has to be in order to reflect a qualitative change in the translations. The remainder of the paper is organized as follows: Section two introduces the experiment setting; Section three reports the correlation between automatic and human evaluation at sentence level; Section four examines the detailed differences between the judgements of automatic evaluation and human evaluation; and Section five summarizes the findings and points out future research questions.

2. Experiment Setting

The automatic evaluation and human evaluation results reported in this paper were collected from an experiment comparing Chinese translations from different MT systems. However, the focus of this paper is to examine the correlation between human evaluation and automatic evaluation, not to discuss the translation quality per se. The corpus is an installation manual for anti-virus software composed in English from Symantec (Ireland).

Altogether, 570 sentences were randomly selected as the test sample. The Chinese reference for the test sample was extracted from the company's Translation Memory. Four MT systems (one rule-based system and three statistical systems) were employed to translate the test sample into Chinese for comparison. Both human and automatic evaluations were applied in order to rank the quality of the output from the four systems. Four professional translators were employed to rank the outputs from 1 to 4 (1 being the best, 4 being the worst) sentence by sentence. BLEU, TER and GTM (General Text Matcher, an implementation of precision and recall) were used to obtain the automatic scores for each translation at both corpus level and sentence level. The reasons for using these three metrics are: first, they can be used (and have been used) to evaluate Asian-language outputs (in this paper, Chinese); second, they are among the most widely used metrics in the area; third, they are relatively easy and cost-effective to use. There are also many other automatic metrics, such as Meteor (Banerjee & Lavie, 2005) and TERp (Snover et al., 2009). However, additional resources are needed to get the best advantage from these metrics. For example, Meteor functions better with a database of synonyms, such as WordNet for English, and TERp requires paraphrases, which function as "synonyms" of phrases. Since these resources were not available for Chinese in our pilot project, these metrics were not employed in this paper. The next section compares the scores from the automatic metrics with the rankings from the human evaluators to check how consistent the two evaluation methods are at sentence level; a more detailed analysis follows in Section four.
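For concreteness, the short sketch below shows one way a sentence-level score of this kind might be computed in Python, using NLTK's smoothed sentence-level BLEU over character-tokenized Chinese. The tokenization choice and the example strings are assumptions made purely for illustration; the experiment itself does not specify its tooling, and GTM and TER would be computed with their own implementations.

    # A minimal sketch of sentence-level scoring (illustrative only): NLTK's
    # smoothed BLEU over character tokens, one common workaround for Chinese.
    # The paper does not state how its scores were actually produced.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    def sentence_bleu_zh(reference, hypothesis):
        # Character-level tokenization is an assumption for this example.
        ref_tokens = list(reference)
        hyp_tokens = list(hypothesis)
        smoother = SmoothingFunction().method1   # avoids zero scores on short segments
        return sentence_bleu([ref_tokens], hyp_tokens, smoothing_function=smoother)

    # Hypothetical reference and four system outputs for one source sentence.
    reference = "安装防病毒软件"
    outputs = ["安装防病毒软件", "安装杀毒软件", "防病毒软件安装", "安装软件"]
    scores = [sentence_bleu_zh(reference, out) for out in outputs]
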
3. Correlation Check

The correlation between automatic evaluation and human evaluation at sentence level was obtained following the practice of Callison-Burch et al. (2008). As mentioned earlier, we have 570 source English sentences translated into Chinese by four MT systems. Therefore, for each source English sentence, four translations are produced, which are ranked by the four professional translators and scored by the three automatic evaluation metrics. In other words, there are 570 groups (with four items per group), each of which contains four columns of rankings from the four human evaluators and three columns of scores from the three automatic metrics. Figure 1 below shows a sample of the final results sheet; L1, L2, L3 and L4 in Figure 1 refer to the four human evaluators.

Figure 1: Sample of the Final Results Sheet

One approach to computing the correlation is Spearman's rank correlation coefficient (ρ). The procedure is as follows: first, the scores assigned by the automatic metrics are converted into rankings as well; second, for each of the 570 groups, the ρ value between each automatic metric and each human evaluator is calculated over the four items; third, all the ρ values are averaged to obtain the mean ρ value between each metric and each human evaluator. Table 1 below reports the correlation values obtained with this method.

          L1      L2      L3      L4      Average
GTM       0.32    0.50    0.14    0.26    0.30
TER       0.33    0.48    0.12    0.24    0.29
BLEU      0.34    0.44    0.13    0.26    0.29

Table 1: Spearman's Correlation between Automatic and Human Evaluation
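As an illustration of the averaging procedure just described, the following sketch computes the mean per-group Spearman ρ between one metric and one evaluator. The data layout (a list of 570 groups keyed by metric and evaluator names) and the sign handling are assumptions made for the example, not details taken from the results sheet.

    # A sketch of the per-group Spearman averaging described above. Each group
    # is assumed to hold the four metric scores and four human ranks for one
    # source sentence; the layout and names are illustrative.
    import numpy as np
    from scipy.stats import spearmanr

    def mean_group_rho(groups, metric, judge):
        rhos = []
        for group in groups:
            scores = group[metric]          # higher = better (TER, an error rate,
                                            # would be negated beforehand)
            ranks = group[judge]            # human ranks 1-4, where 1 is best
            # Negate the ranks so both sequences run in the same direction.
            rho, _ = spearmanr(scores, [-r for r in ranks])
            rhos.append(rho)
        return float(np.nanmean(rhos))      # rho is undefined when all four items tie

    # Reproducing Table 1 would then be, schematically:
    # {(m, j): mean_group_rho(groups, m, j)
    #  for m in ("GTM", "TER", "BLEU") for j in ("L1", "L2", "L3", "L4")}
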
However, the validity of this approach was questioned by Callison-Burch et al. (2008), who argued that obtaining a general correlation value by averaging ρ values computed from such a limited number of items (here only four) is not appropriate. Instead, in their study, they conducted a pair-wise comparison of any two outputs, examining whether the automatic scores were consistent with the human rankings for each pair (that is, whether the higher-ranked translation also received the better score). Following this approach, the 570 groups were expanded into 3420 pairs (each of the 570 groups can be expanded into 6 pairs). For each automatic metric, the total number of consistent evaluations was divided by the total number of comparisons to get a percentage. Table 2 reports the consistency.

          L1      L2      L3      L4      Average
GTM       0.61    0.68    0.71    0.66    0.66
TER       0.58    0.64    0.70    0.64    0.64
BLEU      0.51    0.55    0.65    0.59    0.56

Table 2: Consistency of Automatic Evaluation with Human Evaluation

Table 2 indicates that these automatic metrics can correctly predict the human ranking of a pair of translations more than half the time. GTM correlates better with human evaluation than BLEU and TER at sentence level in Chinese output evaluation. Similar findings were reported by Cahill (2009) for German, in a study comparing six metrics including the three used in this paper. In addition, Agarwal and Lavie (2008) also observed that GTM and TER can produce more reliable sentence-level scores than BLEU.
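The pair-wise consistency check can be sketched as follows, on the same illustrative data layout as above. The tie handling (pairs with tied scores or tied ranks are simply counted as inconsistent) is a simplifying assumption, since the paper does not spell it out, and TER scores are again assumed to have been negated so that higher means better.

    # A sketch of the pair-wise consistency computation behind Table 2.
    from itertools import combinations

    def consistency(groups, metric, judge):
        consistent = total = 0
        for group in groups:
            scores, ranks = group[metric], group[judge]
            for i, j in combinations(range(4), 2):      # the 6 pairs per group
                total += 1
                if scores[i] == scores[j] or ranks[i] == ranks[j]:
                    continue                            # ties counted as inconsistent (assumption)
                if (scores[i] > scores[j]) == (ranks[i] < ranks[j]):  # rank 1 = best
                    consistent += 1
        return consistent / total       # 570 groups give 3420 comparisons in total
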

4. Further Analysis

As shown in Table 2, even for the best-correlated metric, GTM, there is only 66% consistency, indicating a large amount of discrepancy between humans and the automatic evaluation metrics in ranking the quality of different translations. In order to investigate the consistency and inconsistency at sentence level further, we conducted a micro-analysis of the cases where humans and the automatic metrics agree or disagree on the rankings of two translations.
Given two translations of a source sentence, each of which is associated with an automatic score, the two scores suggest a difference in the quality of the two translations. However, humans may or may not agree with the difference registered by the automatic metrics. Nevertheless, intuitively, the greater the difference between the two automatic scores of two translations, the more likely it is that the scores predict the human judgement of the quality of the two translations. Based on this consideration, for any pair of translations of a source sentence, the difference between the two corresponding automatic evaluation scores can be assigned to one of a set of difference scales. For example, if the GTM scores for two translations are 0.64 and 0.53 respectively, the difference between these GTM scores (0.11) falls into the difference scale 0.1-0.2. As mentioned in Section 3, there are altogether 3420 pairs for comparison. For each automatic metric, the difference in scores within each pair was collected and categorized into the scales. Table 3 reports the number of pairs falling into each difference scale for each automatic metric.
Difference Scale    GTM (#pairs)    TER (#pairs)    BLEU (#pairs)
0.9-1.0                  /               /               18
0.8-0.9                  /               /                7
0.7-0.8                  /               /               28
0.6-0.7                  /               4               35
0.5-0.6                  4              11               58
0.4-0.5                 12              52              137
0.3-0.4                 73             127              201
0.2-0.3                232             278              261
0.1-0.2                627             659              364
0.0-0.1               1484            1026              776

Table 3: Number of Pairs Distributed in each Difference Scale of each Automatic Metric

Table 3 shows that the difference between the automatic scores of two different translations is mostly quite small. For example, of the pairs whose GTM scores differ, 61.02% have a difference below 0.1; the corresponding proportions are 47.57% for TER and 41.17% for BLEU.

It is worth pointing out that the scales refer to the difference between the two scores of a pair of outputs, not to the scale of the scores themselves. The purpose of setting up these difference scales is to see whether the greater the difference between two scores, the more likely humans are to agree with the automatic metrics. For each of the three automatic evaluation metrics, we consider the following three scenarios: 1) the number of pairs for which the human rankings are consistent with the scores assigned to the translations by the automatic metric ("Humans Agree"); 2) the number of pairs for which the human rankings are contrary to the scores assigned by the automatic metric ("Humans Disagree"); and 3) the number of pairs whose two translations received two different automatic scores but which humans do not consider qualitatively different and rank as ties ("Humans Assign Ties") (see Figures 2, 3 and 4).

Figure 2: Distribution of Human Evaluation within GTM Difference Scales
Figure 3: Distribution of Human Evaluation within TER Difference Scales
Figure 4: Distribution of Human Evaluation within BLEU Difference Scales
The height of the solid grey bars in Figures 2 to 4 shows that for GTM (Figure 2), it is indeed the case that the greater the difference between two automatic scores, the more often humans agree with the judgements of GTM, and the smaller the difference, the more often humans disagree with the judgements of GTM. On the contrary, even with very large TER or BLEU score differences, humans may still disagree with the judgement of TER (Figure 3) or BLEU (Figure 4). In this experiment, when the difference between two GTM scores is bigger than 0.11, the majority of the human evaluators agree with the judgement of the GTM score about which translation is better. The average difference between two TER scores and two BLEU scores has to be bigger than 0.18 and 0.29, respectively, before the majority of the human evaluators agree with the judgement of these automatic metrics.
Figures 2 to 4 also reflect that different evaluators apply different criteria in judging the quality of different translations. As can be seen from the figures, L3 assigned many more ties in the pair-wise comparison than the other evaluators. The inter-evaluator agreement among the four human evaluators was measured using the Kappa coefficient (K), a measurement of the agreement between categorical data (Boslaugh & Watters, 2008). One widely accepted interpretation of Kappa was proposed by Landis and Koch (1977): 0-0.2 is slight agreement, 0.2-0.4 is fair agreement, 0.4-0.6 is moderate agreement, 0.6-0.8 is substantial agreement and 0.8-1 is almost perfect agreement. Using the Microsoft Kappa Calculator template (King, 2004), the inter-evaluator agreement score between the four human evaluators is K = .273. Excluding human evaluator L3, the K value increases to .381.
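The kappa value itself was obtained with the Microsoft Kappa Calculator template (King, 2004); purely for illustration, the sketch below computes a Fleiss-style multi-rater kappa over the rank labels 1-4, which may differ in detail from the statistic produced by that template.

    # Illustrative only: a Fleiss-style multi-rater kappa over the rank labels
    # 1-4. The value reported in the paper came from the King (2004) template
    # and may be defined slightly differently.
    import numpy as np

    def fleiss_kappa(labels):
        # labels: array of shape (n_items, n_raters) holding categorical labels.
        labels = np.asarray(labels)
        n_items, n_raters = labels.shape
        categories = np.unique(labels)
        # counts[i, c]: how many raters assigned category c to item i
        counts = np.stack([(labels == c).sum(axis=1) for c in categories], axis=1)
        p_cat = counts.sum(axis=0) / (n_items * n_raters)   # category proportions
        p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
        p_bar, p_exp = p_item.mean(), (p_cat ** 2).sum()
        return (p_bar - p_exp) / (1 - p_exp)

    # Each of the 570 x 4 = 2280 translations is one item; its labels are the
    # ranks (1-4) given by the four evaluators, i.e. an array of shape (2280, 4).
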
Generally speaking, even if there are only slight differences between two translations, the automatic metrics can generate different scores for them. However, there are also cases where the automatic scores are the same for two different translations. In this experiment, we found that for some pairs of different translations to which the automatic metrics assigned the same scores, humans did not consider them qualitatively different either. On the other hand, some other translations were evaluated as qualitatively different by humans but not by the automatic metrics. For each automatic metric, we summed the number of pairs that received the same scores from the automatic evaluation but different rankings from the human evaluators. As there are four human evaluators, only those pairs that were differentiated by the majority of the human evaluators (i.e. three or more evaluators assigned different rankings to the translations in a pair) were taken into consideration. Table 4 contains the total number of pairs where no differentiation was made by the automatic metrics but where humans differentiated.
          GTM     TER     BLEU
#pairs    141     209     331

Table 4: Number of Pairs of Translations Differentiated by Humans but not by the Automatic Metrics
GTM has the smallest number of such undifferentiated pairs, demonstrating a stronger differentiating ability at sentence level that is more in line with the human evaluation, while BLEU left a large number of pairs undifferentiated, showing its weakness in sentence-level evaluation relative to the human evaluation. This finding shows that in some cases automatic evaluation cannot reflect differences between two translations that are apparent according to the human assessments. Hence, if two scores show no sign of a difference, it does not always indicate that there is no qualitative difference between the two translations.
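The count behind Table 4 can be sketched as follows, again on the illustrative data layout used above: pairs whose two automatic scores are identical but which at least three of the four evaluators ranked differently.

    # A sketch of the count in Table 4: pairs given identical scores by a metric
    # but ranked differently by a majority (three or more) of the evaluators.
    from itertools import combinations

    def undifferentiated_but_distinct(groups, metric, judges=("L1", "L2", "L3", "L4")):
        count = 0
        for group in groups:
            scores = group[metric]
            for i, j in combinations(range(4), 2):
                if scores[i] != scores[j]:
                    continue                    # the metric does differentiate this pair
                differing = sum(group[judge][i] != group[judge][j] for judge in judges)
                if differing >= 3:
                    count += 1
        return count                            # e.g. 141 for GTM, 209 for TER, 331 for BLEU
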
5. Conclusion and Future Work

It is well known that precise automatic evaluation metrics at sentence level can help MT developers determine what sentence structures their MT system can or cannot deal with appropriately. This study examines the correlation of automatic evaluation and human evaluation at sentence level in terms of Chinese translation evaluation. Several conclusions have been drawn from this study: first, for the evaluation of Chinese translations of an English technical document, GTM correlates better with human evaluation than TER and BLEU do at sentence level; second, only when the difference between two scores is greater than a certain value will the majority of human evaluators agree with the judgement of the automatic metrics; third, when the two automatic scores of two translations are the same, it does not always mean that there is no qualitative difference between the translations. There are also questions that remain unanswered: first, the statistical significance of the correlation and consistency figures has not been examined; second, we are aware that the correlation between human and automatic evaluation may vary depending on the MT system involved, but no such distinction was made in this study. Therefore, there is a lot of further work to be done in the future. In addition, we have shown that for a considerable number of pairs, human judgements are inconsistent with the automatic metrics. In the future, we plan to conduct a further analysis of the causes of such discrepancies in an attempt to provide some linguistically motivated patterns that may benefit the design of the automatic metrics. Finally, although human evaluation has been regarded as the gold standard in MT evaluation, the results in this paper reflect some problems with human evaluation. How to standardize human evaluation is another question worth exploring in the future.

Acknowledgement
This work was financed by Enterprise Ireland and Symantec Corporation (Ireland). The author would like to thank Dr. Fred Hollowood for his inspiring ideas and suggestions, and Dr. Sharon O'Brien, Dr. Minako O'Hagan and Dr. Johann Roturier for their precious corrections and comments. Thanks also go to the anonymous reviewers for their insightful comments. However, the author is responsible for any errors in the paper.

References
Agarwal, A. & Lavie, A. (2008). 'Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output'. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio, June, pp. 115-118.
Banerjee, S. & Lavie, A. (2005). 'METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments'. In Proceedings of the ACL-2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan, pp. 65-72.
Boslaugh, S. & Watters, P.A. (2008). 'Statistics in a Nutshell'. O'Reilly Media, Inc., United States of America.
Cahill, A. (2009). 'Correlating Human and Automatic Evaluation of a German Surface Realiser'. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, Suntec, Singapore, August, pp. 97-100.
Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., & Schroeder, J. (2008). 'Further Meta-evaluation of Machine Translation'. In Proceedings of the Third

Workshop on Statistical Machine Translation,
Columbus, Ohio, June, pp. 70-106.
Coughlin, D. (2001). 'Correlating Automated and Human
Assessments of Machine Translation Quality'. In
Proceedings of MT Summit IX, Santiago de
Compostela, Spain, September, pp. 63-70.
Duh, K. (2008). 'Ranking vs. Regression in Machine
Translation Evaluation'. In Proceedings of the Third
Workshop on Statistical Machine Translation,
Columbus, Ohio, June, pp.191–194.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C.,
Federico, M., Bertoldi, N., Cowan, B., Shen, W.,
Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin,
A.& Herbst, E. (2007). "Moses: Open Source Toolkit
for Statistical Machine Translation". In Proceedings of
Annual Meeting of the Association for Computational
Linguistics (ACL), demonstration session, Prague,
June, pp.177-180.
King, J. E. (2004). 'Software Solutions for Obtaining a
Kappa-type Statistic for Use with Multiple Raters'.
Presented at the Annual Meeting of the Southwest
Educational Research Association, Dallas, TX.
Landis, J.R. & Koch, G.G. (1977). 'The Measurement of
Observer Agreement for Categorical Data'. Biometrics,
33:159-174.
LDC (2005). Linguistic Data Annotation Specification:
Assessment of fluency and adequacy in translations.
http://projects.ldc.upenn.edu/TIDES/tidesmt.html.
Lin, C. & Och, F.J. (2004). 'ORANGE: A Method for
Evaluating Automatic Evaluation Metrics for Machine
Translation'. In Proceedings of the 20th International
Conference on Computational Linguistics, Geneva,
Switzerland, August, pp. 501-508.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2001).
'BLEU: A Method for Automatic Evaluation of
Machine Translation'. Research Report RC22176
(W0109-022), IBM T.J.Watson Research Center,
September.
Russo-Lassner, G., Lin, J. & Resnik, P. (2005). 'A
Paraphrase-Based Approach to Machine Translation
Evaluation'. Technical report, University of Maryland,
College Park.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L. &
Weischedel, R. (2006). 'A Study of Translation Edit
Rate with Targeted Human Annotation'. In
Proceedings of AMTA, Cambridge, MA, August,
pp.223-231.
Snover, M., Madnani, N., Dorr, B.J. & Schwartz, R.
(2009). 'Fluency, Adequacy, or HTER? Exploring
Different Human Judgments with a Tunable MT
Metric'. In Proceedings of the EACL-2009 Workshop
on Statistical Machine Translation (WMT09), Athens,
pp. 259-268.
Turian, J.P., Shen, L., & Melamed, I.D. (2003).
'Evaluation of Machine Translation and its Evaluation'.
In Proceedings of the MT Summit IX, New Orleans,
LA, September, pp. 386-393.
Vilar, D., Leusch, G., Ney, H., & Banchs, R.E. (2007).
'Human Evaluation of Machine Translation Through
Binary System Comparisons'. In Proceedings of the
Second Workshop on Statistical Machine Translation,
Prague, June, pp.96–103.

