
The price of debiasing automatic metrics in natural language evaluation

Arun Tejasvi Chaganty∗ and Stephen Mussmann∗ and Percy Liang


Computer Science Department, Stanford University
{chaganty,mussmann,pliang}@cs.stanford.edu

Abstract

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7–13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks—the automatic metric and the prompt shown to human evaluators—both of which need to be improved to obtain greater cost savings.

1 Introduction

In recent years, there has been an increasing interest in tasks that require generating natural language, including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočisky et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, the evaluation of these systems remains a thorny issue because of the diversity of possible correct responses. As the gold standard of performing human evaluation is often too expensive, there has been a large effort developing automatic metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004), METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). However, these have been shown to be biased, correlating poorly with human metrics across different datasets and systems (Liu et al., 2016b; Novikova et al., 2017).

Can we combine automatic metrics and human evaluation to obtain an unbiased estimate at lower cost than human evaluation alone? In this paper, we propose a simple estimator based on control variates (Ripley, 2009), where we average differences between human judgments and automatic metrics rather than averaging the human judgments alone. Provided the two are correlated, our estimator will have lower variance and thus reduce cost.

We prove that our estimator is optimal in the sense that no unbiased estimator using the same automatic metric can have lower variance. We also analyze its data efficiency (equivalently, cost savings)—the factor reduction in the number of human judgments needed to obtain the same accuracy versus naive human evaluation—and show that it depends solely on two factors: (a) the annotator variance (which is a function of the human evaluation prompt) and (b) the correlation between human judgments and the automatic metric. This factorization allows us to calculate typical and best-case data efficiencies and accordingly refine the evaluation prompt or automatic metric.

Finally, we evaluate our estimator on state-of-the-art systems from two tasks, summarization on the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) and open-response question answering on the MS MARCO v1.0 dataset (Nguyen et al., 2016). To study our estimators offline, we preemptively collected 10,000 human judgments which cover several tasks and systems.1 As predicted by the theory, we find that the data efficiency depends not only on the correlation between the human and automatic metrics, but also on the evaluation prompt. If the automatic metric had perfect correlation, our data efficiency would be around 3, while if we had noiseless human judgments, our data efficiency would be about 1.5. In reality, the reduction in cost we obtained was only about 10%, suggesting that improvements in both the automatic metric and the evaluation prompt are needed. As one case study in improving the latter, we show that, when compared to a Likert survey, measuring the amount of post-editing needed to fix a generated sentence reduced the annotator variance by three-fold.

∗ Authors contributed equally.
1 An anonymized version of this data and the annotation interfaces used can be found at https://bit.ly/price-of-debiasing.
[Figure 1: scatter plots of ROUGE-L against human judgment on the MS MARCO task. Panel (a): system-level correlation for the fastqa, fastqa ext, snet, and snet.ens systems. Panel (b): instance-level correlation for the fastqa system.]

Figure 1: (a) At a system level, automatic metrics (ROUGE-L) and human judgment correlate well, but (b) the instance-level correlation plot (where each point is a system prediction) shows that the instance-level correlation is quite low (ρ = 0.31). As a consequence, if we try to locally improve systems to produce better answers (one set of marked points in (a)), they do not significantly improve ROUGE scores, and vice versa (the other set).

2 Bias in automatic evaluation

It is well understood that current automatic metrics tend to correlate poorly with human judgment at the instance level. For example, Novikova et al. (2017) report correlations less than 0.3 for a large suite of word-based and grammar-based evaluation methods on a generation task. Similarly, Liu et al. (2016b) find correlations less than 0.35 for automatic metrics on a dialog generation task in one domain, and find that correlations with the same metric drop to less than 0.16 when it is used in another domain. Still, somewhat surprisingly, several automatic metrics have been found to have high system-level correlations (Novikova et al., 2017). What, then, are the implications of having a low instance-level correlation?

As a case study, consider the task of open-response question answering: here, a system receives a human-generated question and must generate an answer from some given context, e.g. a document or several webpages. We collected the responses of several systems on the MS MARCO v1 dataset (Nguyen et al., 2016) and crowdsourced human evaluations of the system output (see Section 4 for details).

The instance-level correlation (Figure 1b) is only ρ = 0.31. A closer look reveals that while ROUGE is able to correctly assign low scores to bad examples (lower left), it is bad at judging good examples and often assigns them low ROUGE scores (lower right)—see Table 1 for examples. This observation agrees with a finding reported in Novikova et al. (2017) that automatic metrics correlate better with human judgments on bad examples than on average or good examples.
(a) MS MARCO. Human annotators rated answer correctness (AnyCorrect); the automatic metric is ROUGE-L (higher is better).

Examples where the system is correct and ROUGE-L > 0.5 (19.6%, or 285 of 1455 unique responses):
  Q: what is anti-mullerian hormone
  Reference: Anti-Mullerian Hormone (AMH) is a protein hormone produced by granulosa cells (cells lining the egg sacs or follicles) within the ovary.
  System: it is a protein hormone produced by granulosa cells (cells lining the egg sacs or follicles) within the ovary. (snet.ens; ✓ / 0.86)

Examples where the system is incorrect and ROUGE-L > 0.5 (1.3%, or 19 of 1455 unique responses):
  Q: at what gestational age can you feel a fetus move
  Reference: 37 to 41 weeks (incorrect reference answer)
  System: 37 to 41 weeks (fastqa, fastqa.ext; ✗ / 1.0)

Examples where the system is correct and ROUGE-L < 0.5 (56.0%, or 815 of 1455 unique responses):
  Q: what is the definition of onomatopoeia
  Reference: It is defined as a word, which imitates the natural sounds of a thing.
  System: the naming of a thing or action by a vocal imitation of the sound associated with it (as buzz, hiss). (fastqa; ✓ / 0.23)

Examples where the system is incorrect and ROUGE-L < 0.5 (23.1%, or 336 of 1455 unique responses):
  Q: what kind root stem does a dandelion have
  Reference: Fibrous roots and hollow stem.
  System: vitamin a, vitamin c, vitamin d and vitamin b complex, as well as zinc, iron and potassium. (snet, snet.ens; ✗ / 0.09)

(b) CNN/Daily Mail. Human judgment scores are post-edit distance (Edit, lower is better); the automatic metric is sentence vector similarity with the reference (VecSim, higher is better).

Examples where system Edit < 0.3 and VecSim > 0.5 (53.9%, or 1078 of 2000 responses):
  Reference: Bhullar is set to sign a -day contract with the Kings. The -year-old will become the NBA's first player of Indian descent. Bhullar will be on the roster when the Kings host New Orleans Pelicans.
  System: Bhullar and The Kings are signing Bhullar to a -day contract. The -year-old will be on the roster on friday when David Wear's -season contract expires thursday. Bhullar is set to become the NBA's first player of Indian descent. (ml; 0.13 / 0.82)

Examples where system Edit > 0.3 and VecSim > 0.5 (18.0%, or 360 of 2000 responses):
  Reference: The Direct Marketing Commission probing B2C Data and Data Bubble. Investigating whether they breached rules on the sale of private data. Chief commissioner described allegations made about firms as 'serious'.
  System: Data obtained by the Mail's marketing commission said it would probe both companies over claims that they had breached the rules on the sale of private data. The FSA said it would probe both companies over claims they had breached the rules on the sale of private data. (seq2seq; 1.00 / 0.72)

Examples where system Edit < 0.3 and VecSim < 0.5 (14.5%, or 290 of 2000 responses):
  Reference: Death toll rises to more than . Pemba Tamang, , shows no apparent signs of serious injury after rescue. Six of Americans special forces helicopter , including Americans, to safety.
  System: Despite Nepal's tragedy, life triumphed in Kathmandu's hard-hit neighborhoods. Rescuers pulled an 15-year-old from the rubble of a multistory residential building. He was wearing a New York shirt and a blue neck brace. (pointer; 0.04 / 0.27)

Examples where system Edit > 0.3 and VecSim < 0.5 (13.6%, or 272 of 2000 responses):
  Reference: "Mad Men's" final seven episodes begin airing April . The show has never had high ratings but is considered one of the great TV series. It's unknown what will happen to characters, but we can always guess.
  System: 'This's "Mad Men" is the end of a series of an era', This he says. Stores have created fashion lines inspired by the show. "The Sopranos". The in the Kent State shootings in may or Richard Nixonś re-election.. (ml+rl; 0.95 / 0.24)

Table 1: Examples highlighting the different modes in which the automatic metric and human judgments may agree or disagree. On the MS MARCO task, a majority of responses from systems were actually correct but poorly scored according to ROUGE-L. On the CNN/Daily Mail task, a significant number of examples which are scored highly by VecSim are poorly rated by humans, and likewise many examples scored poorly by VecSim are highly rated by humans.

Thus, as Figure 1(a) shows, we can improve low-scoring ROUGE examples without improving their human judgment, and vice versa. Indeed, Conroy and Dang (2008) report that summarization systems were optimized for ROUGE during the DUC challenge (Dang, 2006) until they were indistinguishable from the ROUGE scores of human-generated summaries, but the systems had hardly improved on human evaluation. Hill-climbing on ROUGE can also lead to a system that does worse on human scores, e.g. in machine translation (Wu et al., 2016). Conversely, genuine quality improvements might not be reflected in improvements in ROUGE. This bias also appears in pool-based evaluation for knowledge base population (Chaganty et al., 2017). Thus the problems with automatic metrics clearly motivate the need for human evaluation, but can we still use the automatic metrics somehow to save costs?

3 Statistical estimation for unbiased evaluation

We will now formalize the problem of combining human evaluation with an automatic metric. Let X be a set of inputs (e.g., articles), and let S be the system (e.g. for summarization), which takes an x ∈ X and returns output S(x) (e.g. a summary). Let Z = {(x, S(x)) : x ∈ X} be the set of system predictions. Let Y(z) be the random variable representing the human judgment according to some evaluation prompt (e.g. grammaticality or correctness), and define f(z) = E[Y(z)] to be the (unknown) human metric corresponding to averaging over an infinite number of human judgments. Our goal is to estimate the average across all examples,

    µ = E_z[f(z)] = (1/|Z|) Σ_{z ∈ Z} f(z),    (1)

with as few queries to Y as possible.

Let g be an automatic metric (e.g. ROUGE), which maps z to a real number. We assume evaluating g(z) is free. The central question is how to use g in conjunction with calls to Y to produce an unbiased estimate µ̂ (that is, E[µ̂] = µ). In this section, we will construct a simple estimator based on control variates (Ripley, 2009) and prove that it is minimax optimal.

3.1 Sample mean

We warm up with the most basic unbiased estimate, the sample mean. We sample z^(1), ..., z^(n) independently with replacement from Z. Then, we sample each human judgment y^(i) = Y(z^(i)) independently.2 Define the estimator to be µ̂_mean = (1/n) Σ_{i=1}^n y^(i). Note that µ̂_mean is unbiased (E[µ̂_mean] = µ).

We can define σ_f² = Var(f(z)) as the variance of the human metric and σ_a² = E_z[Var(Y(z))] as the variance of human judgment averaged over Z. By the law of total variance, the variance of our estimator is

    Var(µ̂_mean) = (1/n)(σ_f² + σ_a²).    (2)

2 Note that this independence assumption isn't quite true in practice since we do not control who annotates our data.

3.2 Control variates estimator

Now let us see how an automatic metric g can reduce variance. If there is no annotator variance (σ_a² = 0) so that Y(z) = f(z), we should expect the variance of f(z) − g(z) to be lower than the variance of f(z), assuming g is correlated with f—see Figure 2 for an illustration.

The actual control variates estimator needs to handle noisy Y(z) (i.e. σ_a² > 0) and guard against a g(z) with low correlation. Let us standardize g to have zero mean and unit variance, because we have assumed it is free to evaluate. As before, let z^(1), ..., z^(n) be independent samples from Z and draw y^(i) = Y(z^(i)) independently as well. We define the control variates estimator as

    µ̂_cv = (1/n) Σ_{i=1}^n (y^(i) − α g(z^(i))),    (3)

where

    α = Cov(f(z), g(z)).    (4)

Intuitively, we have averaged over y^(i) to handle the noise introduced by Y(z), and scaled g(z) to prevent an uncorrelated automatic metric from introducing too much noise.

An important quantity governing the quality of an automatic metric g is the correlation between f(z) and g(z) (recall that g has unit variance):

    ρ = α / σ_f.    (5)

We can show that among all distributions with fixed σ_f², σ_a², and α (equivalently ρ), this estimator is minimax optimal, i.e. it has the least variance among all unbiased estimators:

Theorem 3.1. Among all unbiased estimators that are functions of y^(i) and g(z^(i)), and for all distributions with a given σ_f², σ_a², and α,

    Var(µ̂_cv) = (1/n)(σ_f² (1 − ρ²) + σ_a²),    (6)

and no other estimator has a lower worst-case variance.
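The variance formulas (2) and (6) are easy to check numerically. Below is a small synthetic sketch (not from the paper; the population, the automatic metric, and the noise level are invented purely for illustration) that simulates both estimators and compares their empirical variances against the two formulas:

    import numpy as np

    rng = np.random.default_rng(0)

    N = 100_000                                  # |Z|, a made-up pool of system outputs
    f = rng.normal(0.5, 0.4, N)                  # human metric f(z), sigma_f ~ 0.4
    g = 0.6 * (f - f.mean()) / f.std() + rng.normal(0.0, 0.8, N)
    g = (g - g.mean()) / g.std()                 # standardized automatic metric (rho ~ 0.6)
    sigma_a = 0.3                                # annotator noise standard deviation
    alpha = np.cov(f, g)[0, 1]                   # alpha = Cov(f(z), g(z))
    n = 200                                      # human judgments per simulated evaluation

    def run_once():
        idx = rng.integers(0, N, n)                    # z^(1), ..., z^(n), with replacement
        y = f[idx] + rng.normal(0.0, sigma_a, n)       # noisy human judgments Y(z)
        return y.mean(), np.mean(y - alpha * g[idx])   # sample mean vs. control variates (3)

    ests = np.array([run_once() for _ in range(5000)])
    rho = alpha / f.std()
    print("empirical variances (mean, cv):", ests.var(axis=0))
    print("equation (2):", (f.var() + sigma_a**2) / n)
    print("equation (6):", (f.var() * (1 - rho**2) + sigma_a**2) / n)

With these invented settings (ρ ≈ 0.6, γ ≈ 0.56), the control variates estimator's empirical variance comes out roughly 20–25% below the sample mean's, in line with (6).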

[Figure 2: illustration of the control variates idea, showing f(z), g(z), the mean µ, and samples of f(z) alongside samples of f(z) − g(z).]

Figure 2: The samples from f(z) have a higher variance than the samples from f(z) − g(z) but the same mean. This is the key idea behind using control variates to reduce variance.

[Figure 3: contour plot of inverse data efficiency as a function of the normalized annotator variance (γ, horizontal axis) and the automatic metric correlation (ρ, vertical axis).]

Figure 3: Inverse data efficiency for various values of γ and ρ. We need both low γ and high ρ to obtain significant gains.

Comparing the variances of the two estimators ((2) and (6)), we define the data efficiency as the ratio of the variances:

    DE = Var(µ̂_mean) / Var(µ̂_cv) = (1 + γ) / (1 − ρ² + γ),    (7)

where γ = σ_a² / σ_f² is the normalized annotator variance. Data efficiency is the key quantity in this paper: it is the multiplicative reduction in the number of samples required when using the control variates estimator µ̂_cv versus the sample mean µ̂_mean. Figure 3 shows the inverse data efficiency contours as a function of the correlation ρ and γ.

When there is no correlation between human and automatic metrics (ρ = 0), the data efficiency is naturally 1 (no gain). In order to achieve a data efficiency of 2 (half the labeling cost), we need |ρ| ≥ √2/2 ≈ 0.707. Interestingly, even for an automatic metric with perfect correlation (ρ = 1), the data efficiency is still capped by (1 + γ)/γ: unless γ → 0, the data efficiency cannot increase unboundedly. Intuitively, even if we knew that ρ = 1, f(z) would be undetermined up to a constant additive shift, and just estimating the shift would incur a variance of (1/n)σ_a².
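As a quick illustration of equation (7), the following throwaway snippet plugs in a few (ρ, γ) pairs; the values are taken from Figure 1b and Table 2 rather than being new measurements:

    def data_efficiency(rho, gamma):
        """Equation (7): DE = (1 + gamma) / (1 - rho**2 + gamma)."""
        return (1 + gamma) / (1 - rho**2 + gamma)

    print(data_efficiency(rho=0.707, gamma=0.0))   # ~2.0: half the labeling cost (best case, gamma = 0)
    print(data_efficiency(rho=1.0,   gamma=0.36))  # ~3.8: the (1 + gamma)/gamma cap for the Edit prompt
    print(data_efficiency(rho=0.31,  gamma=0.95))  # ~1.05: low correlation and high gamma give almost no gain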
3.3 Using the control variates estimator

The control variates estimator can be easily integrated into an existing evaluation: we run human evaluation on a random sample of system outputs, run automatic evaluation on all the system outputs, and plug these results into Algorithm 1.

It is vital that we are able to evaluate the automatic metric on a significantly larger set of examples than those with human evaluations in order to reliably normalize g(z): without these additional examples, it can be shown that the optimal minimax estimator for µ is simply the naive estimate µ̂_mean. Intuitively, this is because estimating the mean of g(z) incurs an equally large variance as estimating µ. In other words, g(z) is only useful if we have additional information about g beyond the samples {z^(i)}.

Algorithm 1 shows the estimator. In practice, we do not know α = Cov(f(z), g(z)), so we use a plug-in estimate α̂ in line 3 to compute the estimate µ̃ in line 4. We note that estimating α from data does introduce an O(1/n) bias, but compared to the standard deviation, which decays as Θ(1/√n), this bias quickly goes to 0.

Proposition 3.1. The estimator µ̃ in Algorithm 1 has O(1/n) bias.

Algorithm 1 Control variates estimator
1: Input: n human evaluations y^(i) on system outputs z^(i), normalized automatic metric g
2: ȳ = (1/n) Σ_i y^(i)
3: α̂ = (1/n) Σ_i (y^(i) − ȳ) g(z^(i))
4: µ̃ = (1/n) Σ_i (y^(i) − α̂ g(z^(i)))
5: return µ̃

An additional question that arises when applying Algorithm 1 is figuring out how many samples n to use. Given a target variance, the number of samples can be estimated using (6) with conservative estimates of σ_f², σ_a², and ρ. Alternatively, our estimator can be combined with a dynamic stopping rule (Mnih et al., 2008) to stop data collection once we reach a target confidence interval.
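A minimal Python sketch of Algorithm 1 follows; the function and argument names are our own, assumed for illustration, and this is not the authors' released code. The only extra ingredient is the full set of automatic metric values, which is needed to normalize g as discussed above:

    import numpy as np

    def control_variates_estimate(y, g_sampled, g_all):
        """Sketch of Algorithm 1: y are human judgments y^(i) on the sampled
        outputs z^(i); g_sampled are the automatic metric values on those same
        outputs; g_all are metric values on the full (much larger) set of system
        outputs, used only to normalize g to zero mean and unit variance."""
        y = np.asarray(y, dtype=float)
        g = (np.asarray(g_sampled, dtype=float) - np.mean(g_all)) / np.std(g_all)
        y_bar = y.mean()                           # line 2
        alpha_hat = np.mean((y - y_bar) * g)       # line 3: plug-in estimate of alpha
        return np.mean(y - alpha_hat * g)          # line 4: the debiased estimate mu-tilde

For example, control_variates_estimate(y, rouge_on_sampled, rouge_on_all) would return µ̃ for a ROUGE-based g (the variable names here are hypothetical).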
Task        Eval. prompt   σ_a²    σ_f²    γ = σ_a²/σ_f²
CDM         Fluency        0.32    0.26    1.23
CDM         Redund.        0.26    0.43    0.61
CDM         Overall        0.28    0.28    1.00
CDM         Edit           0.07    0.18    0.36
MS MARCO    AnyCorr.       0.14    0.15    0.95
MS MARCO    AvgCorr.       0.12    0.13    0.91

Table 2: A summary of the key statistics, human metric variance (σ_f²) and annotator variance (σ_a²), for the different datasets, CNN/Daily Mail (CDM) and MS MARCO, in our evaluation benchmark. We observe that the relative variance (γ) is fairly high for most evaluation prompts, upper bounding the data efficiency on these tasks. A notable exception is the Edit prompt, wherein systems are compared on the number of post-edits required to improve their quality.
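The γ column directly caps the achievable gains: by equation (7), even a perfectly correlated metric (ρ = 1) yields a data efficiency of at most (1 + γ)/γ. The back-of-the-envelope computation below is a sketch for illustration, not an experiment from the paper, and uses the aggregate prompt-level γ values reported in Table 2 (per-system numbers can differ):

    # Perfect-metric cap on data efficiency, (1 + gamma)/gamma, from Table 2's gamma column.
    gammas = {
        "CDM Fluency": 1.23, "CDM Redund.": 0.61, "CDM Overall": 1.00,
        "CDM Edit": 0.36, "MS MARCO AnyCorr.": 0.95, "MS MARCO AvgCorr.": 0.91,
    }
    for prompt, gamma in gammas.items():
        print(f"{prompt}: max DE = {(1 + gamma) / gamma:.2f}")
    # The Edit prompt leaves the most headroom (about 3.8x); the other prompts cap
    # out between roughly 1.8x and 2.6x, regardless of how good the metric is.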
3.4 Discussion of assumptions

We will soon see that empirical instantiations of γ and ρ lead to rather underwhelming data efficiencies in practice. In light of our optimality result, does this mean there is no hope for gains? Let us probe our assumptions. We assumed that the human judgments are uncorrelated across different system outputs; it is possible that a more accurate model of human annotators (e.g. Passonneau and Carpenter (2014)) could offer improvements. Perhaps with additional information about g(z), such as calibrated confidence estimates, we would be able to sample more adaptively. Of course, the most direct routes to improvement involve increasing the correlation of g with human judgments and reducing annotator variance, which we will discuss more later.

4 Tasks and datasets

In order to compare different approaches to evaluating systems, we first collected human judgments for the output of several automatic summarization and open-response question answering systems using Amazon Mechanical Turk. Details of the instructions provided and the quality assurance steps taken are provided in Appendix A of the supplementary material. In this section, we'll briefly describe how we collected this data.

Evaluating language quality in automatic summarization. In automatic summarization, systems must generate a short (on average two or three sentence) summary of an article: for our study, we chose articles from the CNN/Daily Mail (CDM) dataset (Hermann et al., 2015; Nallapati et al., 2016), which come paired with reference summaries in the form of story highlights. We focus on the language quality of summaries and leave evaluating content selection to future work.

For each summary, we collected human judgments on a scale from 1–3 (Figure 4a) for fluency, (lack of) redundancy, and overall quality of the summary, using guidelines from the DUC summarization challenge (Dang, 2006). As an alternate human metric, we also asked workers to post-edit the system's summary to improve its quality, similar to the post-editing step in MT evaluations (Snover et al., 2006). Obtaining judgments costs about $0.15 per summary, and this cost rises to about $0.40 per summary for post-editing.

We collected judgments on the summaries generated by the seq2seq and pointer models of See et al. (2017), the ml and ml+rl models of Paulus et al. (2018), and the reference summaries.3 Before presenting the summaries to human annotators, we performed some minimal post-processing: we true-cased and de-tokenized the output of seq2seq and pointer using Stanford CoreNLP (Manning et al., 2014) and replaced "unknown" tokens in each system with a special symbol.

Evaluating answer correctness. Next, we look at evaluating the correctness of system outputs in question answering using the MS MARCO question answering dataset (Nguyen et al., 2016). Here, each system is provided with a question and up to 10 paragraphs of context. The system generates open-response answers that do not need to be tied to a span in any paragraph.

We first ask annotators to judge if the output is even plausible for the question, and if yes, ask them to identify if it is correct according to each context paragraph. We found that requiring annotators to highlight regions in the text that support their decision substantially improved the quality of the output without increasing costs. Annotations cost $0.40 per system response.4

3 All system output was obtained from the original authors through private communication.
4 This cost could be significantly reduced if systems also specify which passage they used to generate the answer.
Figure 4: Screenshots of the annotation interfaces we used to measure (a) summary language quality on CNN/Daily Mail and (b) answer correctness on MS MARCO.

While our goal is to evaluate the correctness of the provided answer, we found that there are often answers which may be correct or incorrect depending on the context. For example, the question "what is a pothole" is typically understood to refer to a hole in a roadway, but it also refers to a geological feature (Figure 4b). This is reflected when annotators mark one context paragraph to support the given answer but mark another to contradict it. We evaluated systems based on both the average correctness (AvgCorrect) of their answers across all paragraphs as well as whether their answer is correct according to any paragraph (AnyCorrect).

We collected annotations on the outputs generated by the fastqa and fastqa ext models from Weissenborn et al. (2017) and the snet and snet.ens(emble) models from Tan et al. (2018), along with the reference answers. The answers generated by the systems were used without any post-processing. Surprisingly, we found that the correctness of the reference answers (according to the AnyCorrect metric) was only 73.5%, just 2% above that of the leading system (snet.ens).

We manually inspected 30 reference answers which were annotated incorrectly and found that, of those, about 95% were indeed incorrect. However, 62% are actually answerable from some paragraph, indicating that the real ceiling performance on this dataset is around 90% and that there is still room for improvement on this task.

5 Experimental results

We are now ready to evaluate the performance of our control variates estimator proposed in Section 3 using the datasets presented in Section 4. Recall that our primary quantity of interest is data efficiency, the ratio of the number of human judgments required to estimate the overall human evaluation score for the control variates estimator versus the sample mean. We'll briefly review the automatic metrics used in our evaluation before analyzing the results.

Automatic metrics. We consider the following frequently used word-overlap-based automatic metrics in our work: BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004) and METEOR (Lavie and Denkowski, 2009). Following Novikova et al. (2017) and Liu et al. (2016b), we also compared a vector-based sentence similarity computed using sent2vec (Pagliardini et al., 2017) to compare sentences (VecSim). Figure 5 shows how each of these metrics is correlated with human judgment for the systems being evaluated. Unsurprisingly, the correlation varies considerably across systems, with token-based metrics correlating more strongly for systems that are more extractive in nature (fastqa and fastqa ext).

Results.5 In Section 3 we proved that the control variates estimator is not only unbiased but also has the least variance among unbiased estimators. Figure 6 plots the width of the 80% confidence interval, estimated using the bootstrap, as a function of the number of samples collected for different tasks and prompts. As expected, the control variates estimator reduces the width of the confidence interval. We measure data efficiency by averaging the ratio of squared confidence intervals between the human baseline and the control variates estimates. We observe that the data efficiency depends on the task, prompt and system, ranging from about 1.08 (a 7% cost reduction) to 1.15 (a 13% cost reduction) using current automatic metrics.

5 Extended results for other systems, metrics and prompts can be found at https://bit.ly/price-of-debiasing/.
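Concretely, such interval widths could be computed with a standard nonparametric bootstrap over the human-judged sample, along the lines of the sketch below. This is our own rough approximation with assumed function names, not the authors' evaluation code; g is assumed to be pre-normalized against the full output set as in Algorithm 1:

    import numpy as np

    def bootstrap_ci_width(y, g, n_boot=1000, level=0.80, seed=0):
        """Width of a bootstrap confidence interval for the control variates
        estimate; y are human judgments and g the already-normalized automatic
        metric values on the same sampled outputs."""
        rng = np.random.default_rng(seed)
        y, g = np.asarray(y, dtype=float), np.asarray(g, dtype=float)
        n, estimates = len(y), []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)               # resample judged outputs
            yb, gb = y[idx], g[idx]
            alpha_hat = np.mean((yb - yb.mean()) * gb)
            estimates.append(np.mean(yb - alpha_hat * gb))
        lo, hi = np.quantile(estimates, [(1 - level) / 2, (1 + level) / 2])
        return hi - lo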

[Figure 5: Pearson correlation (ρ) between each automatic metric (VecSim, BLEU-2, METEOR, ROUGE-2, ROUGE-1, ROUGE-L) and human judgment, shown per system and for all systems combined. Panel (a): MS MARCO with the AnyCorrect prompt. Panel (b): CNN/Daily Mail with the Edit prompt.]

Figure 5: Correlations of different automatic metrics on the MS MARCO and CNN/Daily Mail tasks. Certain systems are more correlated with certain automatic metrics than others, but overall the correlation is low to moderate for most systems and metrics.

As we showed in Section 3, further gains are fundamentally limited by the quality of the evaluation prompts and automatic metrics. Figures 6a and 6b show how improving the quality of the evaluation prompt from a Likert-scale prompt for quality (Overall) to post-editing (Edit) noticeably decreases variance and hence allows better automatic metrics to increase data efficiency. Likewise, Figure 6c shows how using a better automatic metric (ROUGE-L instead of VecSim) also reduces variance.

Figure 6 also shows the conjectured confidence intervals if we were able to eliminate noise in human judgments (noiseless humans) or had an automatic metric that correlated perfectly with average human judgment (perfect metric). In particular, we use the mean of all (2–3) humans on each z for the perfect g(z) and use the mean of all humans on each z for the "noiseless" Y(z).

In both cases, we are able to significantly increase data efficiency (i.e. decrease estimator variance). With zero annotator variance and using existing automatic metrics, the data efficiency ranges from 1.42 to 1.69. With automatic metrics with perfect correlation and the current variance of human judgments, it ranges from 2.38 to 7.25. Thus, we conclude that it is important not only to improve our automatic metrics but also the evaluation prompts we use during human evaluation.

6 Related work

In this work, we focus on using existing automatic metrics to decrease the cost of human evaluations. There has been much work on improving the quality of automatic metrics. In particular, there is interest in learning models (Lowe et al., 2017a; Dusek et al., 2017) that are able to optimize for improved correlations with human judgment. However, in our experience, we have found that these learned automatic metrics have trouble generalizing to different systems. The framework we provide allows us to safely incorporate such models into evaluation, exploiting them when their correlation is high but not introducing bias when it is low.

Our key technical tool is control variates, a standard statistical technique used to reduce the variance of Monte Carlo estimates (Ripley, 2009). The technique has also been used in machine learning and reinforcement learning to lower variance estimates of gradients (Greensmith et al., 2004; Paisley et al., 2012; Ranganath et al., 2014). To the best of our knowledge, we are the first to apply this technique in the context of language evaluation.

Our work also highlights the importance of human evaluation. Chaganty et al. (2017) identified a similar problem of systematic bias in evaluation metrics in the setting of knowledge base population and also propose statistical estimators that rely on human evaluation to correct the bias. Unfortunately, their technique relies on having structured outputs (relation triples) that are shared between systems and does not apply to evaluating natural language generation. In a similar vein, Chang et al. (2017) dynamically collect human feedback to learn better dialog policies.

[Figure 6: 80% confidence interval width versus number of human judgments for three settings, comparing humans alone, humans plus an automatic metric (VecSim or ROUGE-1), noiseless humans plus VecSim, and humans plus a perfect metric. Panel (a): seq2seq on CNN/Daily Mail with the Overall prompt. Panel (b): seq2seq on CNN/Daily Mail with the Edit prompt. Panel (c): fastqa ext on MS MARCO with the AnyCorrect prompt.]

Figure 6: 80% bootstrap confidence interval length as a function of the number of human judgments used when evaluating the indicated systems on their respective datasets and prompts. (a) We see a modest reduction in variance (and hence cost) relative to human evaluation by using the VecSim automatic metric with the proposed control variates estimator to estimate Overall scores on the CNN/Daily Mail task; the data efficiency (DE) is 1.06. (b) By improving the evaluation prompt to use Edit instead, it is possible to further reduce variance relative to humans (DE is 1.15). (c) Another way to reduce variance relative to humans is to improve the automatic metric; here, using ROUGE-1 instead of VecSim improves the DE from 1.03 to 1.16.

7 Discussion

Prior work has shown that existing automatic metrics have poor instance-level correlation with mean human judgment and that they score many good-quality responses poorly. As a result, evaluation is systematically biased against genuine system improvements that would lead to higher human evaluation scores but not improve automatic metrics. In this paper, we have explored using an automatic metric to decrease the cost of human evaluation without introducing bias. In practice, we find that with current automatic metrics and evaluation prompts, data efficiencies are only 1.08–1.15 (a 7–13% cost reduction). Our theory shows that further improvements are only possible by improving the correlation of the automatic metric and reducing the annotator variance of the evaluation prompt. As an example of how evaluation prompts could be improved, we found that using post-edits of summaries decreased normalized annotator variance by a factor of three relative to using a Likert-scale survey. It should be noted that changing the evaluation prompt also changes the underlying ground truth f(z): it is up to us to find a prompt that still captures the essence of what we want to measure.

Without making stronger assumptions, the control variates estimator we proposed outlines the limitations of unbiased estimation. Where do we go from here? Certainly, we can try to improve the automatic metric (which is potentially as difficult as solving the task itself) and brainstorm alternative ways of soliciting evaluation (which has been less explored). Alternatively, we could give up on measuring absolute scores and seek instead to find techniques that stably rank methods and thus improve them. As the NLP community tackles increasingly difficult tasks, human evaluation will only become more important. We hope our work provides some clarity on how to make it more cost effective.

Reproducibility

All code, data, and experiments for this paper are available on the CodaLab platform at https://bit.ly/price-of-debiasing.

Acknowledgments

We are extremely grateful to the authors of the systems we evaluated for sharing their systems' output with us. We would also like to thank Urvashi Khandelwal and Peng Qi for feedback on an earlier draft of the paper, the crowdworkers on Amazon Mechanical Turk and TurkNation for their work and feedback during the data collection process, and the anonymous reviewers for their constructive feedback.

References

A. Chaganty, A. Paranjape, P. Liang, and C. Manning. 2017. Importance sampling for unbiased on-demand evaluation of knowledge base population. In Empirical Methods in Natural Language Processing (EMNLP).

C. Chang, R. Yang, L. Chen, X. Zhou, and K. Yu. 2017. Affordable on-line dialogue policy learning. In Empirical Methods in Natural Language Processing (EMNLP), pages 223–231.

J. M. Conroy and H. T. Dang. 2008. Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality. In International Conference on Computational Linguistics (COLING), pages 145–152.

H. T. Dang. 2006. Overview of DUC 2006. In Document Understanding Conference.

M. Denkowski and A. Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Workshop on Statistical Machine Translation.

O. Dušek, J. Novikova, and V. Rieser. 2017. Referenceless quality estimation for natural language generation. arXiv.

E. Greensmith, P. L. Bartlett, and J. Baxter. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research (JMLR) 5:1471–1530.

K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.

A. Lavie and M. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation 23.

C. Lin and M. Rey. 2004. Looking for a few good metrics: ROUGE and its evaluation. In NTCIR Workshop.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755.

A. Liu, S. Soderland, J. Bragg, C. H. Lin, X. Ling, and D. S. Weld. 2016a. Effective crowd annotation for relation extraction. In North American Association for Computational Linguistics (NAACL), pages 897–906.

C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau. 2016b. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Empirical Methods in Natural Language Processing (EMNLP).

R. Lowe, M. Noseworthy, I. V. Serban, N. Angelard-Gontier, Y. Bengio, and J. Pineau. 2017a. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Association for Computational Linguistics (ACL).

R. T. Lowe, N. Pow, I. Serban, L. Charlin, C. Liu, and J. Pineau. 2017b. Training end-to-end dialogue systems with the Ubuntu dialogue corpus. Dialogue and Discourse 8.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.

V. Mnih, C. Szepesvári, and J. Audibert. 2008. Empirical Bernstein stopping. In International Conference on Machine Learning (ICML).

R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop on Cognitive Computing at NIPS.

J. Novikova, O. Dušek, A. C. Curry, and V. Rieser. 2017. Why we need new evaluation metrics for NLG. In Empirical Methods in Natural Language Processing (EMNLP).

M. Pagliardini, P. Gupta, and M. Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv.

J. Paisley, D. M. Blei, and M. I. Jordan. 2012. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning (ICML), pages 1363–1370.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL).

R. J. Passonneau and B. Carpenter. 2014. The benefits of a model of annotation. In Association for Computational Linguistics (ACL).

R. Paulus, C. Xiong, and R. Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR).

R. Ranganath, S. Gerrish, and D. Blei. 2014. Black box variational inference. In Artificial Intelligence and Statistics (AISTATS), pages 814–822.

B. D. Ripley. 2009. Stochastic Simulation. John Wiley & Sons.

A. See, P. J. Liu, and C. D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics (ACL).

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Association for Machine Translation in the Americas, pages 223–231.

C. Tan, F. Wei, N. Yang, W. Lv, and M. Zhou. 2018. S-Net: From answer extraction to answer generation for machine reading comprehension. In Association for the Advancement of Artificial Intelligence (AAAI).

R. Vedantam, C. L. Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Computer Vision and Pattern Recognition (CVPR), pages 4566–4575.

D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Computational Natural Language Learning (CoNLL).

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
