
The price of debiasing automatic metrics in natural language evaluation

Arun Tejasvi Chaganty∗ and Stephen Mussmann∗ and Percy Liang


Computer Science Department, Stanford University
{chaganty,mussmann,pliang}@cs.stanford.edu

Abstract

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7–13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks—the automatic metric and the prompt shown to human evaluators—both of which need to be improved to obtain greater cost savings.

1 Introduction

In recent years, there has been an increasing interest in tasks that require generating natural language, including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočisky et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, the evaluation of these systems remains a thorny issue because of the diversity of possible correct responses. As the gold standard of performing human evaluation is often too expensive, there has been a large effort developing automatic metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004), METEOR (Lavie and Denkowski, 2009; Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). However, these have been shown to be biased, correlating poorly with human metrics across different datasets and systems (Liu et al., 2016b; Novikova et al., 2017).

Can we combine automatic metrics and human evaluation to obtain an unbiased estimate at lower cost than human evaluation alone? In this paper, we propose a simple estimator based on control variates (Ripley, 2009), where we average differences between human judgments and automatic metrics rather than averaging the human judgments alone. Provided the two are correlated, our estimator will have lower variance and thus reduce cost.

We prove that our estimator is optimal in the sense that no unbiased estimator using the same automatic metric can have lower variance. We also analyze its data efficiency (equivalently, cost savings)—the factor reduction in the number of human judgments needed to obtain the same accuracy versus naive human evaluation—and show that it depends solely on two factors: (a) the annotator variance (which is a function of the human evaluation prompt) and (b) the correlation between human judgments and the automatic metric. This factorization allows us to calculate typical and best-case data efficiencies and accordingly refine the evaluation prompt or automatic metric.

Finally, we evaluate our estimator on state-of-the-art systems from two tasks, summarization on the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016) and open-response question answering on the MS MARCO v1.0 dataset (Nguyen et al., 2016). To study our estimators offline, we preemptively collected 10,000 human judgments which cover several tasks and systems.1 As predicted by the theory, we find that the data efficiency depends not only on the correlation between the human and automatic metrics, but also on the evaluation prompt. If the automatic metric had perfect correlation, our data efficiency would be around 3, while if we had noiseless human judgments, our data efficiency would be about 1.5. In reality, the reduction in cost we obtained was only about 10%, suggesting that improvements in both the automatic metric and the evaluation prompt are needed. As one case study in improving the latter, we show that, when compared to a Likert survey, measuring the amount of post-editing needed to fix a generated sentence reduced the annotator variance by three-fold.

∗ Authors contributed equally.
1 An anonymized version of this data and the annotation interfaces used can be found at https://bit.ly/price-of-debiasing.
[Figure 1: scatter plots of ROUGE-L against human judgment on the MS MARCO task. Panel (a): system-level correlation for the fastqa, fastqa ext, snet, and snet.ens systems. Panel (b): instance-level correlation for the fastqa system.]

Figure 1: (a) At a system level, automatic metrics (ROUGE-L) and human judgment correlate well, but (b) the instance-level correlation plot (where each point is a system prediction) shows that the instance-level correlation is quite low (ρ = 0.31). As a consequence, if we try to locally improve systems to produce better answers (one set of marked points in (a)), they do not significantly improve ROUGE scores, and vice versa (the other set).

2 Bias in automatic evaluation

It is well understood that current automatic metrics tend to correlate poorly with human judgment at the instance level. For example, Novikova et al. (2017) report correlations less than 0.3 for a large suite of word-based and grammar-based evaluation methods on a generation task. Similarly, Liu et al. (2016b) find correlations less than 0.35 for automatic metrics on a dialog generation task in one domain, and find that correlations with the same metric drop to less than 0.16 when it is used in another domain. Still, somewhat surprisingly, several automatic metrics have been found to have high system-level correlations (Novikova et al., 2017). What, then, are the implications of having a low instance-level correlation?

As a case study, consider the task of open-response question answering: here, a system receives a human-generated question and must generate an answer from some given context, e.g. a document or several webpages. We collected the responses of several systems on the MS MARCO v1 dataset (Nguyen et al., 2016) and crowdsourced human evaluations of the system output (see Section 4 for details).

The instance-level correlation (Figure 1b) is only ρ = 0.31. A closer look reveals that while ROUGE is able to correctly assign low scores to bad examples (lower left), it is bad at judging good examples and often assigns them low ROUGE scores (lower right)—see Table 1 for examples. This observation agrees with a finding reported in Novikova et al. (2017) that automatic metrics correlate better with human judgments on bad examples than on average or good examples.
(a) MS MARCO. Human annotators rated answer correctness (AnyCorrect); the automatic metric is ROUGE-L (higher is better).

Examples where the system is correct and ROUGE-L > 0.5 (19.6%, or 285 of 1455 unique responses):
  Q: what is anti-mullerian hormone
  Reference: Anti-Mullerian Hormone (AMH) is a protein hormone produced by granulosa cells (cells lining the egg sacs or follicles) within the ovary.
  System: it is a protein hormone produced by granulosa cells (cells lining the egg sacs or follicles) within the ovary. (snet.ens; ✓ / 0.86)

Examples where the system is incorrect and ROUGE-L > 0.5 (1.3%, or 19 of 1455 unique responses):
  Q: at what gestational age can you feel a fetus move
  Reference: 37 to 41 weeks (incorrect reference answer)
  System: 37 to 41 weeks (fastqa, fastqa.ext; ✗ / 1.0)

Examples where the system is correct and ROUGE-L < 0.5 (56.0%, or 815 of 1455 unique responses):
  Q: what is the definition of onomatopoeia
  Reference: It is defined as a word, which imitates the natural sounds of a thing.
  System: the naming of a thing or action by a vocal imitation of the sound associated with it (as buzz, hiss). (fastqa; ✓ / 0.23)

Examples where the system is incorrect and ROUGE-L < 0.5 (23.1%, or 336 of 1455 unique responses):
  Q: what kind root stem does a dandelion have
  Reference: Fibrous roots and hollow stem.
  System: vitamin a, vitamin c, vitamin d and vitamin b complex, as well as zinc, iron and potassium. (snet, snet.ens; ✗ / 0.09)

(b) CNN/Daily Mail. Human judgment scores are post-edit distance (Edit, lower is better); the automatic metric is sentence vector similarity with the reference (VecSim, higher is better).

Examples where system Edit < 0.3 and VecSim > 0.5 (53.9%, or 1078 of 2000 responses):
  Reference: Bhullar is set to sign a -day contract with the Kings. The -year-old will become the NBA's first player of Indian descent. Bhullar will be on the roster when the Kings host New Orleans Pelicans.
  System: Bhullar and The Kings are signing Bhullar to a -day contract. The -year-old will be on the roster on friday when David Wear's -season contract expires thursday. Bhullar is set to become the NBA's first player of Indian descent. (ml; 0.13 / 0.82)

Examples where system Edit > 0.3 and VecSim > 0.5 (18.0%, or 360 of 2000 responses):
  Reference: The Direct Marketing Commission probing B2C Data and Data Bubble. Investigating whether they breached rules on the sale of private data. Chief commissioner described allegations made about firms as 'serious'.
  System: Data obtained by the Mail's marketing commission said it would probe both companies over claims that they had breached the rules on the sale of private data. The FSA said it would probe both companies over claims they had breached the rules on the sale of private data. (seq2seq; 1.00 / 0.72)

Examples where system Edit < 0.3 and VecSim < 0.5 (14.5%, or 290 of 2000 responses):
  Reference: Death toll rises to more than . Pemba Tamang, , shows no apparent signs of serious injury after rescue. Six of Americans special forces helicopter , including Americans, to safety.
  System: Despite Nepal's tragedy, life triumphed in Kathmandu's hard-hit neighborhoods. Rescuers pulled an 15-year-old from the rubble of a multistory residential building. He was wearing a New York shirt and a blue neck brace. (pointer; 0.04 / 0.27)

Examples where system Edit > 0.3 and VecSim < 0.5 (13.6%, or 272 of 2000 responses):
  Reference: "Mad Men's" final seven episodes begin airing April . The show has never had high ratings but is considered one of the great TV series. It's unknown what will happen to characters, but we can always guess.
  System: 'This's "Mad Men" is the end of a series of an era', This he says. Stores have created fashion lines inspired by the show. "The Sopranos". The in the Kent State shootings in may or Richard Nixonś re-election.. (ml+rl; 0.95 / 0.24)

Table 1: Examples highlighting the different modes in which the automatic metric and human judgments may agree or disagree. On the MS MARCO task, a majority of responses from systems were actually correct but poorly scored according to ROUGE-L. On the CNN/Daily Mail task, a significant number of examples which are scored highly by VecSim are poorly rated by humans, and likewise many examples scored poorly by VecSim are highly rated by humans.

Thus, as Figure 1(a) shows, we can improve low-scoring ROUGE examples without improving their human judgment, and vice versa. Indeed, Conroy and Dang (2008) report that summarization systems were optimized for ROUGE during the DUC challenge (Dang, 2006) until they were indistinguishable from the ROUGE scores of human-generated summaries, but the systems had hardly improved on human evaluation. Hill-climbing on ROUGE can also lead to a system that does worse on human scores, e.g. in machine translation (Wu et al., 2016). Conversely, genuine quality improvements might not be reflected in improvements in ROUGE. This bias also appears in pool-based evaluation for knowledge base population (Chaganty et al., 2017). Thus the problems with automatic metrics clearly motivate the need for human evaluation, but can we still use the automatic metrics somehow to save costs?

3 Statistical estimation for unbiased evaluation

We will now formalize the problem of combining human evaluation with an automatic metric. Let X be a set of inputs (e.g., articles), and let S be the system (e.g. for summarization), which takes an x ∈ X and returns output S(x) (e.g. a summary). Let Z = {(x, S(x)) : x ∈ X} be the set of system predictions. Let Y(z) be the random variable representing the human judgment according to some evaluation prompt (e.g. grammaticality or correctness), and define f(z) = E[Y(z)] to be the (unknown) human metric corresponding to averaging over an infinite number of human judgments. Our goal is to estimate the average across all examples,

    µ = E_z[f(z)] = (1/|Z|) Σ_{z ∈ Z} f(z),    (1)

with as few queries to Y as possible.

Let g be an automatic metric (e.g. ROUGE), which maps z to a real number. We assume evaluating g(z) is free. The central question is how to use g in conjunction with calls to Y to produce an unbiased estimate µ̂ (that is, E[µ̂] = µ). In this section, we will construct a simple estimator based on control variates (Ripley, 2009) and prove that it is minimax optimal.

3.1 Sample mean

We warm up with the most basic unbiased estimate, the sample mean. We sample z^(1), ..., z^(n) independently with replacement from Z. Then, we sample each human judgment y^(i) = Y(z^(i)) independently.2 Define the estimator to be µ̂_mean = (1/n) Σ_{i=1}^n y^(i). Note that µ̂_mean is unbiased (E[µ̂_mean] = µ).

We can define σ_f² = Var(f(z)) as the variance of the human metric and σ_a² = E_z[Var(Y(z))] as the variance of human judgment averaged over Z. By the law of total variance, the variance of our estimator is

    Var(µ̂_mean) = (1/n)(σ_f² + σ_a²).    (2)

2 Note that this independence assumption isn't quite true in practice since we do not control who annotates our data.

3.2 Control variates estimator

Now let us see how an automatic metric g can reduce variance. If there is no annotator variance (σ_a² = 0) so that Y(z) = f(z), we should expect the variance of f(z) − g(z) to be lower than the variance of f(z), assuming g is correlated with f—see Figure 2 for an illustration.

The actual control variates estimator needs to handle noisy Y(z) (i.e. σ_a² > 0) and guard against a g(z) with low correlation. Let us standardize g to have zero mean and unit variance, because we have assumed it is free to evaluate. As before, let z^(1), ..., z^(n) be independent samples from Z and draw y^(i) = Y(z^(i)) independently as well. We define the control variates estimator as

    µ̂_cv = (1/n) Σ_{i=1}^n (y^(i) − α g(z^(i))),    (3)

where

    α = Cov(f(z), g(z)).    (4)

Intuitively, we have averaged over y^(i) to handle the noise introduced by Y(z), and scaled g(z) to prevent an uncorrelated automatic metric from introducing too much noise.

An important quantity governing the quality of an automatic metric g is the correlation between f(z) and g(z) (recall that g has unit variance):

    ρ = α / σ_f.    (5)

We can show that among all distributions with fixed σ_f², σ_a², and α (equivalently ρ), this estimator is minimax optimal, i.e. it has the least variance among all unbiased estimators:

Theorem 3.1. Among all unbiased estimators that are functions of y^(i) and g(z^(i)), and for all distributions with a given σ_f², σ_a², and α,

    Var(µ̂_cv) = (1/n)(σ_f² (1 − ρ²) + σ_a²),    (6)

and no other estimator has a lower worst-case variance.
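The variance formulas (2) and (6) are easy to check numerically. Below is a small synthetic sketch (not from the paper; the population, the automatic metric, and the noise level are invented purely for illustration) that simulates both estimators and compares their empirical variances against the two formulas:

    import numpy as np

    rng = np.random.default_rng(0)

    N = 100_000                                  # |Z|, a made-up pool of system outputs
    f = rng.normal(0.5, 0.4, N)                  # human metric f(z), sigma_f ~ 0.4
    g = 0.6 * (f - f.mean()) / f.std() + rng.normal(0.0, 0.8, N)
    g = (g - g.mean()) / g.std()                 # standardized automatic metric (rho ~ 0.6)
    sigma_a = 0.3                                # annotator noise standard deviation
    alpha = np.cov(f, g)[0, 1]                   # alpha = Cov(f(z), g(z))
    n = 200                                      # human judgments per simulated evaluation

    def run_once():
        idx = rng.integers(0, N, n)                    # z^(1), ..., z^(n), with replacement
        y = f[idx] + rng.normal(0.0, sigma_a, n)       # noisy human judgments Y(z)
        return y.mean(), np.mean(y - alpha * g[idx])   # sample mean vs. control variates (3)

    ests = np.array([run_once() for _ in range(5000)])
    rho = alpha / f.std()
    print("empirical variances (mean, cv):", ests.var(axis=0))
    print("equation (2):", (f.var() + sigma_a**2) / n)
    print("equation (6):", (f.var() * (1 - rho**2) + sigma_a**2) / n)

With these invented settings (ρ ≈ 0.6, γ ≈ 0.56), the control variates estimator's empirical variance comes out roughly 20–25% below the sample mean's, in line with (6).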

[Figure 2: illustration of the control variates idea, showing f(z), g(z), the mean µ, and samples of f(z) alongside samples of f(z) − g(z).]

Figure 2: The samples from f(z) have a higher variance than the samples from f(z) − g(z) but the same mean. This is the key idea behind using control variates to reduce variance.

[Figure 3: contour plot of inverse data efficiency as a function of the normalized annotator variance (γ, horizontal axis) and the automatic metric correlation (ρ, vertical axis).]

Figure 3: Inverse data efficiency for various values of γ and ρ. We need both low γ and high ρ to obtain significant gains.

Comparing the variances of the two estimators ((2) and (6)), we define the data efficiency as the ratio of the variances:

    DE = Var(µ̂_mean) / Var(µ̂_cv) = (1 + γ) / (1 − ρ² + γ),    (7)

where γ = σ_a² / σ_f² is the normalized annotator variance. Data efficiency is the key quantity in this paper: it is the multiplicative reduction in the number of samples required when using the control variates estimator µ̂_cv versus the sample mean µ̂_mean. Figure 3 shows the inverse data efficiency contours as a function of the correlation ρ and γ.

When there is no correlation between human and automatic metrics (ρ = 0), the data efficiency is naturally 1 (no gain). In order to achieve a data efficiency of 2 (half the labeling cost), we need |ρ| ≥ √2/2 ≈ 0.707. Interestingly, even for an automatic metric with perfect correlation (ρ = 1), the data efficiency is still capped by (1 + γ)/γ: unless γ → 0, the data efficiency cannot increase unboundedly. Intuitively, even if we knew that ρ = 1, f(z) would be undetermined up to a constant additive shift, and just estimating the shift would incur a variance of (1/n)σ_a².
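As a quick illustration of equation (7), the following throwaway snippet plugs in a few (ρ, γ) pairs; the values are taken from Figure 1b and Table 2 rather than being new measurements:

    def data_efficiency(rho, gamma):
        """Equation (7): DE = (1 + gamma) / (1 - rho**2 + gamma)."""
        return (1 + gamma) / (1 - rho**2 + gamma)

    print(data_efficiency(rho=0.707, gamma=0.0))   # ~2.0: half the labeling cost (best case, gamma = 0)
    print(data_efficiency(rho=1.0,   gamma=0.36))  # ~3.8: the (1 + gamma)/gamma cap for the Edit prompt
    print(data_efficiency(rho=0.31,  gamma=0.95))  # ~1.05: low correlation and high gamma give almost no gain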
3.3 Using the control variates estimator

The control variates estimator can be easily integrated into an existing evaluation: we run human evaluation on a random sample of system outputs, run automatic evaluation on all the system outputs, and plug these results into Algorithm 1.

It is vital that we are able to evaluate the automatic metric on a significantly larger set of examples than those with human evaluations in order to reliably normalize g(z): without these additional examples, it can be shown that the optimal minimax estimator for µ is simply the naive estimate µ̂_mean. Intuitively, this is because estimating the mean of g(z) incurs an equally large variance as estimating µ. In other words, g(z) is only useful if we have additional information about g beyond the samples {z^(i)}.

Algorithm 1 shows the estimator. In practice, we do not know α = Cov(f(z), g(z)), so we use a plug-in estimate α̂ in line 3 to compute the estimate µ̃ in line 4. We note that estimating α from data does introduce an O(1/n) bias, but compared to the standard deviation, which decays as Θ(1/√n), this bias quickly goes to 0.

Proposition 3.1. The estimator µ̃ in Algorithm 1 has O(1/n) bias.

Algorithm 1 Control variates estimator
1: Input: n human evaluations y^(i) on system outputs z^(i), normalized automatic metric g
2: ȳ = (1/n) Σ_i y^(i)
3: α̂ = (1/n) Σ_i (y^(i) − ȳ) g(z^(i))
4: µ̃ = (1/n) Σ_i (y^(i) − α̂ g(z^(i)))
5: return µ̃

An additional question that arises when applying Algorithm 1 is figuring out how many samples n to use. Given a target variance, the number of samples can be estimated using (6) with conservative estimates of σ_f², σ_a², and ρ. Alternatively, our estimator can be combined with a dynamic stopping rule (Mnih et al., 2008) to stop data collection once we reach a target confidence interval.
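A minimal Python sketch of Algorithm 1 follows; the function and argument names are our own, assumed for illustration, and this is not the authors' released code. The only extra ingredient is the full set of automatic metric values, which is needed to normalize g as discussed above:

    import numpy as np

    def control_variates_estimate(y, g_sampled, g_all):
        """Sketch of Algorithm 1: y are human judgments y^(i) on the sampled
        outputs z^(i); g_sampled are the automatic metric values on those same
        outputs; g_all are metric values on the full (much larger) set of system
        outputs, used only to normalize g to zero mean and unit variance."""
        y = np.asarray(y, dtype=float)
        g = (np.asarray(g_sampled, dtype=float) - np.mean(g_all)) / np.std(g_all)
        y_bar = y.mean()                           # line 2
        alpha_hat = np.mean((y - y_bar) * g)       # line 3: plug-in estimate of alpha
        return np.mean(y - alpha_hat * g)          # line 4: the debiased estimate mu-tilde

For example, control_variates_estimate(y, rouge_on_sampled, rouge_on_all) would return µ̃ for a ROUGE-based g (the variable names here are hypothetical).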
Task        Eval. prompt   σ_a²    σ_f²    γ = σ_a²/σ_f²
CDM         Fluency        0.32    0.26    1.23
CDM         Redund.        0.26    0.43    0.61
CDM         Overall        0.28    0.28    1.00
CDM         Edit           0.07    0.18    0.36
MS MARCO    AnyCorr.       0.14    0.15    0.95
MS MARCO    AvgCorr.       0.12    0.13    0.91

Table 2: A summary of the key statistics, human metric variance (σ_f²) and annotator variance (σ_a²), for the different datasets, CNN/Daily Mail (CDM) and MS MARCO, in our evaluation benchmark. We observe that the relative variance (γ) is fairly high for most evaluation prompts, upper bounding the data efficiency on these tasks. A notable exception is the Edit prompt, wherein systems are compared on the number of post-edits required to improve their quality.
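The γ column directly caps the achievable gains: by equation (7), even a perfectly correlated metric (ρ = 1) yields a data efficiency of at most (1 + γ)/γ. The back-of-the-envelope computation below is a sketch for illustration, not an experiment from the paper, and uses the aggregate prompt-level γ values reported in Table 2 (per-system numbers can differ):

    # Perfect-metric cap on data efficiency, (1 + gamma)/gamma, from Table 2's gamma column.
    gammas = {
        "CDM Fluency": 1.23, "CDM Redund.": 0.61, "CDM Overall": 1.00,
        "CDM Edit": 0.36, "MS MARCO AnyCorr.": 0.95, "MS MARCO AvgCorr.": 0.91,
    }
    for prompt, gamma in gammas.items():
        print(f"{prompt}: max DE = {(1 + gamma) / gamma:.2f}")
    # The Edit prompt leaves the most headroom (about 3.8x); the other prompts cap
    # out between roughly 1.8x and 2.6x, regardless of how good the metric is.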
3.4 Discussion of assumptions

We will soon see that empirical instantiations of γ and ρ lead to rather underwhelming data efficiencies in practice. In light of our optimality result, does this mean there is no hope for gains? Let us probe our assumptions. We assumed that the human judgments are uncorrelated across different system outputs; it is possible that a more accurate model of human annotators (e.g. Passonneau and Carpenter (2014)) could offer improvements. Perhaps with additional information about g(z), such as calibrated confidence estimates, we would be able to sample more adaptively. Of course, the most direct routes to improvement involve increasing the correlation of g with human judgments and reducing annotator variance, which we will discuss more later.

4 Tasks and datasets

In order to compare different approaches to evaluating systems, we first collected human judgments for the output of several automatic summarization and open-response question answering systems using Amazon Mechanical Turk. Details of the instructions provided and the quality assurance steps taken are provided in Appendix A of the supplementary material. In this section, we'll briefly describe how we collected this data.

Evaluating language quality in automatic summarization. In automatic summarization, systems must generate a short (on average two or three sentence) summary of an article: for our study, we chose articles from the CNN/Daily Mail (CDM) dataset (Hermann et al., 2015; Nallapati et al., 2016), which come paired with reference summaries in the form of story highlights. We focus on the language quality of summaries and leave evaluating content selection to future work.

For each summary, we collected human judgments on a scale from 1–3 (Figure 4a) for fluency, (lack of) redundancy, and overall quality of the summary, using guidelines from the DUC summarization challenge (Dang, 2006). As an alternate human metric, we also asked workers to post-edit the system's summary to improve its quality, similar to the post-editing step in MT evaluations (Snover et al., 2006). Obtaining judgments costs about $0.15 per summary, and this cost rises to about $0.40 per summary for post-editing.

We collected judgments on the summaries generated by the seq2seq and pointer models of See et al. (2017), the ml and ml+rl models of Paulus et al. (2018), and the reference summaries.3 Before presenting the summaries to human annotators, we performed some minimal post-processing: we true-cased and de-tokenized the output of seq2seq and pointer using Stanford CoreNLP (Manning et al., 2014) and replaced "unknown" tokens in each system with a special symbol.

Evaluating answer correctness. Next, we look at evaluating the correctness of system outputs in question answering using the MS MARCO question answering dataset (Nguyen et al., 2016). Here, each system is provided with a question and up to 10 paragraphs of context. The system generates open-response answers that do not need to be tied to a span in any paragraph.

We first ask annotators to judge if the output is even plausible for the question, and if yes, ask them to identify if it is correct according to each context paragraph. We found that requiring annotators to highlight regions in the text that support their decision substantially improved the quality of the output without increasing costs. Annotations cost $0.40 per system response.4

3 All system output was obtained from the original authors through private communication.
4 This cost could be significantly reduced if systems also specify which passage they used to generate the answer.
Figure 4: Screenshots of the annotation interfaces we used to measure (a) summary language quality on CNN/Daily Mail and (b) answer correctness on MS MARCO.

While our goal is to evaluate the correctness of the provided answer, we found that there are often answers which may be correct or incorrect depending on the context. For example, the question "what is a pothole" is typically understood to refer to a hole in a roadway, but it also refers to a geological feature (Figure 4b). This is reflected when annotators mark one context paragraph to support the given answer but mark another to contradict it. We evaluated systems based on both the average correctness (AvgCorrect) of their answers across all paragraphs as well as whether their answer is correct according to any paragraph (AnyCorrect).

We collected annotations on the outputs generated by the fastqa and fastqa ext models from Weissenborn et al. (2017) and the snet and snet.ens(emble) models from Tan et al. (2018), along with the reference answers. The answers generated by the systems were used without any post-processing. Surprisingly, we found that the correctness of the reference answers (according to the AnyCorrect metric) was only 73.5%, just 2% above that of the leading system (snet.ens).

We manually inspected 30 reference answers which were annotated incorrectly and found that, of those, about 95% were indeed incorrect. However, 62% are actually answerable from some paragraph, indicating that the real ceiling performance on this dataset is around 90% and that there is still room for improvement on this task.

5 Experimental results

We are now ready to evaluate the performance of our control variates estimator proposed in Section 3 using the datasets presented in Section 4. Recall that our primary quantity of interest is data efficiency, the ratio of the number of human judgments required to estimate the overall human evaluation score for the control variates estimator versus the sample mean. We'll briefly review the automatic metrics used in our evaluation before analyzing the results.

Automatic metrics. We consider the following frequently used word-overlap-based automatic metrics in our work: BLEU (Papineni et al., 2002), ROUGE (Lin and Rey, 2004) and METEOR (Lavie and Denkowski, 2009). Following Novikova et al. (2017) and Liu et al. (2016b), we also compared a vector-based sentence similarity computed using sent2vec (Pagliardini et al., 2017) to compare sentences (VecSim). Figure 5 shows how each of these metrics is correlated with human judgment for the systems being evaluated. Unsurprisingly, the correlation varies considerably across systems, with token-based metrics correlating more strongly for systems that are more extractive in nature (fastqa and fastqa ext).

Results.5 In Section 3 we proved that the control variates estimator is not only unbiased but also has the least variance among unbiased estimators. Figure 6 plots the width of the 80% confidence interval, estimated using the bootstrap, as a function of the number of samples collected for different tasks and prompts. As expected, the control variates estimator reduces the width of the confidence interval. We measure data efficiency by averaging the ratio of squared confidence intervals between the human baseline and the control variates estimates. We observe that the data efficiency depends on the task, prompt and system, ranging from about 1.08 (a 7% cost reduction) to 1.15 (a 13% cost reduction) using current automatic metrics.

5 Extended results for other systems, metrics and prompts can be found at https://bit.ly/price-of-debiasing/.
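Concretely, such interval widths could be computed with a standard nonparametric bootstrap over the human-judged sample, along the lines of the sketch below. This is our own rough approximation with assumed function names, not the authors' evaluation code; g is assumed to be pre-normalized against the full output set as in Algorithm 1:

    import numpy as np

    def bootstrap_ci_width(y, g, n_boot=1000, level=0.80, seed=0):
        """Width of a bootstrap confidence interval for the control variates
        estimate; y are human judgments and g the already-normalized automatic
        metric values on the same sampled outputs."""
        rng = np.random.default_rng(seed)
        y, g = np.asarray(y, dtype=float), np.asarray(g, dtype=float)
        n, estimates = len(y), []
        for _ in range(n_boot):
            idx = rng.integers(0, n, n)               # resample judged outputs
            yb, gb = y[idx], g[idx]
            alpha_hat = np.mean((yb - yb.mean()) * gb)
            estimates.append(np.mean(yb - alpha_hat * gb))
        lo, hi = np.quantile(estimates, [(1 - level) / 2, (1 + level) / 2])
        return hi - lo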

[Figure 5: Pearson correlation (ρ) between each automatic metric (VecSim, BLEU-2, METEOR, ROUGE-2, ROUGE-1, ROUGE-L) and human judgment, shown per system and for all systems combined. Panel (a): MS MARCO with the AnyCorrect prompt. Panel (b): CNN/Daily Mail with the Edit prompt.]

Figure 5: Correlations of different automatic metrics on the MS MARCO and CNN/Daily Mail tasks. Certain systems are more correlated with certain automatic metrics than others, but overall the correlation is low to moderate for most systems and metrics.

As we showed in Section 3, further gains are fundamentally limited by the quality of the evaluation prompts and automatic metrics. Figures 6a and 6b show how improving the quality of the evaluation prompt from a Likert-scale prompt for quality (Overall) to post-editing (Edit) noticeably decreases variance and hence allows better automatic metrics to increase data efficiency. Likewise, Figure 6c shows how using a better automatic metric (ROUGE-L instead of VecSim) also reduces variance.

Figure 6 also shows the conjectured confidence intervals if we were able to eliminate noise in human judgments (noiseless humans) or had an automatic metric that correlated perfectly with average human judgment (perfect metric). In particular, we use the mean of all (2–3) humans on each z for the perfect g(z) and use the mean of all humans on each z for the "noiseless" Y(z).

In both cases, we are able to significantly increase data efficiency (i.e. decrease estimator variance). With zero annotator variance and using existing automatic metrics, the data efficiency ranges from 1.42 to 1.69. With automatic metrics with perfect correlation and the current variance of human judgments, it ranges from 2.38 to 7.25. Thus, we conclude that it is important not only to improve our automatic metrics but also the evaluation prompts we use during human evaluation.

6 Related work

In this work, we focus on using existing automatic metrics to decrease the cost of human evaluations. There has been much work on improving the quality of automatic metrics. In particular, there is interest in learning models (Lowe et al., 2017a; Dusek et al., 2017) that are able to optimize for improved correlations with human judgment. However, in our experience, we have found that these learned automatic metrics have trouble generalizing to different systems. The framework we provide allows us to safely incorporate such models into evaluation, exploiting them when their correlation is high but not introducing bias when it is low.

Our key technical tool is control variates, a standard statistical technique used to reduce the variance of Monte Carlo estimates (Ripley, 2009). The technique has also been used in machine learning and reinforcement learning to lower variance estimates of gradients (Greensmith et al., 2004; Paisley et al., 2012; Ranganath et al., 2014). To the best of our knowledge, we are the first to apply this technique in the context of language evaluation.

Our work also highlights the importance of human evaluation. Chaganty et al. (2017) identified a similar problem of systematic bias in evaluation metrics in the setting of knowledge base population and also propose statistical estimators that rely on human evaluation to correct the bias. Unfortunately, their technique relies on having structured outputs (relation triples) that are shared between systems and does not apply to evaluating natural language generation. In a similar vein, Chang et al. (2017) dynamically collect human feedback to learn better dialog policies.

[Figure 6: 80% confidence interval width versus number of human judgments for three settings, comparing humans alone, humans plus an automatic metric (VecSim or ROUGE-1), noiseless humans plus VecSim, and humans plus a perfect metric. Panel (a): seq2seq on CNN/Daily Mail with the Overall prompt. Panel (b): seq2seq on CNN/Daily Mail with the Edit prompt. Panel (c): fastqa ext on MS MARCO with the AnyCorrect prompt.]

Figure 6: 80% bootstrap confidence interval length as a function of the number of human judgments used when evaluating the indicated systems on their respective datasets and prompts. (a) We see a modest reduction in variance (and hence cost) relative to human evaluation by using the VecSim automatic metric with the proposed control variates estimator to estimate Overall scores on the CNN/Daily Mail task; the data efficiency (DE) is 1.06. (b) By improving the evaluation prompt to use Edit instead, it is possible to further reduce variance relative to humans (DE is 1.15). (c) Another way to reduce variance relative to humans is to improve the automatic metric; here, using ROUGE-1 instead of VecSim improves the DE from 1.03 to 1.16.

7 Discussion

Prior work has shown that existing automatic metrics have poor instance-level correlation with mean human judgment and that they score many good-quality responses poorly. As a result, evaluation is systematically biased against genuine system improvements that would lead to higher human evaluation scores but not improve automatic metrics. In this paper, we have explored using an automatic metric to decrease the cost of human evaluation without introducing bias. In practice, we find that with current automatic metrics and evaluation prompts, data efficiencies are only 1.08–1.15 (a 7–13% cost reduction). Our theory shows that further improvements are only possible by improving the correlation of the automatic metric and reducing the annotator variance of the evaluation prompt. As an example of how evaluation prompts could be improved, we found that using post-edits of summaries decreased normalized annotator variance by a factor of three relative to using a Likert-scale survey. It should be noted that changing the evaluation prompt also changes the underlying ground truth f(z): it is up to us to find a prompt that still captures the essence of what we want to measure.

Without making stronger assumptions, the control variates estimator we proposed outlines the limitations of unbiased estimation. Where do we go from here? Certainly, we can try to improve the automatic metric (which is potentially as difficult as solving the task itself) and brainstorm alternative ways of soliciting evaluation (which has been less explored). Alternatively, we could give up on measuring absolute scores and seek instead to find techniques that stably rank methods and thus improve them. As the NLP community tackles increasingly difficult tasks, human evaluation will only become more important. We hope our work provides some clarity on how to make it more cost effective.

Reproducibility

All code, data, and experiments for this paper are available on the CodaLab platform at https://bit.ly/price-of-debiasing.

Acknowledgments

We are extremely grateful to the authors of the systems we evaluated for sharing their systems' output with us. We would also like to thank Urvashi Khandelwal and Peng Qi for feedback on an earlier draft of the paper, the crowdworkers on Amazon Mechanical Turk and TurkNation for their work and feedback during the data collection process, and the anonymous reviewers for their constructive feedback.

References

A. Chaganty, A. Paranjape, P. Liang, and C. Manning. 2017. Importance sampling for unbiased on-demand evaluation of knowledge base population. In Empirical Methods in Natural Language Processing (EMNLP).

C. Chang, R. Yang, L. Chen, X. Zhou, and K. Yu. 2017. Affordable on-line dialogue policy learning. In Empirical Methods in Natural Language Processing (EMNLP), pages 223–231.

J. M. Conroy and H. T. Dang. 2008. Mind the gap: Dangers of divorcing evaluations of summary content from linguistic quality. In International Conference on Computational Linguistics (COLING), pages 145–152.

H. T. Dang. 2006. Overview of DUC 2006. In Document Understanding Conference.

M. Denkowski and A. Lavie. 2014. Meteor Universal: Language specific translation evaluation for any target language. In Workshop on Statistical Machine Translation.

O. Dušek, J. Novikova, and V. Rieser. 2017. Referenceless quality estimation for natural language generation. arXiv.

E. Greensmith, P. L. Bartlett, and J. Baxter. 2004. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research (JMLR) 5:1471–1530.

K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. 2017. The NarrativeQA reading comprehension challenge. arXiv preprint arXiv:1712.07040.

A. Lavie and M. Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation 23.

C. Lin and M. Rey. 2004. Looking for a few good metrics: ROUGE and its evaluation. In NTCIR Workshop.

T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755.

A. Liu, S. Soderland, J. Bragg, C. H. Lin, X. Ling, and D. S. Weld. 2016a. Effective crowd annotation for relation extraction. In North American Association for Computational Linguistics (NAACL), pages 897–906.

C. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau. 2016b. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Empirical Methods in Natural Language Processing (EMNLP).

R. Lowe, M. Noseworthy, I. V. Serban, N. Angelard-Gontier, Y. Bengio, and J. Pineau. 2017a. Towards an automatic Turing test: Learning to evaluate dialogue responses. In Association for Computational Linguistics (ACL).

R. T. Lowe, N. Pow, I. Serban, L. Charlin, C. Liu, and J. Pineau. 2017b. Training end-to-end dialogue systems with the Ubuntu dialogue corpus. Dialogue and Discourse 8.

C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In ACL System Demonstrations.

V. Mnih, C. Szepesvári, and J. Audibert. 2008. Empirical Bernstein stopping. In International Conference on Machine Learning (ICML).

R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.

T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Workshop on Cognitive Computing at NIPS.

J. Novikova, O. Dušek, A. C. Curry, and V. Rieser. 2017. Why we need new evaluation metrics for NLG. In Empirical Methods in Natural Language Processing (EMNLP).

M. Pagliardini, P. Gupta, and M. Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv.

J. Paisley, D. M. Blei, and M. I. Jordan. 2012. Variational Bayesian inference with stochastic search. In International Conference on Machine Learning (ICML), pages 1363–1370.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Association for Computational Linguistics (ACL).

R. J. Passonneau and B. Carpenter. 2014. The benefits of a model of annotation. In Association for Computational Linguistics (ACL).

R. Paulus, C. Xiong, and R. Socher. 2018. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations (ICLR).

R. Ranganath, S. Gerrish, and D. Blei. 2014. Black box variational inference. In Artificial Intelligence and Statistics (AISTATS), pages 814–822.

B. D. Ripley. 2009. Stochastic Simulation. John Wiley & Sons.

A. See, P. J. Liu, and C. D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Association for Computational Linguistics (ACL).

M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Association for Machine Translation in the Americas, pages 223–231.

C. Tan, F. Wei, N. Yang, W. Lv, and M. Zhou. 2018. S-Net: From answer extraction to answer generation for machine reading comprehension. In Association for the Advancement of Artificial Intelligence (AAAI).

R. Vedantam, C. L. Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Computer Vision and Pattern Recognition (CVPR), pages 4566–4575.

D. Weissenborn, G. Wiese, and L. Seiffe. 2017. Making neural QA as simple as possible but not simpler. In Computational Natural Language Learning (CoNLL).

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
