
Asking and Answering Questions to Evaluate the Factual Consistency of Summaries

Alex Wang∗ (New York University)
Kyunghyun Cho (New York University, Facebook AI)
Mike Lewis (Facebook AI)

arXiv:2004.04228v1 [cs.CL] 8 Apr 2020

Abstract

Practical applications of abstractive summarization models are limited by frequent factual inconsistencies with respect to their input. Existing automatic evaluation metrics for summarization are largely insensitive to such errors. We propose an automatic evaluation protocol called QAGS (pronounced "kags") that is designed to identify factual inconsistencies in a generated summary. QAGS is based on the intuition that if we ask questions about a summary and its source, we will receive similar answers if the summary is factually consistent with the source. To evaluate QAGS, we collect human judgments of factual consistency on model-generated summaries for the CNN/DailyMail (Hermann et al., 2015) and XSUM (Narayan et al., 2018) summarization datasets. QAGS has substantially higher correlations with these judgments than other automatic evaluation metrics. Also, QAGS offers a natural form of interpretability: The answers and questions generated while computing QAGS indicate which tokens of a summary are inconsistent and why. We believe QAGS is a promising tool in automatically generating usable and factually consistent text.

1 Introduction

Automatic summarization aims to produce summaries that are succinct, coherent, relevant, and — crucially — factually correct. Recent progress in conditional text generation has led to models that can generate fluent, topical summaries (Lewis et al., 2019). However, model-generated summaries frequently contain factual inconsistencies, limiting their applicability (Kryscinski et al., 2019a).

The problem of factual inconsistency is due in part to the lack of automatic evaluation metrics that can detect such errors. Standard metrics for evaluating generated text are predominantly based on counting n-grams, which weigh all n-grams equally and are insensitive to semantic errors. This inadequacy leaves human evaluation as the primary method for evaluating factual consistency, which has been noted to be challenging even for humans (Daume III and Marcu, 2005; Kryscinski et al., 2019b), in addition to being slow and costly.

We argue that evaluation metrics that are able to capture subtle semantic errors are required to build better models. In this work, we introduce a general framework for evaluating conditional text generation that is designed to detect factual inconsistencies in generated text with respect to some input. Our framework consists of three steps: (1) Given a generated text, a question generation (QG) model generates a set of questions about the text. (2) We then use question answering (QA) models to answer these questions given both the input and the generated text. (3) A quality score is computed based on the similarity of corresponding answers.

This approach leverages recent progress in QA and QG to ask and answer human readable, on-topic questions (Devlin et al., 2019; Song et al., 2019). It only assumes access to a question answering dataset to train the QG and QA models, and is applicable to any modality where a QA model is available, e.g. text, images, or knowledge graphs.

We use this framework to develop QAGS (Question Answering and Generation for Summarization), a metric for evaluating the factual consistency of abstractive document summaries. Compared to commonly used automatic metrics such as ROUGE (Lin, 2004), QAGS shows dramatically higher correlations with human judgments of factuality, for example achieving a Pearson correlation coefficient of 54.52 on the CNN/DailyMail summarization task, compared to 17.72 for ROUGE-2. QAGS also achieves new state-of-the-art results on evaluating the factuality of summaries, outperforming recently proposed NLI models for this task (Kryscinski et al., 2019b).

Finally, we analyse the robustness of QAGS through an ablation study. QAGS shows robustness to the quality of the underlying QG and QA models, the domain of the models, and the number of questions asked. Even under the worst ablation settings, QAGS still has stronger correlation with human judgments than other automatic metrics.

Overall, we contribute the following: (1) We introduce QAGS, an automatic model-based evaluation metric for measuring the factual consistency of model-generated text. (2) We collect a new set of human judgments of factual consistency of model-generated summaries for two summarization datasets. We demonstrate that QAGS correlates with these judgments significantly better than other automatic metrics. (3) We show via ablations that QAGS is robust to a number of factors including underlying model quality and domain mismatch. (4) We analyze the questions and answers produced in computing QAGS to illustrate which parts of summaries are inconsistent. (5) We will release models and code to compute QAGS.

2 Background: Automatically Evaluating Machine Generated Text

Standard approaches to evaluating generated text are primarily based on counting n-gram overlap. These methods assume access to one or more reference texts, and score a generated summary based on the precision and recall of all reference n-grams in the generated summary. We briefly describe the most common metrics in this family, and refer readers to Liu et al. (2016) for further discussion.

ROUGE (Lin, 2004) was developed specifically for evaluating automatic summarization, and its variants are the de facto standard for such. The most common variant is ROUGE-n (typically n ∈ {1, 2}), which computes the F1 score for all reference n-grams in the generated summary. ROUGE-L, another commonly used variant, is the length of the longest common subsequence (possibly non-consecutive) between a summary and references.

BLEU (Papineni et al., 2002) is closely related to ROUGE but was developed for machine translation. BLEU computes the precision of the reference n-grams in the generated summary. METEOR (Lavie and Agarwal, 2007) extends BLEU by using an alignment between the generated text and a reference, as well as using stemming and synonym replacement for more flexible n-gram matching.

We identify two key deficiencies when using these n-gram based evaluation metrics to detect factual inconsistencies in generated text.

First, these metrics require one or more reference texts to compare against. Obtaining references can be expensive and challenging, and as such many text generation datasets contain only a single reference. This problem is exacerbated with high-entropy generation tasks, such as summarization or dialogue, where there is a very large number of acceptable outputs. In these settings, comparing against a single reference is woefully inadequate.

Second, given a reference to compare against, n-gram based approaches weigh all portions of the text equally, even when only a small fraction of the n-grams carry most of the semantic content. Factual inconsistencies caused by minor changes may be drowned out by otherwise high n-gram overlap, making these metrics insensitive to these errors. For example, the sentences "I am writing my paper in Vancouver." and "I am not writing my paper in Vancouver." share nearly all unigrams and bigrams despite having the opposite meaning.

3 A Framework for Automatically Evaluating Factual Consistency

We introduce a framework for automatically detecting factual inconsistencies in generated text while also addressing the deficiencies of current approaches. Let X and Y be sequences of tokens coming from a vocabulary V, where X is a source text and Y is a summary of X. We define p(Q|Y) as a distribution over all possible questions Q given summary Y, and p(A|Q, X) and p(A|Q, Y) as distributions over all possible answers A to a particular question Q given either the source X or the summary Y. We constrain the questions Q and answers A to also be sequences of tokens from V. Then the factual consistency of the summary Y is

    E_{Q ∼ p(Q|Y)} [ D( p(A|Q, X), p(A|Q, Y) ) ],    (1)

where D is some function measuring the similarity of the two answer distributions. This expression is maximized when Y contains a subset of the information in X such that it produces the same answer for any question from p(Q|Y). This happens trivially when Y = X, e.g. we take X as its own summary, but we usually have other desiderata of Y such that this solution is undesirable.
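To make Equation 1 concrete, the sketch below estimates the expectation by generating a fixed number of questions from the summary and averaging a pluggable answer-similarity function D, treating each QA model's most likely answer as a point estimate of its answer distribution (which is how the paper itself later instantiates D). This is a minimal illustration rather than the authors' released implementation; `generate_questions`, `answer`, and `similarity` are hypothetical stand-ins for the QG model, the QA model, and D.

```python
from typing import Callable, List

def factual_consistency(
    source: str,
    summary: str,
    generate_questions: Callable[[str, int], List[str]],  # stand-in for Q ~ p(Q|Y)
    answer: Callable[[str, str], str],                     # stand-in for arg max_A p(A|Q, context)
    similarity: Callable[[str, str], float],               # stand-in for D
    num_questions: int = 20,
) -> float:
    """Estimate E_{Q ~ p(Q|Y)}[ D(p(A|Q, X), p(A|Q, Y)) ] from Equation 1."""
    questions = generate_questions(summary, num_questions)  # questions are conditioned on the summary Y
    scores = []
    for question in questions:
        answer_from_source = answer(question, source)    # answer using the source X
        answer_from_summary = answer(question, summary)  # answer using the summary Y
        scores.append(similarity(answer_from_source, answer_from_summary))
    return sum(scores) / len(scores) if scores else 0.0
```

As discussed below, exactly computing this expectation is intractable, and random sampling of questions has high variance, so QAGS uses highly probable questions obtained with beam search rather than random samples.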
[Figure 1 appears here: a worked example showing a source article about a Leeds vs. Castleford match, the summary generated from it, and three questions generated from the summary (e.g. "Who scored their first try of the season?", answered from the source article as "Joel Moon" but from the summary as "Kevin Sinfield").]

Figure 1: Overview of QAGS. A set of questions is generated based on the summary. The questions are then answered using both the source article and the summary. Corresponding answers are compared using a similarity function and averaged across questions to produce the final QAGS score.
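The similarity function in Figure 1 is instantiated later (Section 4) as token-level F1 between the two extracted answer spans, the standard comparison for extractive QA. A minimal sketch of such a comparison follows; whitespace tokenization and lowercasing are simplifying assumptions rather than the paper's exact normalization.

```python
from collections import Counter

def token_f1(answer_a: str, answer_b: str) -> float:
    """Token-level F1 between two answer strings (SQuAD-style overlap)."""
    tokens_a = answer_a.lower().split()   # assumption: simple whitespace tokenization
    tokens_b = answer_b.lower().split()
    if not tokens_a or not tokens_b:
        # Two empty answers count as a match; an empty vs. non-empty answer does not.
        return float(tokens_a == tokens_b)
    common = Counter(tokens_a) & Counter(tokens_b)  # multiset intersection of tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(tokens_a)
    recall = num_same / len(tokens_b)
    return 2 * precision * recall / (precision + recall)
```

A function like this can serve as the `similarity` argument in the earlier sketch, so that averaging it over the generated questions yields a QAGS-style score.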

This framework addresses the two issues with n-gram based approaches. Instead of requiring a reference to compare against, our framework asks questions based on the generation itself, and compares answers with the provided source text. Also, the use of questions focuses the metric on the semantically relevant parts of the generated text, rather than weighting all parts of the text equally.

In practice, exactly computing the expectation in Equation 1 is intractable due to the large space of possible questions. One potential workaround is to randomly sample questions from p(Q|Y), but this suffers from high variance and requires many samples to obtain a good estimate. Instead, we focus on producing highly probable questions, e.g. as produced by beam search, which may be biased in the limit, but will require fewer questions to estimate because of the higher quality of the questions.

4 QAGS

Using this framework requires specifying the question distribution p(Q|Y), the answer distribution p(A|Q, Y) (or X), and the answer similarity function D. We apply this framework to summarization to develop QAGS and describe our instantiations of these components.

Question Generation To instantiate p(Q|Y), we draw on recent work on automatic question generation (QG), which models this distribution using neural seq2seq models (Du et al., 2017; Krishna and Iyyer, 2019). We over-sample questions, and then filter out low-quality questions as follows.

First, we train and generate from answer-conditional QG models: The model receives both the answer and the source article, and is trained to maximize the likelihood of the paired question. At test time, we extract named entities and noun phrases as answer candidates using spaCy (https://spacy.io/api/entityrecognizer).

Second, we filter out low-quality questions using a number of heuristics, such as duplicates and questions less than three tokens long. We also found it useful to run the QA model (see next section) on all of the candidate questions, and filter out questions for which the QA model predicted no answer.

Question Answering We instantiate the answer distributions p(A|Q, ∗) as extractive QA models, for simplicity. We use extractive QA because we assume the facts are represented as text spans in the article and summary. Future work should explore using abstractive QA models, which could match paraphrases of the same answer.

Answer Similarity We use token-level F1 to compare answers, which is standard for extractive QA and equivalent to defining D as

    F1(arg max p(A|Q, X), arg max p(A|Q, Y)).

The QAGS Score Given these components, we obtain the QAGS score of a generation by (1) generating K questions conditioned on the summary, (2) answering the questions using both the source article and the summary to get two sets of answers, (3) comparing corresponding answers using the answer similarity metric, and (4) averaging the answer similarity metric over all questions. We depict this process in Figure 1.

5 Experiments

5.1 Human Evaluation

We test whether QAGS accurately measures the factual consistency of a summary with respect to a source article by computing correlations with human judgments of factual consistency.

Datasets We evaluate on two abstractive summarization datasets, CNN/Daily Mail (CNNDM; Hermann et al., 2015; Nallapati et al., 2016) and XSUM (Narayan et al., 2018). Abstractive summarization is particularly interesting because factual consistency with the original text is crucial to usability, and a lack of such consistency has plagued abstractive neural summarization models (Cao et al., 2018; Falke et al., 2019; Kryscinski et al., 2019b, i.a.).

CNN/DM is a standard dataset for summarization that consists of CNN and DailyMail articles. Each reference summary consists of the concatenation of three editor-written, bullet-point highlights. For summaries, we use 235 test outputs from Gehrmann et al. (2018).

XSUM was created by taking the first sentence of a news article as the summary, and using the rest of the article as the source. Consequently, XSUM summaries are significantly more abstractive than those of CNN/DM, and extractive summarization models perform poorly on this dataset.

Metric       CNN/DM    XSUM
ROUGE-1       28.74    13.22
ROUGE-2       17.72     8.95
ROUGE-L       24.09     8.86
METEOR        26.65    10.03
BLEU-1        29.68    11.76
BLEU-2        25.65    11.68
BLEU-3        23.96     8.41
BLEU-4        21.45     5.64
BERTScore     27.63     2.51
QAGS          54.53    17.49

Table 1: Summary-level Pearson correlation coefficients between various automatic metrics and human judgments of correctness for summarization datasets. QAGS obtains substantially higher correlations than all other automatic metrics.

We found that while the XSUM summaries are more abstractive, frequently there are facts (e.g. first names) in the summary that are not available in the "article". This quirk made it especially difficult for humans and QAGS to tell when factual errors were being made by the summarization model. To remedy this, for human evaluation and QAGS, we prepend the summary back to the "article". We use a subset of 239 test outputs from BART fine-tuned on XSUM (Lewis et al., 2019).

Annotation Protocol We collect human judgments on Amazon Mechanical Turk (https://www.mturk.com/) via ParlAI (Miller et al., 2017). We present summaries one sentence at a time, along with the entire article. For each summary sentence, the annotator makes a binary decision as to whether the sentence is factually consistent with the article. Workers are instructed to mark non-grammatical sentences as not consistent, and copies of article sentences as consistent. Workers are paid $1 per full summary annotated. See Appendix A for further details.

We collect 3 annotations per summary. To obtain a single "correctness" score per summary, we first take the majority vote for each sentence, then average the binary scores across summary sentences. Inter-annotator agreement as measured by Krippendorff's α is 0.51 and 0.34 for CNN/DM and XSUM, respectively, indicating "moderate" and "fair" agreement (Ageeva et al., 2015). While not perfect, these agreement numbers are in line with similar figures from previous work on summarization evaluation (Daume III and Marcu, 2005).
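The aggregation just described (a majority vote over the three annotators for each sentence, then an average of the resulting binary labels) can be written compactly. The nested-list input format below is an assumption made for illustration only.

```python
from typing import List

def correctness_score(annotations: List[List[int]]) -> float:
    """Summary-level correctness from per-sentence binary annotations.

    annotations[i] holds the 0/1 factual-consistency judgments that the
    annotators gave for sentence i (three per sentence in the paper's setup).
    """
    sentence_labels = []
    for votes in annotations:
        majority = 1 if sum(votes) * 2 > len(votes) else 0  # strict majority vote per sentence
        sentence_labels.append(majority)
    return sum(sentence_labels) / len(sentence_labels)      # average across summary sentences

# A three-sentence summary where annotators disagree on the last sentence:
print(correctness_score([[1, 1, 1], [1, 1, 0], [0, 0, 1]]))  # -> 0.666...
```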
5.2 Experimental Details

Question Generation We use fairseq (Ott et al., 2019) to fine-tune a pretrained BART language model on NewsQA (Trischler et al., 2017), a dataset consisting of CNN articles and crowd-sourced questions. For each summary, we use 10 answer candidates and generate questions using beam search with width 10, for a total of 100 question candidates. After filtering, we use the K = 20 most probable questions. If a summary has too few filtered questions, we randomly sample questions to reach the required number. For details, see Appendix B.

Question Answering We train QA models by fine-tuning BERT (Devlin et al., 2019) on SQuAD2.0 (Rajpurkar et al., 2018). We use the large-uncased BERT variant via the transformers library (Wolf et al., 2019).

Baselines We compare against a number of automatic evaluation metrics: ROUGE (Lin, 2004), METEOR (Lavie and Agarwal, 2007), BLEU (Papineni et al., 2002), and BERTScore (Zhang et al., 2019). The latter uses BERT representations to compute an alignment between generation and reference tokens, which is then used to compute a soft version of unigram F1. We use the large-uncased BERT variant.

5.3 Results

We present results in Table 1. QAGS strongly outperforms other automatic evaluation metrics in terms of correlation with human judgments of factual consistency. BLEU and ROUGE perform comparably, and lower-order n-gram metrics work better. BERTScore matches the best n-gram metrics on CNN/DM, but is the worst overall on XSUM.

On CNN/DM, QAGS obtains nearly twice the correlation of the next best automatic metric (BLEU-1). We speculate that this large increase is due to the sensitivity of the QA model to the sentence-fusing behavior exhibited in many summarization models trained on CNN/DM (Lebanoff et al., 2019). When two sentences are fused to produce an incorrect summary statement, the QA model produces different answers when using the source article than when using the summary.

On XSUM, all metrics correlate worse with human judgments than on CNN/DM, which reflects the fact that XSUM is more abstractive. QAGS still outperforms the next best automatic metric.

5.4 Ablations

A potential issue with model-based evaluation is that the quality of the evaluation metric may depend heavily on specific hyperparameter settings. We explore whether this is true with QAGS by performing ablations on several factors.

Model Quality We first consider the degree to which the quality of the underlying models impacts their evaluation capabilities.

For QA quality, we answer this question by training QA models of varying quality by fine-tuning different versions of BERT on SQuAD. We present results in Table 2. The QA models perform similarly despite substantially different performances on the SQuAD development set. Surprisingly, using the best QA model (bert-large-wwm) does not lead to the best correlations with human judgments. On CNN/DM, bert-large-wwm slightly underperforms bert-base and bert-large. On XSUM, bert-base slightly outperforms the other two BERT variants. These results indicate that QAGS is fairly robust to the quality of the underlying QA model, though we note that BERT is a strong QA baseline, and using weaker QA models might lead to larger performance dropoffs.

QA model         SQuAD (F1)   CNN/DM (Pear.)   XSUM (Pear.)
bert-base             75.95            55.20          20.71
bert-large            81.57            54.53          17.49
bert-large-wwm        84.36            51.36          18.07

Table 2: Pearson correlations between human judgments of factual consistency and QAGS using QA models of different qualities, as measured by performance on the SQuAD2.0 development set (F1). The correlations are stable across QA model quality.

To ablate QG quality, we use models with increasing perplexity on the NewsQA development set. Results in Table 3 show that QAGS is robust to the QG model quality, with some decrease in correlation with human judgments as perplexity increases on CNN/DM, and no clear trend on XSUM. Even the weakest QG model still significantly outperforms all other automatic metrics in Table 1.

NewsQA (ppl.)   CNN/DM (Pear.)   XSUM (Pear.)
         5.48            54.53          17.49
         9.50            50.09          19.93
        18.56            47.92          16.38

Table 3: Pearson correlations between human judgments of factual consistency and QAGS with QG models of varying quality, as measured by perplexity on the NewsQA development set. We see some decrease in correlation on CNN/DM as QG perplexity increases, though we do not see a similar trend for XSUM.

Domain Effects Our approach relies on having a labeled dataset to train QG and QA models. However, for relatively niche domains, such a labeled QA/QG dataset may not exist. Instead, we may need to resort to using models trained on out-of-domain data, leading to domain shift effects that negatively impact the quality of the QAGS scores. We simulate this setting by fine-tuning the QG model on SQuAD, which is of similar size to NewsQA but drawn from Wikipedia articles rather than CNN articles, which exactly match the genre of the summarization datasets.

Evaluating with this QG model, we get correlations of 51.53 and 15.28 with human judgments on CNN/DM and XSUM respectively, versus 54.53 and 17.49 when using the NewsQA-tuned QG model. The drop in performance indicates a negative domain shift effect. However, using the SQuAD-tuned QG model still substantially outperforms all other automatic metrics, again pointing to the robustness of QAGS.

Number of Questions Next, we investigate the correlation with human judgments when varying the number of questions used. Results in Table 4 show that increasing the number of questions used improves correlations with human judgments. We observe a large increase when moving from 10 to 20 questions, and a smaller increase from 20 to 50 questions, indicating decreasing marginal benefit moving beyond 50 questions. With just 5 questions, QAGS still substantially outperforms other automatic metrics, indicating its robustness.

# Questions   CNN/DM    XSUM
          5    41.61   15.63
         10    41.17   15.49
         20    54.53   17.49
         50    57.94   17.74

Table 4: Pearson correlation coefficients between QAGS scores with varying number of questions and human judgments of correctness for summarization datasets. The correlation increases with the number of questions used, but with decreasing marginal benefit.

Answer Similarity Metric Finally, we consider using exact match as an alternative answer similarity metric. Exact match is another common evaluation metric for extractive QA, and is more restrictive than F1. When using EM, we obtain Pearson correlations with human judgments of 45.97 and 18.10 on CNN/DM and XSUM, as opposed to 54.53 and 17.49 when using F1.

6 Re-ranking with QAGS

Several works explore the use of natural language inference (NLI) models to detect factual consistency in generated text (Welleck et al., 2019; Falke et al., 2019). We compare against these methods by evaluating on the sentence ranking experiment from Falke et al. (2019). The experiment uses 373 triplets of source sentences from CNN/DM and two summary sentences generated from the model from Chen and Bansal (2018). One summary sentence is factually consistent with the source sentence, and the other is inconsistent. A metric (or model) is evaluated based on how often it ranks the consistent sentence higher than the inconsistent sentence.

Model/Metric   % Correct (↑)
Random                 50.0%
BERT NLI               64.1%
ESIM                   67.6%
FactCC                 70.0%
QAGS                   72.1%

Table 5: Results on the sentence ranking task from Falke et al. (2019). Results using BERT NLI and ESIM are from Falke et al. (2019); FactCC is from Kryscinski et al. (2019b). QAGS outperforms previous work.

We present the results in Table 5. Results using two NLI models fine-tuned on MultiNLI (Williams et al., 2018), BERT NLI and ESIM (Chen et al., 2017), are from Falke et al. (2019). FactCC (Kryscinski et al., 2019b) is an NLI-based fact-checking model that is trained on a dataset tailor-made for detecting factual inconsistencies in generated text. QAGS outperforms these methods, while requiring no special supervision for this task.
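This ranking protocol reduces to counting how often a metric scores the consistent sentence above the inconsistent one. A minimal sketch of the evaluation is below, with `score` standing in for any metric or model (QAGS, an NLI classifier, etc.) that maps a (source sentence, summary sentence) pair to a number; the triplet format is an assumption for illustration.

```python
from typing import Callable, List, Tuple

# (source sentence, consistent summary sentence, inconsistent summary sentence)
Triplet = Tuple[str, str, str]

def ranking_accuracy(triplets: List[Triplet],
                     score: Callable[[str, str], float]) -> float:
    """Fraction of triplets where the consistent sentence outscores the inconsistent one."""
    correct = 0
    for source, consistent, inconsistent in triplets:
        if score(source, consistent) > score(source, inconsistent):
            correct += 1
    return correct / len(triplets)
```

A metric that scores sentences at random is expected to be correct about half the time, matching the 50.0% "Random" row in Table 5.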
Article: On Friday, 28-year-old Usman Khan reportedly stabbed several people at Fishmongers’ Hall
in London with a large knife, then fled up London Bridge. Members of the public confronted him; one
man sprayed Khan with a fire extinguisher, others struck him with their fists and took his knife, and
another, a Polish chef named Łukasz, harried him with a five-foot narwhal tusk. [. . . ]
Summary : On Friday afternoon , a man named Faisal Khan entered a Cambridge University building
and started attacking people with a knife and a fire extinguisher .
Question 1: What did the attacker have ?
Article answer: a large knife Summary answer: a knife and a fire extinguisher
Question 2: When did the attack take place ?
Article answer: Friday Summary answer: Friday afternoon
Question 3: What is the attacker’s name ?
Article answer: Usman Khan Summary answer: Faisal Khan
Question 4: Where did the attack take place ?
Article answer: Fishmongers’ Hall Summary answer: Cambridge University building
Article: In findings published on Wednesday in the journal PLOS ONE, an international team of
scientists report ancient Egyptians captured sacred ibises (Threskiornis aethiopicus) from the wild for
use in ritual sacrifice rather than domesticating the birds. [. . . ] The team collected DNA samples from
mummified birds collected from six separate catacombs including sites at Abydos, Saqqara, and Tuna
el-Gebel with permission from the Egyptian Ministry of State for Antiquity, and several museums
offered to send tissue samples from the mummified ibises in their collections. [. . . ]
Summary : Archaeologists have used DNA samples from ancient ibis birds to determine whether the
birds were domesticated or sacrificed in ancient Egypt
Question 1: Archaeologists have used what to determine whether the birds were domesticated ?
Article Answer: hatchery structures Summary Answer: DNA samples
Question 2: Who used DNA samples to determine whether the birds were domesticated ?
Article Answer: [NO ANSWER] Summary Answer: Archaeologists
Question 3: What are archeologists using to determine whether the birds were domesticated ?
Article Answer: DNA samples Summary Answer: DNA samples
Question 4: Where were the birds found?
Article Answer: six separate catacombs Summary Answer: ancient Egypt

Table 6: Example questions and answers generated when computing QAGS. The questions are overwhelmingly
fluent and relevant. The answers indicate which tokens in the summary are factually consistent or inconsistent.

7 Qualitative Analysis

Interpreting QAGS The questions and answers produced in computing QAGS are directly interpretable, and highlight errors in summaries. We present examples of articles, summaries, and QAGS questions and answers in Table 6.

On the first example (Table 6, top), QAGS detects several factual inconsistencies in the generated summary: The summary mistakes the first name of the attacker, the location of the attack, and the weapons used. Because the QG model focuses on these details, QAGS is able to correctly penalize the summary for its hallucinations. Because the answer candidates used are mostly named entities and noun phrases, QAGS is particularly effective at detecting errors of this kind. Using more diverse answer candidates may broaden the set of inconsistencies that QAGS is able to detect.

The second example (Table 6, bottom) illustrates failure modes of QAGS. For example, the QA model incorrectly marks question 2 as unanswerable. On question 4, both answers produced are correct, but because they have no common tokens, they are marked inconsistent by QAGS.

Error Analysis The interpretability of QAGS allows for error analysis on the metric. We manually annotate 400 triplets of generated questions, article answers, and summary answers that are produced in computing QAGS on the XSUM summaries, and label them by the quality of the generated questions, predicted answers, and answer similarity scores.

Among the generated questions, 8.75% are nonsensical, while 3.00% are well-formed but unanswerable using the generated summary they were conditioned upon. These figures indicate that the vast majority of questions are understandable and on-topic.
We frequently observe multiple questions with slightly different wordings, which is likely due to the low number of answer candidates in XSUM summaries (which are one sentence long) and due to beam search. 8.25% of questions are well-formed but unanswerable using the source, which is usually due to a hallucinated fact in the summary that the QG model turns into a question.

Among predicted answers, 1.75% of questions are potentially answerable using the summary, but are incorrectly answered. This percentage increases to 32.50% for the article, which indicates that the transfer ability of the QA model is lacking. In a small number of cases, we found that while a question had a single answer in the summary, it could have multiple answers in the article.

Finally, for 8.00% of the examples, the question is answered correctly using both the article and summary, but the answers have high lexical variation such that F1 score fails to detect their similarity. While this happens in a relatively small number of cases, exploring similarity metrics other than n-gram based approaches could be useful.

Limitations We emphasize that QAGS and our overall framework are specifically designed to detect factual inconsistencies in generated summaries relative to the source article. QAGS does not measure other desirable properties of generated text, including fluency, readability, or factual recall. We therefore recommend using QAGS in conjunction with complementary evaluation metrics.

The choices of QG and QA models in QAGS are particular to abstractive summarization and may require adaptation to be used for other conditional text generation tasks. For example, we expect that extractive summarization models may obtain nearly perfect QAGS scores because facts and statements are directly copied from the source article.

8 Related Work

Automatic summarization and its evaluation are long-standing lines of work in NLP, dating at least as far back as the Document Understanding Conferences (Chali and Kolla, 2004). The primary evaluation metric then and now is ROUGE (Lin, 2004), though much work has demonstrated the limited ability of ROUGE and its relatives to evaluate summaries (Dorr et al., 2004; Liu and Liu, 2009; Kedzie et al., 2018, i.a.). Other metrics have focused on specific aspects of summarization quality, including content selection (Nenkova and Passonneau, 2004), relevance prediction (Daume III and Marcu, 2005), and many more.

There has been a recent resurgence of work leveraging NLU models for evaluating the factuality of generated text. Goodrich et al. (2019) use information extraction models to measure factual overlap, but facts are restricted to pre-defined schemas. Falke et al. (2019) investigate the use of NLI models to evaluate the factual correctness of CNN/DM summaries, and conclude that current NLI models are too brittle to be reliably used in this manner. Kryscinski et al. (2019b) train an NLI-based fact-checking model by building a dataset of factual inconsistencies based on noise heuristics. Our QA approach allows a finer-grained analysis, because NLI operates on complete sentences, whereas QAGS can ask many questions about the same sentence.

Most relatedly, Eyal et al. (2019) and Scialom et al. (2019) use QA models to evaluate summarization. We diverge from these works in two important ways. First, both works use Cloze-style questions, which are generated by masking entities in either the source document or the reference summary. We instead generate the questions with a model, allowing a much greater range of questions. Second, we produce questions conditioned on the generated summary, rather than the reference summary or source article. Producing questions from the generated summary is more appropriate for verifying the accuracy of the text, whereas using the reference or source measures content selection.

9 Conclusion

We introduce a framework for automatically detecting factual inconsistencies in conditionally generated texts and use this framework to develop QAGS, a metric for measuring inconsistencies in abstractive summarization. QAGS correlates with human judgments of factuality significantly better than standard automatic evaluation metrics for summarization, and outperforms related NLI-based approaches to factual consistency checking. QAGS is naturally interpretable: The questions and answers produced in computing QAGS indicate which tokens in a generated summary are inconsistent and why. Error analysis shows that future work should explore improved QA models. Our approach can also be applied to diverse modalities, such as translation and image captioning. Overall, we believe QAGS is useful in quantifying and incentivizing factually consistent text generation.
References

Ekaterina Ageeva, Mikel L. Forcada, Francis M. Tyers, and Juan Antonio Pérez-Ortiz. 2015. Evaluating machine translation for assimilation via a gap-filling task. In Proceedings of the 18th Annual Conference of the European Association for Machine Translation, pages 137–144, Antalya, Turkey.

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.

Yllias Chali and Maheedhar Kolla. 2004. Summarization techniques at DUC 2004. In Proceedings of the Document Understanding Conference.

Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686.

Hal Daume III and Daniel Marcu. 2005. Bayesian summarization at DUC and a suggestion for extrinsic evaluation. In Proceedings of the Document Understanding Conference, DUC-2005, Vancouver, USA.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Bonnie Dorr, Christof Monz, Douglas Oard, David Zajic, and Richard Schwartz. 2004. Extrinsic evaluation of automatic metrics for summarization. Technical report, University of Maryland Institute for Advanced Computer Studies.

Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1342–1352.

Matan Eyal, Tal Baumel, and Michael Elhadad. 2019. Question answering as an automatic evaluation metric for news article summarization. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3938–3948.

Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 2214–2220.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898.

Sebastian Gehrmann, Yuntian Deng, and Alexander Rush. 2018. Bottom-up abstractive summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4098–4109.

Ben Goodrich, Vinay Rao, Peter J. Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, pages 166–175, New York, NY, USA. ACM.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751.

Chris Kedzie, Kathleen McKeown, and Hal Daume III. 2018. Content selection in deep learning models of summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1818–1828.

Kalpesh Krishna and Mohit Iyyer. 2019. Generating question-answer hierarchies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2321–2334, Florence, Italy. Association for Computational Linguistics.

Wojciech Kryscinski, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019a. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Volume 1 (Long and Short Papers).

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2019b. Evaluating the factual consistency of abstractive text summarization.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231. Association for Computational Linguistics.

Logan Lebanoff, John Muchovej, Franck Dernoncourt, Doo Soon Kim, Seokhwan Kim, Walter Chang, and Fei Liu. 2019. Analyzing sentence fusion in abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pages 104–110.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Chia-Wei Liu, Ryan Lowe, Iulian Serban, Mike Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2122–2132.

Feifan Liu and Yang Liu. 2009. Exploring correlation between ROUGE and human evaluation on meeting summaries. IEEE Transactions on Audio, Speech, and Language Processing, 18(1):187–196.

Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization.

Alexander Miller, Will Feng, Dhruv Batra, Antoine Bordes, Adam Fisch, Jiasen Lu, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 79–84.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çağlar Gulçehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany. Association for Computational Linguistics.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.

Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. Fairseq: A fast, extensible toolkit for sequence modeling. NAACL HLT 2019, page 48.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

Gabriel Pereyra, George Tucker, Jan Chorowski, Lukasz Kaiser, and Geoffrey Hinton. 2017. Regularizing neural networks by penalizing confident output distributions.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789.

Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers unite! Unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3237–3247.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926–5936.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2017. NewsQA: A machine comprehension dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 191–200.

Sean Welleck, Jason Weston, Arthur Szlam, and Kyunghyun Cho. 2019. Dialogue natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3731–3741, Florence, Italy. Association for Computational Linguistics.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.
A Human Evaluation Task Design

We restrict our pool of workers to US-based workers. Workers are required to have at least 1000 approved HITs with an acceptance rate of at least 98%.

The base reward for our task is $0.15. For each summary, we include automatic quality checks, including:

• Time checks: workers who complete the task in under 30s fail the check
• Attention checks: we include exact copies of article sentences and corrupted mixtures of two article sentences as positive and negative control tasks. If a worker fails to answer both of these examples correctly, they fail the check
• Explanation checks: for each sentence in the summary, the worker is required to provide a short explanation of their decision

If a worker passes all checks, they are awarded a $0.85 bonus, totalling $1.00 per correct annotation. According to turkerview.com, workers on our HIT are paid well in excess of $15.00 on average.

We show our annotation interfaces for the annotation task for CNN/DM and XSUM in Figures 2 and 3, respectively. We use slightly different instructions to accommodate the quirks of each dataset. For XSUM, we prepend the reference "summary" back onto the source article, as without it, workers were struggling to identify factual inconsistencies.

B Model and Generation Details

Question Generation We fine-tune BART for question generation using the same tuning hyperparameters as the original work. We optimize label-smoothed cross entropy with smoothing parameter 0.1 (Pereyra et al., 2017) and a peak learning rate of 2e-5. We optimize for 100k steps with 5k warmup steps, and use the model with the best perplexity on the development set.

To turn NewsQA into an answer-conditional QG dataset, we concatenate the answer to the source article with a special marker token in between. We then concatenate another special marker token and the question. At test time, we get 10 named entities and noun phrases as answer candidates using the en-web-sm spaCy model, downsampling if there are more than 10 and randomly duplicating some answers if there are fewer than 10. The model predicts the question after seeing an answer and the article.

During decoding, we use beam search with beam size 10, length penalty 1.0, and trigram repetition blocking. We experimented with top-k sampling (Fan et al., 2018) and top-p sampling (Holtzman et al., 2019), but the outputted questions, while diverse, were quite noisy. Generations have minimum length 8 and maximum length 60.

To filter the questions, we first use simple heuristics, including removing:

• everything after the first question mark in a question
• exact duplicates
• questions shorter than three tokens

For the remaining questions, we use our QA model to answer each question and remove questions that the QA model deems unanswerable. We then take the top 20 most probable questions, randomly sampling some of the filtered questions if there are too few.

Question Answering We fine-tune BERT for question answering following the original work. We optimize using AdamW (Loshchilov and Hutter, 2018) with an initial learning rate of 5e-5. We train for 3 epochs, with a warmup ratio of 0.1. We use the model with the best development set performance.

We use SQuAD2.0 because we found the unanswerable questions useful for filtering out questions, and questions based on hallucinated facts in the summary should be unanswerable using the source article. Similar to the QG setting, we append the question and answer to the source article with intervening special marker tokens.
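The question filtering described in this appendix (truncating each candidate at its first question mark, dropping duplicates and very short questions, removing questions the QA model deems unanswerable, and keeping the 20 most probable) might be sketched as follows. The `(question, log_prob)` input format and the `is_answerable` callable are assumptions for illustration, and the backfilling of randomly sampled filtered-out questions when too few survive is omitted.

```python
from typing import Callable, List, Tuple

def filter_questions(
    candidates: List[Tuple[str, float]],        # (question, log-probability) pairs from beam search
    is_answerable: Callable[[str], bool],       # stand-in for the QA model's answerability check
    keep: int = 20,
) -> List[str]:
    """Apply the heuristic question filters described in Appendix B."""
    seen = set()
    kept = []
    for question, log_prob in candidates:
        question = question.split("?")[0].strip() + "?"  # drop everything after the first question mark
        if question in seen:                             # drop exact duplicates
            continue
        if len(question.split()) < 3:                    # drop questions shorter than three tokens
            continue
        if not is_answerable(question):                  # drop questions the QA model cannot answer
            continue
        seen.add(question)
        kept.append((question, log_prob))
    kept.sort(key=lambda pair: pair[1], reverse=True)    # most probable questions first
    return [question for question, _ in kept[:keep]]
```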
Figure 2: Annotation interface and instructions for CNN/DM factual consistency task.

Figure 3: Annotation interface and instructions for XSUM factual consistency task.
