
Benchmarking Large Language Models for News Summarization

Tianyi Zhang¹*, Faisal Ladhak²*, Esin Durmus¹, Percy Liang¹,
Kathleen McKeown², Tatsunori B. Hashimoto¹

¹Stanford University, USA   ²Columbia University, USA

Abstract

Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their success are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find that instruction tuning, not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

1 Introduction

Large language models (LLMs) have shown promising results in zero-/few-shot tasks across a wide range of domains (Chowdhery et al., 2022; Bai et al., 2022; Brown et al., 2020; Zhang et al., 2022) and have raised significant interest for their potential for automatic summarization (Goyal et al., 2022; Liu et al., 2022a). However, the design decisions contributing to their success in summarization remain poorly understood, and while prior work has shown that LLMs outperform the prior state of the art, it remains unclear whether their outputs are comparable to those of human writers. Examining these questions is crucial for advancing future research in automatic summarization.

To answer the first question, we perform a systematic evaluation of ten diverse LLMs with human evaluation on news summarization; our evaluation identifies instruction tuning as the key to zero-shot summarization capability. In contrast, self-supervised learning alone cannot induce strong summarization performance in the zero-shot setting (Figure 1). In fact, even a 350M-parameter instruction-tuned GPT-3 can perform on par with the 175B-parameter GPT-3.

To benchmark LLMs, we evaluated the standard CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018) datasets but found that existing reference summaries caused several issues. The reference summaries in these benchmarks were originally created in a different use context and, when evaluated as part of a generic news summarization benchmark, human annotators judge them to be worse than the outputs of most automatic systems (Figure 1). When computing automatic metrics using these references, their poor quality reduces the correlation between metric results and human judgment. Not only does this make evaluation difficult, but it also degrades the performance of systems that take supervision either through finetuning or few-shot prompting and makes comparison difficult.

To address the quality issues of reference summaries and to better understand how LLMs compare to human summary writers, we recruit freelance writers from Upwork¹ to re-annotate 100 articles from the test sets of CNN/DM and XSUM. Comparing the best-performing LLM, Instruct Davinci, to the freelance writers, we find that the Instruct Davinci summaries are much more extractive. By manually annotating the summarization operations (Jing and McKeown, 2000) used in these summaries, we find that Instruct Davinci paraphrases much less frequently, although it is able to combine copied segments coherently.

Given their stylistic differences, we recruit annotators to compare the Instruct Davinci summaries to those written by freelance writers. On aggregate, we find that Instruct Davinci is rated as comparable to the freelance writers. Examination of the annotations from each individual rater shows that every rater has their own consistent preference for either Instruct Davinci or the freelance writers.

* Equal contribution. Order determined by a random coin flip. Correspondence to [email protected] and [email protected].
¹ https://www.upwork.com.

Transactions of the Association for Computational Linguistics, vol. 12, pp. 39–57, 2024. https://doi.org/10.1162/tacl_a_00632
Action Editor: Dan Goldwasser. Submission batch: 5/2023; Revision batch: 7/2023; Published 1/2024.
© 2024 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Selected annotator ratings of summary coherence on a 1 to 5 Likert scale.

Together, our work makes the following key contributions. First, we identify instruction tuning, instead of model scale, as the key to LLMs' summarization capability. Second, we show that reference summaries used in XSUM, which are simply the first sentence of the news article, are judged by humans to be worse than the best LLM-generated summaries. Third, to address these issues with references, we collect better-quality summaries from freelance writers and show that the best LLM is rated as comparable to the Upwork freelance writers. In combination, these results call into question recent claims made about LLM summarization. In particular, summarization progress cannot be measured using reference-based metrics applied on XSUM. Furthermore, whether fine-tuned, few-shot, or zero-shot models perform better remains an open question due to the poor quality of training data. To encourage future work on improved evaluations, we release the high-quality summaries written by freelance writers and the evaluation data on 18 model settings and two datasets as resources.²

² https://github.com/Tiiiger/benchmark_llm_summarization.

2 Background and Related Work

2.1 News Summarization

News summarization is the task of producing a concise paragraph that captures the main points of a news article, and it has been a core problem within the field of automatic summarization (Radev et al., 2002; Nenkova and McKeown, 2011). Early work focused mostly on extractive approaches, using unsupervised data-driven methods that relied on different variants of word frequency to determine salience (e.g., Salton et al., 1997; Hovy and Lin, 1999; Lin and Hovy, 2002; Mani and Bloedorn, 1999; Conroy et al., 2006; Nenkova et al., 2006). Other approaches to extractive summarization relied on aspects of discourse semantics (e.g., lexical chains and rhetorical structure theory) (Barzilay and Elhadad, 1997; Marcu, 1997; Silber and McCoy, 2002; Steinberger et al., 2007), or graph-based methods (e.g., Radev et al., 2000; Mihalcea and Tarau, 2005; Erkan and Radev, 2004). These extractive approaches were developed for both single-document and multi-document news summarization, with far more work focusing on the multi-document than the single-document task.

Humans, however, rely on more abstractive operations (such as paraphrasing, generalization, etc.) in order to write fluent summaries (Jing and McKeown, 1999). This has led to a push toward building abstractive summarization systems, with initial research focusing on designing post-processing algorithms for extractive summarizers that targeted specific operations such as sentence fusion (Barzilay and McKeown, 2005; Marsi and Krahmer, 2005; Krahmer et al., 2008; Filippova and Strube, 2008; Thadani and McKeown, 2013), generation (Barzilay et al., 1999), and sentence compression (Jing, 2000; Knight and Marcu, 2002; McDonald, 2006; Cohn and Lapata, 2008). More scalable, data-driven approaches for building abstractive summarization systems were made possible by more effective neural systems for conditional generation (Sutskever et al., 2014; Bahdanau et al., 2015) as well as large-scale datasets (Rush et al., 2015; Hermann et al., 2015), leading to steady progress over the years (See et al., 2017; Chen and Bansal, 2018; Dong et al., 2019; Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020).

This work benchmarks LLMs on news summarization using two popular benchmarks, CNN/DM (Hermann et al., 2015) and XSUM (Narayan et al., 2018). These datasets contain hundreds of thousands of article-summary pairs but were created using "incidental supervision", i.e., the reference summaries were not written specifically for the task but adapted from content on the websites. CNN/DM includes articles from the CNN
and DailyMail websites as the source articles and adapts the bullet-point highlights that accompany the articles as reference summaries. XSUM includes articles from BBC News and adapts the bolded introductory sentence(s) as reference summaries. As a result, the reference summaries in these datasets are known to have quality issues (Maynez et al., 2020; Kang and Hashimoto, 2020), motivating us to address these defects to improve LLM evaluation.

To contextualize the performance of LLMs, we mainly compare to previous state-of-the-art approaches that leveraged supervised finetuning (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020; Liu et al., 2022b). Summarization evaluation is another active area of research. Many automatic metrics have been proposed (Lin, 2004; Zhang et al., 2020; Sellam et al., 2020; Durmus et al., 2020; Maynez et al., 2020; Deutsch and Roth, 2021), but they do not always correlate with human evaluation of summarization systems (Fabbri et al., 2020; Durmus et al., 2022). In this work, we evaluate the effectiveness of automatic metrics for evaluating LLMs and show that the usefulness of reference-based evaluation is closely linked to the quality of the references.

2.2 Large Language Models

LLMs (Bommasani et al., 2021; Chowdhery et al., 2022; Brown et al., 2020) have two distinctive features over previous pretrained models. First, LLMs have a much larger scale in terms of model parameters and training data. Second, unlike previous pretrained models that require finetuning, LLMs can be prompted zero-shot or few-shot to solve a task. In the zero-shot setting, prompting presents the LLM with inputs (e.g., news articles) and a natural language instruction (e.g., "summarize this news article in three sentences") and solicits outputs by having the LLM generate answers directly. When few-shot training examples are available, LLMs have the ability to learn "in context". In-context learning prepends training input-output pairs, along with the same style of instruction, to the testing input.

Recently, instruction tuning has emerged as an effective way to improve LLM prompting performance (Sanh et al., 2021; Wang et al., 2022; Ouyang et al., 2022). In this approach, a diverse set of natural language processing tasks are reformulated into the prompting format, and the LLM's parameters are updated for these tasks either through supervised finetuning or reinforcement learning.

Recent work (Goyal and Durrett, 2020) shows that the instruction-tuned GPT-3 Davinci model is better than finetuned LMs, but does not show which design decisions contribute to the improved performance. In our work, we carry out a more comprehensive benchmark on ten different LLMs to understand the effect of model scale, in-context learning, and instruction tuning. Given that automatic metrics may not be reliable, we focus on human evaluation as our benchmarking method.

3 Human Evaluation on News Summarization Benchmarks

In this section, we use human evaluation to systematically benchmark a diverse set of ten LLMs on news summarization. We observe that instruction tuning is the key to strong summarization capability and that reference summaries in current benchmarks may underestimate few-shot or finetuning performance.

3.1 Experimental Setup

Data  We conduct our human evaluation on CNN/DM and XSUM by sampling a hundred examples from each validation set. For the few-shot in-context learning settings, we sample five examples from the training set to be the demonstration examples. Due to the limited context window, we sample five articles that are between 50 and 150 tokens in length according to the GPT-2 tokenizer. For XSUM, we find that uniform sampling occasionally results in articles that are unreadable due to data preprocessing, so we manually pick from the training set.

Model Details  We consider ten LLMs across different pretraining strategies and model scales.³ Table 1 lists the details of the LLMs we consider. Due to limited computational resources and model access, we benchmark all models in the five-shot setting but only benchmark three OpenAI GPT-3 models and three OpenAI instruction-tuned GPT-3 models in the zero-shot setting.

³ We note that the training details of instruction-tuned GPT-3 models may differ from those mentioned in the publication and are inferred by us based on the API naming scheme.
Model                      Model Creator         # Parameters   Instruction Tuning   Reference
GPT-3 davinci v1           OpenAI                175B           ✗                    Brown et al. (2020)
GPT-3 curie v1             OpenAI                6.7B           ✗                    Brown et al. (2020)
GPT-3 ada v1               OpenAI                350M           ✗                    Brown et al. (2020)
InstructGPT davinci v2     OpenAI                175B           ✓                    Ouyang et al. (2022)
InstructGPT curie v1       OpenAI                6.7B           ✓                    Ouyang et al. (2022)
InstructGPT ada v1         OpenAI                350M           ✓                    Ouyang et al. (2022)
OPT 175B                   Meta                  175B           ✗                    Zhang et al. (2022)
GLM 130B                   Tsinghua University   130B           ✗                    Du et al. (2021)
Cohere xlarge v20220609    Cohere                52.4B          ✗                    Cohere (2022)
Anthropic-LM v4-s3         Anthropic             52B            ✓                    Bai et al. (2022)

Table 1: List of large language models we benchmarked with human evaluation.

For CNN/DM, we solicit LLM summaries with the following prompt template: "Article: [article]. Summarize the article in three sentences. Summary:". For XSUM, we modify the prompt template to summarize in one sentence to match the style of the reference summaries. For all LLMs we consider, we sample with temperature 0.3, following prior work (Wu et al., 2021).
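As a minimal sketch of this prompting setup (assuming the legacy OpenAI Completion API; the engine name, token limit, and five-shot prompt layout are illustrative assumptions rather than the exact configuration):

    import openai  # legacy pre-1.0 client; engine name below is an assumption

    CNNDM_INSTRUCTION = "Summarize the article in three sentences."
    XSUM_INSTRUCTION = "Summarize the article in one sentence."

    def zero_shot_prompt(article: str, instruction: str) -> str:
        # Mirrors the template quoted above: article, instruction, then "Summary:".
        return f"Article: {article}. {instruction} Summary:"

    def five_shot_prompt(demos, article, instruction):
        # In-context learning: prepend demonstration article-summary pairs
        # with the same instruction, then append the test article.
        blocks = [f"Article: {a}. {instruction} Summary: {s}" for a, s in demos]
        blocks.append(zero_shot_prompt(article, instruction))
        return "\n\n".join(blocks)

    def summarize(prompt: str) -> str:
        # Temperature 0.3 as described above; max_tokens is an assumption.
        response = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            temperature=0.3,
            max_tokens=128,
        )
        return response["choices"][0]["text"].strip()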
To contextualize our LLM benchmarking results, we also evaluate two state-of-the-art finetuned LMs: Pegasus (Zhang et al., 2020) and BRIO (Liu et al., 2022b). We decode the finetuned LMs using a beam size of 5, following prior work (Lewis et al., 2019). In addition, we also evaluate the existing reference summaries in the CNN/DM and XSUM validation sets.

Human Evaluation Protocol  We recruit annotators from Amazon Mechanical Turk, compensating them at the California minimum wage of $15.00/hr using conservative time estimates, as recommended by Whiting et al. (2019). We recruited a total of 30 annotators from the US who have a lifetime HIT approval rate of 98% or above with at least 10,000 approved HITs (Figure 8).⁴ Summaries are presented in random order and are evaluated independently by three annotators. We report average scores for each summary based on ratings from all three annotators.

⁴ We recruited annotators who were previously vetted for an earlier study (Liang et al., 2022).

Our annotators evaluate each summary based on three criteria: faithfulness, coherence, and relevance. We define these terms and collect data according to the guidelines in Fabbri et al. (2020). Coherence and relevance ratings are collected on a 1 to 5 Likert scale, while faithfulness ratings are collected as a binary value, since faithfulness is inherently binary in nature. Unlike Fabbri et al. (2020), we do not evaluate fluency because we find LLM outputs to be mostly fluent. The average pairwise agreement for the annotators in our annotator pool was 75% for faithfulness, 81% for coherence, and 86% for relevance.⁵ The full annotation guidelines are included in our code release.

⁵ To compute agreement for coherence and relevance, we first binarize the Likert scores, with a score of 3 or above being mapped to 1.

3.2 Evaluation Results

Table 2 presents the evaluation results.⁶ We now discuss two main observations.

⁶ We note that the 350M GPT-3 consistently generates empty outputs on the XSUM dataset, so we omit it from the human evaluation.
                                          CNN/Daily Mail                          XSUM
Setting / Model                   Faithfulness  Coherence  Relevance   Faithfulness  Coherence  Relevance

Zero-shot language models
  GPT-3 (350M)                    0.29          1.92       1.84        0.26          2.03       1.90
  GPT-3 (6.7B)                    0.29          1.77       1.93        0.77          3.16       3.39
  GPT-3 (175B)                    0.76          2.65       3.50        0.80          2.78       3.52
  Ada Instruct v1 (350M*)         0.88          4.02       4.26        0.81          3.90       3.87
  Curie Instruct v1 (6.7B*)       0.97          4.24       4.59        0.96          4.27       4.34
  Davinci Instruct v2 (175B*)     0.99          4.15       4.60        0.97          4.41       4.28
  Anthropic-LM (52B)              0.94          3.88       4.33        0.70          4.77       4.14
  Cohere XL (52.4B)               0.99          3.42       4.48        0.63          4.79       4.00
  GLM (130B)                      0.94          3.69       4.24        0.74          4.72       4.12
  OPT (175B)                      0.96          3.64       4.33        0.67          4.80       4.01

Five-shot language models
  GPT-3 (350M)                    0.86          3.73       3.85        –             –          –
  GPT-3 (6.7B)                    0.97          3.87       4.17        0.75          4.19       3.36
  GPT-3 (175B)                    0.99          3.95       4.34        0.69          4.69       4.03
  Ada Instruct v1 (350M*)         0.84          3.84       4.07        0.63          3.54       3.07
  Curie Instruct v1 (6.7B*)       0.96          4.30       4.43        0.85          4.28       3.80
  Davinci Instruct v2 (175B*)     0.98          4.13       4.49        0.77          4.83       4.33

Fine-tuned language models
  Brio                            0.94          3.94       4.40        0.58          4.68       3.89
  Pegasus                         0.97          3.93       4.38        0.57          4.73       3.85

Existing references               0.84          3.20       3.94        0.37          4.13       3.00

Table 2: Human evaluation results for zero-shot and five-shot LLMs, finetuned LMs, and reference summaries. We bold the entries that are not statistically significantly different from the best numbers in each column at p = 0.05, using a bootstrap-based paired mean difference test.
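For concreteness, the significance test referenced in the caption can be sketched as follows (a minimal paired bootstrap on per-example score differences; the number of resamples and the data layout are assumptions):

    import numpy as np

    def paired_bootstrap_pvalue(scores_a, scores_b, n_boot=10_000, seed=0):
        """Two-sided paired bootstrap test for the mean difference between
        two systems scored on the same set of articles."""
        rng = np.random.default_rng(seed)
        diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
        observed = diffs.mean()
        n = len(diffs)
        # Resample examples with replacement and recompute the mean difference.
        boot_means = np.array([
            diffs[rng.integers(0, n, size=n)].mean() for _ in range(n_boot)
        ])
        # Center the bootstrap distribution to test the null of no difference.
        p_value = np.mean(np.abs(boot_means - observed) >= abs(observed))
        return observed, p_value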

Instruction-Tuned Models Have Strong Summarization Ability.  Across the two datasets and three aspects, we find that the zero-shot instruction-tuned GPT-3 models, especially Instruct Curie and Davinci, perform the best overall. Compared to the fine-tuned LMs (e.g., Pegasus), Instruct Davinci achieves higher coherence and relevance scores (4.15 vs. 3.93 and 4.60 vs. 4.40) on CNN/DM and higher faithfulness and relevance scores (0.97 vs. 0.57 and 4.28 vs. 3.85) on XSUM, which is consistent with recent work (Goyal et al., 2022). In contrast to instruction tuning, we find scale to be less important. Even the largest 175B model often ignores the instruction and generates irrelevant content, while the much smaller Instruct Ada outperforms the 175B GPT-3 model on coherence and relevance.

In the five-shot setting, non-instruction-tuned LLMs can improve their summarization performance through in-context learning. For faithfulness scores on CNN/DM and coherence scores on XSUM, several non-instruction-tuned LLMs can perform as well as the instruction-tuned LLMs. However, for other aspects, we still find the instruction-tuned LLMs to be better.

Reference Summaries in Current Benchmarks Should Not Be Used for Training and Evaluating Generic News Summarization Systems.  We arrive at this conclusion based on two observations. First, most automatic summarization systems score better than the reference summaries across all three aspects. Second, applying in-context learning with the current reference summaries makes instruction-tuned models generate worse summaries. For example, on the XSUM dataset, after conditioning on five reference summaries, the faithfulness score of Instruct Davinci drops from 0.97 to 0.77.

The reference summaries make it difficult to compare LLMs to both finetuned models and humans. When comparing to finetuned models, the relatively poor performance of finetuned models can be attributed to the low quality of references in the training data. This suggests we could be underestimating the potential performance of finetuning approaches. When comparing to humans, the existing low-quality references are not representative of actual human performance since they were created through heuristics. As a result, the differences between instruction-tuned LLMs and human performance are likely overstated in Table 3.
                       CNN/DailyMail                          XSUM
Metric         Faithfulness  Coherence  Relevance    Faithfulness  Coherence  Relevance
Rouge-L        0.54          0.48       0.72         −0.27         0.71       0.30
METEOR         0.58          0.37       0.66         −0.22         0.68       0.38
BertScore      0.54          0.47       0.70         −0.23         0.70       0.30
BARTScore      0.56          0.34       0.65         −0.22         0.70       0.35
BLEURT         0.56          0.62       0.81         −0.08         0.67       0.41
SummaC         0.54          0.11       0.26         0.26          −0.41      −0.29
QAFactEval     0.64          0.16       0.35         0.55          0.16       0.37
BLANC          0.54          0.31       0.50         0.50          0.10       0.32

Table 3: System-level Kendall's tau correlation with human scores across different axes.


Figure 2: Example summaries generated by GPT-3 models (Section 3) or written by freelance writers (Section 4) for an article from the CNN/DM dataset. We find that the instruction-tuned GPT-3 model can generate a much better summary compared to the non-instruction-tuned variant. The reference summary from CNN/DM is not coherent, whereas the freelance writer summary is both coherent and relevant.

Qualitative Examples.  Figure 2 showcases example summaries for an article from the CNN/DM validation set, comparing the summaries of zero-shot GPT-3 Davinci, instruction-tuned GPT-3 Davinci, and the CNN/DM reference summary. We start by noting that the zero-shot GPT-3 model cannot follow the instruction to summarize well. After the summary paragraph, the model generates an additional question that is completely irrelevant. In addition to the failure to follow instructions, the generated summary contains a factual error, stating that the handbag mentioned is the most expensive in the world, which contradicts the original article. In contrast, the instruction-tuned GPT-3 model generates a summary that is both faithful and coherent.

We also observe from Figure 2 that the reference summary is not coherent. The brand "Hermes" is not introduced until the end, and its connection to the rest of the story is unclear. This is unsurprising, as reference summaries in the CNN/DM dataset were originally bullet points accompanying the articles, as opposed to a coherent paragraph. While such reference summaries might be suited to their original context, we argue that they are not useful for evaluating generic news summarization.
Figure 3: System-level Rouge-L vs. annotator-rated relevance scores.

3.3 Understanding Automatic Metrics

We compute system-level correlations against human ratings for eight popular automated evaluation metrics. For reference-based metrics, we consider Rouge-L (Lin, 2004), METEOR (Banerjee and Lavie, 2005), BertScore (Zhang et al., 2020), BLEURT (Sellam et al., 2020), and BARTScore (Yuan et al., 2021). For reference-free metrics, we consider SummaC (Laban et al., 2021), QAFactEval (Fabbri et al., 2022), and BLANC (Vasilyev et al., 2020).

Table 3 shows Kendall's tau rank correlations between automated metrics and human judgments. We observe significantly different trends on CNN/DM and XSUM, so we discuss them separately in the following paragraphs.
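As a minimal sketch of this computation (the score lists below are placeholders, not values from Table 3):

    from scipy.stats import kendalltau

    # System-level correlation: one aggregate metric score and one aggregate
    # human rating per system, aligned by system.
    metric_scores = [0.21, 0.25, 0.28, 0.30]   # placeholder metric averages
    human_scores = [3.9, 4.1, 4.3, 4.4]        # placeholder mean human ratings

    tau, p_value = kendalltau(metric_scores, human_scores)
    print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")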
For CNN/DM, we observe that the reference-based automatic metrics have a moderate correlation with some aspects of human judgments; e.g., Rouge-L has a 0.72 Kendall's tau correlation coefficient with relevance in Table 3. Such a level of correlation is comparable to that reported in the study of Fabbri et al. (2020), which measures the correlation of automatic metrics when evaluating finetuned LMs and even earlier neural summarization systems. Therefore, we conclude that on CNN/DM, automatic reference-based metrics can still provide useful signals for relevance.

Studying the results more closely, we find that Rouge-L and human evaluation are more correlated when comparing within each model group. We plot Rouge-L over the relevance rating in Figure 3 as an example. First, we observe that Rouge-L still prefers finetuned LMs (green points on top of the plots) to LLMs, consistent with prior work (Goyal et al., 2022). Despite this mistake, when only comparing LLMs with each other, we find that a larger than 0.05 Rouge-L difference usually translates to improved human evaluation.

On XSUM, reference-based metrics have a very low correlation with faithfulness and relevance since the reference summaries themselves are terrible in these aspects (Table 3; also see Maynez et al., 2020). With such low-quality references, we do not expect reference-based metrics to extract useful information.

In general, across both datasets, we find that reference-based metrics correlate better with human judgments on the aspects for which reference summaries also have better scores (e.g., CNN/DM relevance, XSUM coherence). This points to the important role of quality reference summaries for reference-based metrics, as previously observed in machine translation (Freitag et al., 2020). Reference-free metrics are less handicapped by the low-quality references, but they are mostly geared towards measuring faithfulness. Even BLANC, which is designed to measure overall summary quality, correlates best with faithfulness and much worse with relevance and coherence.

4 Comparing to Freelance Writers

In Section 3, we saw that the low-quality reference summaries make studying and benchmarking LLMs difficult. In this section, we address this by recruiting Upwork freelance writers to collect higher-quality summaries. With this data, we aim to answer two important questions. First, we would like to know whether the best LLM has reached human-level performance and how the summaries written by the best LLM differ from the ones written by humans. Second, we aim to examine the correlation between reference-based metrics and human judgments when the metrics are calculated using our higher-quality reference summaries.

4.1 Experimental Setup

In this section, we describe the recruitment process and instructions for the summary writing task.

Data.  For data used in our study, we select 50 articles from each of the CNN/DM and XSUM evaluation sets described in Section 3.1 and assign each article to three writers. For XSUM, we use the full articles rather than the preprocessed version where the first bolded sentence is removed.
Writer Recruitment.  We recruit six writers who have previous experience in writing blog posts, landing page introductions, or product descriptions from the freelance work platform Upwork. After conducting a qualification round in which we asked writers to summarize five articles, we selected the best writers according to the faithfulness, coherence, and relevance of their summaries. Through an initial pilot study, we estimate that the time required to summarize a CNN/DM or XSUM article is around 12 to 15 minutes. Therefore, we pay our writers $4 for every article they summarize, following recommended practice (Whiting et al., 2019). We based the assignments on writers' availability, with the most prolific writer summarizing 100 articles and the least prolific writer summarizing 35 articles. We include our annotation guideline for freelance writers in Figure 7.

Summary Writing Instructions.  For the annotation instruction, we instruct our writers to summarize each article in around 50 words.⁷ To give better task grounding, we ask the writers to summarize as if they are writing a newsletter to update their readers on the news. We release the full annotation guideline along with our code release.

⁷ We conducted an initial study to pilot instructions and found that instructing writers with a sentence limit often resulted in summaries that differ significantly in length.

LLM Summaries Generation.  Recently, Liu et al. (2022a) showed that length is a confounding factor in the human evaluation of summarization. To control this potential length confound, we modify the zero-shot prompt in Section 3.1 to elicit summaries that are around 50 words, which is the same word limit provided to the freelance writers. We found that the Instruct Davinci model consistently produces summaries that exceed a given word limit. Therefore, we intentionally prompt the Instruct Davinci model with a 25-word limit to produce summaries with an average length of 50 words. With this new prompt, we generate the summaries using the same hyperparameters described in Section 3.1.

Quality Control.  To verify the quality of the summaries written by freelance writers, we evaluate a random subset of 100 summaries using the same annotation scheme as in Section 3.1 with Mechanical Turkers. Table 4 reports the evaluation results, where we see that the freelance writer summaries have much higher quality than the original reference summaries in CNN/DM and XSUM. In addition, we see that the difference between the freelance writers and Instruct Davinci in this evaluation is small. Next, we carry out more targeted evaluations to compare the summaries written by freelance writers and Instruct Davinci.

Model                        Faithfulness   Coherence   Relevance
Freelance Writer             0.93           4.39        4.26
Zero-shot Instruct Davinci   0.98           4.26        4.40
Reference Summaries          0.64           3.59        3.45

Table 4: Amazon Mechanical Turker evaluation results for the freelance writer summaries. Results of zero-shot Instruct Davinci and reference summaries are taken from Table 2 after averaging the corresponding ratings.

4.2 Paired Comparison between LLM and Freelance Writers

Comparing Stylistic Differences.  Despite the similar performance in our quality control study, we find that LLM summaries and freelance writer summaries have distinctive styles. Figure 2 shows an example summary written by the freelance writer. Compared to the LLM-generated summary, we find that the freelance writer summary often contains more paraphrasing and copies less from the article.

To illustrate this stylistic difference, we measure two extractiveness measures, coverage and density, following Grusky et al. (2018). Coverage is defined as the percentage of words in the summary that are also present in the article; density is defined as the average length of the continuous text spans in the summary that are copied from the article. Our analysis shows that the coverage and density for Instruct Davinci-generated summaries are 0.92 and 12.1, whereas those for the writers' summaries are 0.81 and 2.07. These measures show that the summaries generated by Instruct Davinci are highly extractive, whereas the summaries written by the freelance writers are much more abstractive.
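A simplified sketch of these two measures is given below (whitespace tokenization and greedy span matching are assumptions; note that Grusky et al. (2018) define density with squared fragment lengths, whereas this sketch follows the simpler wording above):

    def extractive_fragments(article_tokens, summary_tokens):
        # Greedily match maximal contiguous spans of the summary against the article.
        fragments, i = [], 0
        while i < len(summary_tokens):
            best = 0
            for j in range(len(article_tokens)):
                k = 0
                while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
            if best > 0:
                fragments.append(summary_tokens[i:i + best])
                i += best
            else:
                i += 1
        return fragments

    def coverage_and_density(article: str, summary: str):
        a, s = article.lower().split(), summary.lower().split()
        frags = extractive_fragments(a, s)
        copied = sum(len(f) for f in frags)
        coverage = copied / len(s)                       # share of summary words copied from the article
        density = copied / len(frags) if frags else 0.0  # average length of copied spans, per the text above
        return coverage, density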
To have a fine-grained understanding of these stylistic differences, we manually analyze the distribution of "cut-and-paste operations" in these
two sets of summaries. Jing and McKeown (2000) identify a set of "cut and paste" operations for reusing text from the article, including sentence reduction, sentence combination, syntactic transformation, lexical paraphrasing, and generalization or specification. On top of these operations, we additionally include a sentence copy operation to account for summary sentences that are directly copied from the article. Using this guideline, we manually annotate ten randomly sampled summary pairs written by Instruct Davinci and the freelance writers.

Figure 4: Distributions of cut-and-paste operations in the summaries written by freelance writers and by Instruct Davinci. By comparison, human-written summaries contain more lexical paraphrasing and sentence reduction, whereas the Instruct Davinci model copies more directly from the article.

Figure 4 reports the distribution of the cut-and-paste operations, showing the fraction of sentences that contain each operation. First, we observe that the freelance writer summaries use lexical paraphrasing and generalization/specification much more frequently than the Instruct Davinci-generated summaries. Because both operations often involve using novel words that are not present in the article, this matches the fact that the freelance writer summaries have lower coverage (0.81 vs. 0.92) than the Instruct Davinci summaries. Second, we find that sentence combination is a common strategy used by both the freelance writers and Instruct Davinci. Third, we find that the freelance writers never copy an entire sentence directly from the article, but Instruct Davinci does this more frequently.

In conclusion, we find that Instruct Davinci summarizes in a very different style than human writers. We emphasize that the freelance writers write in an abstractive style despite the fact that we did not explicitly instruct them to do so. We also observe similarly abstractive styles across the six freelance writers.

Comparing Human Preference.  We now return to our original goal of understanding whether LLM-generated summaries have quality on par with human-written ones. In the following paragraphs, we discuss our annotation design and recruitment process.

We conduct a blinded pairwise comparison evaluation between the best LLM, Instruct Davinci, and the freelance writers, similar to the evaluation in Goyal and Durrett (2020). Besides selecting the better summary within each pair, the annotators can decide that the two summaries are equally good. We release the full annotation instructions along with the code release for this project.

In order to compare the best LLM with the freelance writers, we annotate two aspects. First, we solicit annotators' overall preference, which balances multiple quality aspects such as faithfulness, coherence, and relevance. Second, we solicit a more targeted measure of informativeness by asking the annotators to compare the number of facts in each summary. For the informativeness measure, we are motivated by the hypothesis that a more abstractive writing style can pack more information into the summary given the same word count. While it would also be interesting to compare summary coherence and relevance, we omit them because annotators were unable to differentiate these aspects from the overall preference in a pilot study.

For our recruitment process, we recruit five additional annotators through Upwork and retain one writer who participated in the previous round of summary writing.⁸ We carry out a qualification round and reject annotators whose ratings differ significantly from the authors' on a set of control questions for informativeness. We give each annotator the same set of 100 summary pairs, where the average lengths of the freelance writer summaries and the Instruct Davinci summaries are 53.2 and 52.0 words, respectively.

⁸ Other annotators left during the course of the study due to a change in freelance work schedule.
Figure 5: Human evaluation results comparing summaries written by freelance writers and summaries generated by Instruct GPT-3 Davinci. On aggregate, annotators equally prefer freelance writers and Instruct Davinci. However, there is high variability in individual annotators' preferences. Notably, annotator 1 writes abstractive summaries but prefers the more extractive Instruct Davinci summaries.



Figure 6: System-level Rouge-L vs. annotator-rated faithfulness. The left plot is computed with XSUM references, where the correlation is weak, and the right plot is computed with the freelance writer summaries, where the correlation is much improved.

Figure 5 shows the results of the paired comparison. While we hypothesized that the more abstractive writing style could lead to more informative summaries, we did not find a significant effect in our annotator pool, who rated the more abstractive summaries as more informative only 51.1% of the time. On the informativeness question, our annotators reached moderate agreement (Krippendorff's alpha is 0.32), validating our annotation instruction and recruitment process. Moving on to the more subjective overall preference, we find that our annotators equally prefer the freelance writer summaries and the Instruct Davinci summaries. However, a closer analysis shows that there is significant variability in individual annotators' preferences, and the inter-annotator agreement is low (Krippendorff's alpha is 0.07). This suggests that the quality of generated summaries is getting close to that of the freelance writer summaries and that the comparison depends on each annotator's stylistic preference.
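Agreement values of this kind can be computed with the third-party krippendorff package, as in the minimal sketch below; the data layout (annotators by items, NaN for missing ratings) and the example labels are assumptions:

    import numpy as np
    import krippendorff  # pip install krippendorff

    # Rows: annotators; columns: summary pairs. Values are categorical labels,
    # e.g., 0 = prefers writer, 1 = prefers Instruct Davinci, 2 = tie; NaN = not rated.
    ratings = np.array([
        [0, 1, 2, 1, np.nan],
        [0, 1, 1, 1, 0],
        [1, 1, 2, 0, 0],
    ], dtype=float)

    alpha = krippendorff.alpha(reliability_data=ratings,
                               level_of_measurement="nominal")
    print(f"Krippendorff's alpha = {alpha:.2f}")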
One example of such a stylistic preference is seen in the results from annotator 1, who also participated in the first round of summary writing. Like the other writers, annotator 1 summarizes in an abstractive style (2.5 density and 0.86 coverage). However, annotator 1 prefers Instruct Davinci 57% of the time, even though it generated much more extractive summaries. These results suggest an intriguing gap between annotator preferences when writing and when evaluating summaries.

4.3 Reevaluating Reference-based Metrics

In Section 3.3, we saw that the performance of automated metrics may depend on the quality of reference summaries. With the freelance writer summaries, we now conduct an initial study on the effect of using better-quality references. We focus on using Rouge-L for faithfulness evaluation on the XSUM dataset because the current reference summaries are known to be highly unfaithful (Maynez et al., 2020).

In Figure 6, we plot the system-level Rouge-L against the human ratings. The left plot shows the results of computing Rouge-L with existing reference summaries from XSUM, which has a negative correlation with human ratings. This result matches our expectations because the existing reference summaries are highly unfaithful. On
the right, we see the results of computing Rouge-L with the freelance writer summaries, which leads to a much more positive correlation. Hence, we see that the usefulness of reference-based evaluation is closely linked to the quality of the references, and we can improve metric correlation by using better reference summaries.
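A minimal sketch of this comparison using the rouge_score package is shown below; the summary lists are placeholders, and the aggregation (mean Rouge-L F1 over aligned pairs) is an assumption:

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

    def mean_rouge_l(system_summaries, references):
        # System-level score: average Rouge-L F1 of the system outputs
        # against a given set of reference summaries (aligned by index).
        scores = [
            scorer.score(ref, hyp)["rougeL"].fmeasure
            for ref, hyp in zip(references, system_summaries)
        ]
        return sum(scores) / len(scores)

    # Same system outputs, scored against two different reference sets.
    davinci_outputs = ["Placeholder system summary 1.", "Placeholder system summary 2."]
    xsum_references = ["Placeholder XSUM reference 1.", "Placeholder XSUM reference 2."]
    writer_references = ["Placeholder writer summary 1.", "Placeholder writer summary 2."]

    score_vs_xsum = mean_rouge_l(davinci_outputs, xsum_references)
    score_vs_writers = mean_rouge_l(davinci_outputs, writer_references)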
ited to ‘‘learning from human feedback’’ (LHF;
5 Discussion Stiennon et al., 2020; Ziegler et al., 2019). Con-
trary to supervised finetuning that trains systems
Implication for Model Development. In this on written summaries, learning from human feed-
study, we systematically evaluate a diverse set back trains systems from binary labels of human
of LLMs and find that instruction tuning contrib- preferences. As we observe in Section 4.2, there
utes the most to LLMs’ summarization capability. is a discrepancy in how annotators write and rate
We believe that there is much research beyond summaries. While it is possible that LHF has
our benchmarking effort that needs to be done merits over the supervised learning/finetuning
to better understand the effect of instruction tun- approach in exploiting this discrepancy, more
ing. Here we hypothesize three aspects that could analysis is needed to validate this hypothesis.
account for the success of instruction tuning. Third, multi-task learning can be important. In-
First, the quality of the summarization data struct Davinci is trained on a diverse distribution
used in instruction tuning can serve an important of inputs and many previous studies have con-
role. Our findings in Section 3 show that cur- firmed the effectiveness of multi-task learning.
rently, we are finetuning language models on low- We look forward to understanding how summari-
quality training data, which can account for their zation benefits from learning on other tasks.

49
Figure 8: MTurk annotation guideline for summary quality evaluation.

Implication for Summarization Evaluation.  Our work also reveals the difficulties in evaluating high-performance LLMs. As LLMs become increasingly close to human-level performance, human evaluation requires a larger number of samples and less noisy measurements to evaluate the quality of LLMs. Recently, Liu et al. (2022a) also pointed out the difficulties in conducting human evaluation for summarization and advocated using fine-grained semantic units to match against reference summaries. However, as our evaluation points out, not only are the existing reference summaries unreliable, but the summaries written by well-paid freelance writers also may not outperform LLM summaries significantly. Therefore, defining reference summaries as the ground truth may be overly restrictive as LLMs approach or even exceed average human-level performance.

We acknowledge that summarization evaluation is dependent on the application scenario and that the existing reference summaries could be suitable in another context. For example, the bullet-point style summary in CNN/DM may suffice
for being displayed on news websites. The quality issues (such as coherence) we pointed out in this paper may not constitute a concern in specific application scenarios. However, we emphasize that research on single-document news summarization is often abstracted away from downstream applications and used for judging generic summarization capability. Our findings in this paper are tied to this research context. This is the reason why the major results of our study rely on new summaries written by freelance writers.

Not only is human evaluation limited by the reference quality, but it is also affected by the subjectivity of evaluation. Individual variation shows that there are many acceptable ways to summarize and that individuals may even show different preferences at different points in time (writing vs. rating). These factors in combination suggest that we may have reached the limit of single-document news summarization. Existing benchmarks can still play a role in evaluating new models, but only if evaluation is done correctly. As LLMs improve, we believe that summarization can be better grounded in downstream applications where user values are better defined, so that annotators have a lower degree of freedom in balancing which quality aspects matter most to them.

Limitations.  Due to time constraints, this study has only evaluated systems on English news summarization where the summaries are designed to have around 50 words. We also acknowledge that as automatic systems improve, it becomes increasingly difficult for annotators to unambiguously rank summaries by quality due to differences in their individual preferences.

6 Conclusion

In this work, we conducted a comprehensive human evaluation of ten LLMs across the two most popular news summarization benchmarks. Through our experiments, we find that the state-of-the-art LLM performs on par with summaries written by freelance writers, with instruction tuning being the key factor for success. Beyond these findings, our work highlights the crucial role of good reference summaries in both summarization model development and evaluation. Unless the reference quality issue is addressed, comparing zero-shot, few-shot, and finetuning performance will remain an open question, and the current benchmarks will provide limited value when used with reference-based evaluation. Even when we address the quality issue and conduct a human evaluation with high-quality references, we observe a significant amount of individual variation in our annotator pool. Due to these factors, evaluations for single-document news summarization may be reaching their limits.

Acknowledgments

This work is supported by an Open Philanthropy grant and partially supported by a gift from Northrup Grumman. We thank the reviewers and editors for their comments, as well as the Stanford NLP group and the Stanford Center for Research on Foundation Models community for their feedback.

References

Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, T. J. Henighan, Nicholas Joseph, Saurav Kadavath, John Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Christopher Olah, Benjamin Mann, and Jared Kaplan. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL.

Regina Barzilay and Michael Elhadad. 1997. Using lexical chains for text summarization. In Proceedings of ISTS, ACL 1997.
Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328. https://doi.org/10.1162/089120105774321091

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 550–557, College Park, Maryland, USA. Association for Computational Linguistics. https://doi.org/10.3115/1034678.1034760

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, S. Buch, Dallas Card, Rodrigo Castellon, Niladri S. Chatterji, Annie S. Chen, Kathleen A. Creel, Jared Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren E. Gillespie, Karan Goel, Noah D. Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas F. Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, O. Khattab, Pang Wei Koh, Mark S. Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Benjamin Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, J. F. Nyarko, Giray Ogut, Laurel J. Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Robert Reich, Hongyu Ren, Frieda Rong, Yusuf H. Roohani, Camilo Ruiz, Jack Ryan, Christopher Re, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishna Parasuram Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramer, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In NeurIPS.

Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 675–686, Melbourne, Australia. Association for Computational Linguistics. https://doi.org/10.18653/v1/P18-1063

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek B. Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Benton C. Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Cohere. 2022. Introduction to large language models. https://docs.cohere.ai/docs/introduction-to-large-language-models

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 137–144, Manchester, UK. Coling 2008 Organizing Committee. https://doi.org/10.3115/1599081.1599099

J. Conroy, J. Schlessinger, D. O'Leary, and J. Goldstein. 2006. Back to basics: CLASSY 2006. In Proceedings of the Document Understanding Conference. https://doi.org/10.12968/sece.2006.11.755

Daniel Deutsch and Dan Roth. 2021. Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 300–309, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.conll-1.24

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. https://doi.org/10.48550/arXiv.1905.03197

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2021. GLM: General language model pretraining with autoregressive blank infilling. In ACL.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.454

Esin Durmus, Faisal Ladhak, and Tatsunori Hashimoto. 2022. Spurious correlations in reference-free evaluation of text generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.102

Güneş Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research. https://doi.org/10.1613/jair.1523

Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, Seattle, United States. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.naacl-main.187

Alexander R. Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2020. SummEval: Re-evaluating summarization evaluation. arXiv preprint arXiv:2007.12626. https://doi.org/10.1162/tacl_a_00373

Katja Filippova and Michael Strube. 2008. Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 177–185. https://doi.org/10.3115/1613715.1613741

Markus Freitag, David Grangier, and Isaac Caswell. 2020. BLEU might be guilty but references are not innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.18653/v1/2020.emnlp-main.5

Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3592–3603, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.322
Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2022. News summarization and evaluation in the era of GPT-3. ArXiv, abs/2209.12356.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In North American Chapter of the Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1065

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In NeurIPS.

Eduard Hovy and Chin-Yew Lin. 1999. Automated text summarization in SUMMARIST. In Advances in Automatic Text Summarization, pages 82–94.

Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Applied Natural Language Processing Conference. https://doi.org/10.3115/974147.974190

Hongyan Jing and Kathleen McKeown. 2000. Cut and paste based text summarization. In Applied Natural Language Processing Conference.

Hongyan Jing and Kathleen R. McKeown. 1999. The decomposition of human-written summary sentences. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 129–136. https://doi.org/10.1145/312624.312666

Daniel Kang and Tatsunori B. Hashimoto. 2020. Improved natural language generation via loss truncation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 718–731, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.66

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107. https://doi.org/10.1016/S0004-3702(02)00222-9

Emiel Krahmer, Erwin Marsi, and Paul van Pelt. 2008. Query-based sentence fusion is better defined and leads to more preferred results than generic sentence fusion. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 193–196. https://doi.org/10.3115/1557690.1557745

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2021. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177. https://doi.org/10.1162/tacl_a_00453

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2019. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.703

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Re, Diana Acosta-Navas, Drew A. Hudson, E. Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel J. Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan S. Kim, Neel Guha, Niladri S. Chatterji, O. Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas F. Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

C. Lin and E. Hovy. 2002. From single to multi-document summarization: A prototype system and its evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 457–464. https://doi.org/10.3115/1073083.1073160
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1387

Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq R. Joty, Chien-Sheng Wu, Caiming Xiong, and Dragomir R. Radev. 2022a. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. ArXiv, abs/2212.07981. https://doi.org/10.18653/v1/2023.acl-long.228

Yixin Liu, Pengfei Liu, Dragomir R. Radev, and Graham Neubig. 2022b. BRIO: Bringing order to abstractive summarization. In Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.207

Inderjeet Mani and Eric Bloedorn. 1999. Summarizing similarities and differences among related documents. Information Retrieval, 1(1–2):35–67. https://doi.org/10.1023/A:1009930203452

Daniel Marcu. 1997. From discourse structures to text summaries. In Intelligent Scalable Text Summarization.

Erwin Marsi and Emiel Krahmer. 2005. Explorations in sentence fusion. In Proceedings of the European Workshop on Natural Language Generation 2005, pages 109–117.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.173

Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 297–304.

Rada Mihalcea and Paul Tarau. 2005. Multi-document summarization with iterative graph-based algorithms. In Proceedings of the First International Conference on Intelligent Analysis Methods and Tools (IA 2005). McLean, VA.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D18-1206

Ani Nenkova and Kathleen McKeown. 2011. Automatic summarization. Foundations and Trends in Information Retrieval, 5(2–3):103–233. https://doi.org/10.1561/1500000015

Ani Nenkova, Lucy Vanderwende, and Kathleen McKeown. 2006. A compositional context sensitive multi-document summarizer: Exploring the factors that influence summarization. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 573–580. https://doi.org/10.1145/1148170.1148269

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke E. Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Francis Christiano, Jan Leike, and Ryan J. Lowe. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Dragomir R. Radev, Eduard H. Hovy, and Kathleen McKeown. 2002. Introduction to the special issue on summarization. Computational Linguistics, 28:399–408. https://doi.org/10.1162/089120102762671927

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In NAACL-ANLP 2000 Workshop: Automatic Summarization.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D15-1044

Gerard Salton, Amit Singhal, Mandar Mitra, and Chris Buckley. 1997. Automatic text structuring and summarization. Information Processing & Management, 33(2):193–207. Methods and Tools for the Automatic Construction of Hypertext. https://doi.org/10.1016/S0306-4573(96)00062-3

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M. Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal V. Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Stella Rose Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M. Rush. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/P17-1099

Thibault Sellam, Dipanjan Das, and Ankur P. Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Annual Meeting of the Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.acl-main.704

H. Gregory Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496. https://doi.org/10.1162/089120102762671954

Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing and Management, 43(6):1663–1680. https://doi.org/10.1016/j.ipm.2007.01.010

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan J. Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2020. Learning to summarize from human feedback. arXiv preprint arXiv:2009.01325.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.

Kapil Thadani and Kathleen McKeown. 2013. Supervised sentence fusion with single-stage inference. In Proceedings of IJCNLP, Nagoya, Japan.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20, Online. Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.eval4nlp-1.2

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Gary Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Maitreya Patel, Kuntal Kumar Pal, M. Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Shailaja Keyur Sampat, Savan Doshi, Siddharth Deepak Mishra, Sujan Reddy, Sumanta Patro, Tanay Dixit, Xudong Shen, Chitta Baral, Yejin Choi, Hannaneh Hajishirzi, Noah A. Smith, and Daniel Khashabi. 2022. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705.

Mark E. Whiting, Grant Hugh, and Michael S. Bernstein. 2019. Fair work: Crowd work minimum wage with one line of code. In AAAI Conference on Human Computation & Crowdsourcing. https://doi.org/10.1609/hcomp.v7i1.5283

Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. ArXiv, abs/2106.11520.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In ICML.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models. ArXiv, abs/2205.01068.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.