Benchmarking Large Language Models for News Summarization
Transactions of the Association for Computational Linguistics, vol. 12, pp. 39–57, 2024. https://doi.org/10.1162/tacl_a_00632
Action Editor: Dan Goldwasser. Submission batch: 5/2023; Revision batch: 7/2023; Published 1/2024.
© 2024 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Selected annotator ratings of summary coherence on a 1 to 5 Likert scale.

as comparable to the freelance writers. Exami-

the field of automatic summarization (Radev et al., 2002; Nenkova and McKeown, 2011). Early work focused mostly on extractive approaches, using unsupervised data-driven methods that relied on different variants of word frequency to determine salience (e.g., Salton et al., 1997; Hovy and Lin, 1999; Lin and Hovy, 2002; Mani and Bloedorn, 1999; Conroy et al., 2006; Nenkova et al., 2006). Other approaches to extractive summarization relied on aspects of discourse semantics (e.g., lexical chains and rhetorical structure theory) (Barzilay and Elhadad, 1997; Marcu, 1997; Silber and McCoy, 2002; Steinberger et al., 2007), or graph-based methods (e.g., Radev et al., 2000; Mihalcea and Tarau, 2005).
parameters are updated for these tasks either through supervised finetuning or reinforcement learning.

Recent work (Goyal et al., 2022) shows that the instruction-tuned GPT-3 Davinci model is better than finetuned LMs, but does not identify which design decisions contribute to the improved performance. In our work, we carry out a more comprehensive benchmark on ten different LLMs to understand the effect of model scale, in-context learning, and instruction tuning. Given that automatic metrics may not be reliable, we focus on human evaluation as our benchmarking method.

and DailyMail websites as the source articles and adapts the bullet point highlights that accompany the articles as reference summaries. XSUM includes articles from BBC News and adapts the bolded introductory sentence(s) as reference summaries. As a result, the reference summaries in these datasets are known to have quality issues (Maynez et al., 2020; Kang and Hashimoto, 2020), motivating us to address these defects to improve LLM evaluation.

To contextualize the performance of LLMs, we mainly compare to previous state-of-the-art approaches that leveraged supervised finetuning (Liu and Lapata, 2019; Lewis et al., 2019; Zhang et al., 2020).
Model                      Model Creator        # Parameters   Instruction Tuning   Reference
GPT-3 davinci v1           OpenAI               175B           ✗                    Brown et al. (2020)
GPT-3 curie v1             OpenAI               6.7B           ✗                    Brown et al. (2020)
GPT-3 ada v1               OpenAI               350M           ✗                    Brown et al. (2020)
InstructGPT davinci v2     OpenAI               175B           ✓                    Ouyang et al. (2022)
InstructGPT curie v1       OpenAI               6.7B           ✓                    Ouyang et al. (2022)
InstructGPT ada v1         OpenAI               350M           ✓                    Ouyang et al. (2022)
OPT 175B                   Meta                 175B           ✗                    Zhang et al. (2022)
GLM 130B                   Tsinghua University  130B           ✗                    Du et al. (2021)
Cohere xlarge v20220609    Cohere               52.4B          ✗                    Cohere (2022)

Table 1: Overview of the LLMs evaluated in this work.
For CNN/DM, we solicit LLM summaries with the following prompt template: ''Article: [article]. Summarize the article in three sentences. Summary:''. For XSUM, we modify the prompt template to summarize in one sentence to match the style of the reference summaries. For all LLMs we consider, we sample with temperature 0.3 following prior work (Wu et al., 2021).

To contextualize our LLM benchmarking results, we also evaluate two state-of-the-art finetuned LMs: Pegasus (Zhang et al., 2020) and BRIO (Liu et al., 2022b). We decode the finetuned LMs using a beam size of 5 following prior work (Lewis et al., 2019). In addition, we also evaluate the existing reference summaries in the CNN/DM and XSUM validation sets.

Human Evaluation Protocol  We recruit annotators from Amazon Mechanical Turk, compensating them at the California minimum wage of $15.00/hr using conservative time estimates as recommended by Whiting et al. (2019). We recruited a total of 30 annotators from the US who have a lifetime HIT approval rate of 98% or above with at least 10,000 approved HITs (Figure 8).4 Summaries are presented in random order and are evaluated independently by three annotators. We report average scores for each summary based on ratings from all three annotators.

Our annotators evaluate each summary based on three criteria: faithfulness, coherence, and relevance. We define these terms and collect data according to the guidelines in Fabbri et al. (2020). Coherence and relevance ratings are collected on a 1 to 5 Likert scale, while faithfulness ratings are collected as a binary value, since faithfulness is an inherently binary judgment. Unlike Fabbri et al. (2020), we do not evaluate fluency because we find LLM outputs to be mostly fluent. The average pairwise agreement for the annotators in our annotator pool was 75% for faithfulness, 81% for coherence, and 86% for relevance.5 The full annotation guidelines are included in our code release.

4 We recruited annotators who were previously vetted for an earlier study (Liang et al., 2022).

5 To compute agreement for coherence and relevance, we first binarize the Likert scores, with a score of 3 or above being mapped to 1.
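To make the setup above concrete, the sketch below constructs the two prompt templates and reproduces the agreement computation from footnote 5. The query_llm callable is a hypothetical stand-in for whichever completion API is used (the paper does not tie the setup to a specific client), the exact XSUM prompt wording is inferred from the description, and only the temperature of 0.3 and the binarization rule are taken directly from the text.

    from itertools import combinations

    # Prompt templates from Section 3.1; CNN/DM asks for three sentences.
    # The XSUM wording is an assumption based on the description in the text.
    CNNDM_TEMPLATE = "Article: {article}. Summarize the article in three sentences. Summary:"
    XSUM_TEMPLATE = "Article: {article}. Summarize the article in one sentence. Summary:"

    def build_prompt(article, dataset):
        template = CNNDM_TEMPLATE if dataset == "cnndm" else XSUM_TEMPLATE
        return template.format(article=article)

    def summarize(article, dataset, query_llm):
        # query_llm is a hypothetical wrapper around an LLM API;
        # all LLMs are sampled with temperature 0.3.
        return query_llm(prompt=build_prompt(article, dataset), temperature=0.3)

    def binarize(likert):
        # Footnote 5: Likert scores of 3 or above are mapped to 1.
        return 1 if likert >= 3 else 0

    def pairwise_agreement_likert(ratings_per_summary):
        # Average agreement over all annotator pairs, per summary, after
        # binarizing coherence/relevance ratings. Faithfulness labels are
        # already binary and would be compared directly without binarize().
        matches, total = 0, 0
        for ratings in ratings_per_summary:
            labels = [binarize(r) for r in ratings]
            for a, b in combinations(labels, 2):
                matches += int(a == b)
                total += 1
        return matches / total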
3.2 Evaluation Results

Table 2 presents the evaluation results.6 We now discuss two main observations.

6 We note that the 350M GPT-3 model consistently generates empty outputs on the XSUM dataset, so we omit it from the human evaluation.
                                       CNN/Daily Mail                        XSUM
Setting / Models                       Faithfulness  Coherence  Relevance    Faithfulness  Coherence  Relevance

Zero-shot language models
GPT-3 (350M)                           0.29          1.92       1.84         0.26          2.03       1.90
GPT-3 (6.7B)                           0.29          1.77       1.93         0.77          3.16       3.39
GPT-3 (175B)                           0.76          2.65       3.50         0.80          2.78       3.52
Ada Instruct v1 (350M*)                0.88          4.02       4.26         0.81          3.90       3.87
Curie Instruct v1 (6.7B*)              0.97          4.24       4.59         0.96          4.27       4.34
Davinci Instruct v2 (175B*)            0.99          4.15       4.60         0.97          4.41       4.28
Anthropic-LM (52B)                     0.94          3.88       4.33         0.70          4.77       4.14
Cohere XL (52.4B)                      0.99          3.42       4.48         0.63          4.79       4.00
GLM (130B)                             0.94          3.69       4.24         0.74          4.72       4.12
OPT (175B)                             0.96          3.64       4.33         0.67          4.80       4.01

Five-shot language models
GPT-3 (350M)                           0.86          3.73       3.85         –             –          –
GPT-3 (6.7B)                           0.97          3.87       4.17         0.75          4.19       3.36
GPT-3 (175B)                           0.99          3.95       4.34         0.69          4.69       4.03
Ada Instruct v1 (350M*)                0.84          3.84       4.07         0.63          3.54       3.07
Curie Instruct v1 (6.7B*)              0.96          4.30       4.43         0.85          4.28       3.80
Davinci Instruct v2 (175B*)            0.98          4.13       4.49         0.77          4.83       4.33

Table 2: Human evaluation results for zero-shot and five-shot LLMs, finetuned LMs, and reference summaries. We bold the entries that are not statistically significantly different from the best numbers in each column at p = 0.05, using a bootstrap-based paired mean difference test.
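The caption of Table 2 refers to a bootstrap-based paired mean difference test. A minimal sketch of one such test is given below; the number of resamples and the re-centering scheme are assumptions, since the paper does not spell out these details.

    import random

    def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
        """Two-sided bootstrap test for the mean difference between two systems
        scored on the same set of summaries (paired by index)."""
        assert len(scores_a) == len(scores_b)
        rng = random.Random(seed)
        diffs = [a - b for a, b in zip(scores_a, scores_b)]
        observed = sum(diffs) / len(diffs)

        # Resample paired differences with replacement, re-centered at zero
        # to simulate the null hypothesis of no mean difference.
        centered = [d - observed for d in diffs]
        hits = 0
        for _ in range(n_resamples):
            sample = [rng.choice(centered) for _ in range(len(centered))]
            if abs(sum(sample) / len(sample)) >= abs(observed):
                hits += 1
        return hits / n_resamples

Under this sketch, two systems whose p-value falls below 0.05 would be treated as significantly different for the corresponding column.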
Instruction Tuned Models Have Strong Summarization Ability.  Across the two datasets and three aspects, we find that the zero-shot instruction-tuned GPT-3 models, especially Instruct Curie and Davinci, perform the best overall. Compared to the finetuned LMs (e.g., Pegasus), Instruct Davinci achieves higher coherence and relevance scores (4.15 vs. 3.93 and 4.60 vs. 4.40) on CNN/DM and higher faithfulness and relevance scores (0.97 vs. 0.57 and 4.28 vs. 3.85) on XSUM, which is consistent with recent work (Goyal et al., 2022). In contrast to instruction tuning, we find scale to be less important. Even the largest 175B model often ignores the instruction and generates irrelevant content, while the much smaller Instruct Ada outperforms the 175B GPT-3 model on coherence and relevance.

In the five-shot setting, non-instruction-tuned LLMs can improve their summarization performance through in-context learning. For faithfulness scores on CNN/DM and coherence scores on XSUM, several non-instruction-tuned LLMs can perform as well as the instruction-tuned LLMs. However, for the other aspects, we still find the instruction-tuned LLMs to be better.

Reference Summaries in Current Benchmarks Should Not Be Used for Training and Evaluating Generic News Summarization Systems.  We arrive at this conclusion based on two observations. First, most automatic summarization systems score better than the reference summaries across all three aspects. Second, applying in-context learning with the current reference summaries makes instruction-tuned models generate worse summaries. For example, on the XSUM dataset, after conditioning on five reference summaries, the faithfulness score of Instruct Davinci drops from 0.97 to 0.77.

The reference summaries make it difficult to compare LLMs to both finetuned models and humans. When comparing to finetuned models, their relatively poor performance can be attributed to the low quality of references in the training data. This suggests we could be underestimating the potential performance of finetuning approaches. When comparing to humans, the existing low-quality references are not representative of actual human performance since they were created through heuristics. As a result, the differences between instruction-tuned LLMs and human performance are likely overstated in Table 2.
                CNN/DailyMail                          XSUM
Metric          Faithfulness  Coherence  Relevance     Faithfulness  Coherence  Relevance
Rouge-L         0.54          0.48       0.72          −0.27         0.71       0.30
METEOR          0.58          0.37       0.66          −0.22         0.68       0.38
BertScore       0.54          0.47       0.70          −0.23         0.70       0.30
BARTScore       0.56          0.34       0.65          −0.22         0.70       0.35
BLEURT          0.56          0.62       0.81          −0.08         0.67       0.41
SummaC          0.54          0.11       0.26          0.26          −0.41      −0.29
QAFactEval      0.64          0.16       0.35          0.55          0.16       0.37
BLANC           0.54          0.31       0.50          0.50          0.10       0.32

Table 3: System-level Kendall's tau correlation with human scores across different axes.
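Table 3 reports system-level correlations. The sketch below shows one way to compute such a value, assuming that "system-level" means correlating per-system average metric scores with per-system average human ratings for the same aspect; this aggregation granularity is inferred rather than stated explicitly.

    from scipy.stats import kendalltau

    def system_level_kendall(metric_by_system, human_by_system):
        """Kendall's tau between per-system average metric scores and
        per-system average human ratings (e.g., mean coherence per model)."""
        systems = sorted(metric_by_system)
        metric_means = [metric_by_system[s] for s in systems]
        human_means = [human_by_system[s] for s in systems]
        tau, _pvalue = kendalltau(metric_means, human_means)
        return tau

    # Example with hypothetical per-system averages:
    # system_level_kendall({"OPT": 0.30, "GLM": 0.28}, {"OPT": 0.67, "GLM": 0.74})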
Figure 2: Example summaries generated by GPT-3 models (Section 3) or written by freelance writers (Section 4) for an article from the CNN/DM dataset. We find that the instruction-tuned GPT-3 model can generate a much better summary compared to the non-instruction-tuned variant. The reference summary from CNN/DM is not coherent, whereas the freelance writer summary is both coherent and relevant.
Qualitative Examples.  Figure 2 showcases example summaries for an article from the CNN/DM validation set, comparing the summaries of zero-shot GPT-3 Davinci, instruction-tuned GPT-3 Davinci, and the CNN/DM reference summary. We start by noting that the zero-shot GPT-3 model cannot follow the instructions to summarize well. After the summary paragraph, the model generates an additional question that is completely irrelevant. In addition to the failure to follow instructions, the generated summary contains a factual error, stating that the handbag mentioned is the most expensive in the world, which contradicts the original article. In contrast, the instruction-tuned GPT-3 model generates a summary that is both faithful and coherent.

We also observe from Figure 2 that the reference summary is not coherent. The brand ''Hermes'' is not introduced until the end and its connection to the rest of the story is unclear. This is unsurprising as reference summaries in the CNN/DM dataset were originally bullet points accompanying the articles, as opposed to a coherent paragraph. While such reference summaries might be suitable in their original context, we argue that they are not useful for evaluating generic news summarization.
Figure 3: System-level Rouge-L vs. annotator-rated relevance scores.

3.3 Understanding Automatic Metrics

On XSUM, reference-based metrics have a very low correlation with faithfulness and relevance since the reference summaries themselves are terrible in these aspects (Table 3; also see Maynez et al., 2020). With such low-quality references, we do not expect reference-based metrics to extract useful information.

In general, across both datasets, we find that reference-based metrics correlate better with human judgments on the aspects for which reference summaries also have better scores (e.g., CNN/DM relevance, XSUM coherence). This points to the important role of quality reference summaries for reference-based metrics, as previously observed.
Writer Recruitment.  We recruit six writers who have had previous experience in writing blog posts, landing page introductions, or product descriptions from the freelance work platform Upwork. After conducting a qualification round by asking writers to summarize five articles, we selected the best writers according to the faithfulness, coherence, and relevance of their summaries. Through an initial pilot study, we estimate that the time required to summarize a CNN/DM or XSUM article is around 12 to 15 minutes. Therefore, we pay our writers $4 for every article they summarize (roughly $16 to $20 per hour at that pace), following the recommended practice (Whiting et al., 2019). We based the assignments

same annotation scheme in Section 3.1 using Mechanical Turkers. Table 4 reports the evaluation results.

Model                        Faithfulness  Coherence  Relevance
Freelance Writer             0.93          4.39       4.26
Zero-shot Instruct Davinci   0.98          4.26       4.40
Reference Summaries          0.64          3.59       3.45

Table 4: Amazon Mechanical Turker evaluation results of the freelance writer summaries. Results of zero-shot Instruct Davinci and reference summaries are taken from Table 2 after averaging the corresponding ratings.
In conclusion, we find that Instruct Davinci
summarizes in a very different style than human
writers. We emphasize here that the freelance
writers write in an abstractive style despite the
fact that we have not explicitly instructed them
to do so. We also observe similarly abstractive
styles across the six freelance writers.
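Abstractiveness of this kind is commonly quantified with the extractive fragment coverage and density statistics of Grusky et al. (2018), the same quantities cited below for annotator 1 (2.5 density and 0.86 coverage). The sketch approximates the greedy fragment matching from that work; the whitespace tokenization and lowercasing are assumptions rather than the paper's exact preprocessing.

    def extractive_fragments(article_tokens, summary_tokens):
        """Greedy shared-sequence matching in the spirit of Grusky et al. (2018)."""
        fragments = []
        i = 0
        while i < len(summary_tokens):
            best = []
            j = 0
            while j < len(article_tokens):
                if summary_tokens[i] == article_tokens[j]:
                    k = 0
                    while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                           and summary_tokens[i + k] == article_tokens[j + k]):
                        k += 1
                    if k > len(best):
                        best = summary_tokens[i:i + k]
                    j += max(k, 1)
                else:
                    j += 1
            if best:
                fragments.append(best)
            i += max(len(best), 1)
        return fragments

    def coverage_and_density(article, summary):
        a = article.lower().split()   # assumption: whitespace tokens, lowercased
        s = summary.lower().split()
        if not s:
            return 0.0, 0.0
        frags = extractive_fragments(a, s)
        coverage = sum(len(f) for f in frags) / len(s)
        density = sum(len(f) ** 2 for f in frags) / len(s)
        return coverage, density

Lower coverage and density indicate a more abstractive summary; a purely copied summary would have coverage close to 1 and a high density.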
Figure 5: Human evaluation results comparing summaries written by freelance writers and summaries generated by Instruct GPT-3 Davinci. On aggregate, annotators equally prefer freelance writers and Instruct Davinci. However, there is high variability in individual annotators' preferences. Notably, annotator 1 writes abstractive summaries but prefers the more extractive Instruct Davinci summaries.
Figure 5 shows the results of the paired comparison. While we hypothesized that the more abstractive writing style could lead to more informative summaries, we did not find a significant effect in our annotator pool, who rate the more abstractive summaries as more informative only 51.1% of the time. On the informativeness question, our annotators reached moderate agreement (Krippendorff's alpha is 0.32), validating our annotation instructions and recruitment process. Moving on to the more subjective overall preference, we find that our annotators equally prefer the freelance writer summaries and the Instruct Davinci summaries. However, a closer analysis shows that there is significant variability in individual annotators' preferences and the inter-annotator agreement is low (Krippendorff's alpha is 0.07). This suggests that the quality of generated summaries is getting close to that of the freelance writer summaries and that the comparison depends on each annotator's stylistic preference.

One example of such stylistic preference is seen in the results from annotator 1, who also participated in the first round of summary writing. Like the other writers, annotator 1 summarizes in an abstractive style (2.5 density and 0.86 coverage). However, annotator 1 prefers Instruct Davinci 57% of the time even though it generated much more extractive summaries. These results suggest an intriguing gap between annotator preferences when writing and when evaluating summaries.

4.3 Reevaluating Reference-based Metrics

In Section 3.3, we saw that the performance of automated metrics may depend on the quality of reference summaries. With the freelance writer summaries, we now conduct an initial study on the effect of using better-quality summaries. We focus on using Rouge-L for faithfulness evaluation on the XSUM dataset because the current reference summaries are known to be highly unfaithful (Maynez et al., 2020).

In Figure 6, we plot the system-level Rouge-L against the human ratings. The left plot shows the results of computing Rouge-L with the existing reference summaries from XSUM, which has a negative correlation with human ratings. This result matches our expectations because the existing reference summaries are highly unfaithful.
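The reference-swapping analysis described here can be sketched as follows. The code assumes the rouge-score package and scipy for Kendall's tau; the Rouge-L configuration (e.g., stemming) is an assumption, as the paper does not state it.

    from rouge_score import rouge_scorer
    from scipy.stats import kendalltau

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)  # stemming is an assumption

    def mean_rouge_l(summaries, references):
        """Average Rouge-L F1 of one system's summaries against a chosen reference set."""
        scores = [scorer.score(ref, hyp)["rougeL"].fmeasure
                  for hyp, ref in zip(summaries, references)]
        return sum(scores) / len(scores)

    def correlation_with_faithfulness(system_outputs, references, human_faithfulness):
        """system_outputs: {system: list of summaries}, references: one reference per article,
        human_faithfulness: {system: mean human faithfulness}. Returns Kendall's tau between
        system-level Rouge-L and system-level human faithfulness."""
        systems = sorted(system_outputs)
        rouge_means = [mean_rouge_l(system_outputs[s], references) for s in systems]
        human_means = [human_faithfulness[s] for s in systems]
        tau, _ = kendalltau(rouge_means, human_means)
        return tau

    # Swapping `references` from the original XSUM references to the freelance writer
    # summaries is what moves the correlation from negative to positive in Figure 6.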
Figure 7: Annotation guideline for freelance writers.
On the right, we see the results of computing Rouge-L with the freelance writer summaries, which leads to a much more positive correlation. Hence, we see that the usefulness of reference-based evaluation is closely linked to the quality of the references, and we can improve metric correlation by using better reference summaries.

5 Discussion

Implication for Model Development.  In this study, we systematically evaluate a diverse set of LLMs and find that instruction tuning contributes the most to LLMs' summarization capability. We believe that there is much research beyond our benchmarking effort that needs to be done to better understand the effect of instruction tuning. Here we hypothesize three aspects that could account for the success of instruction tuning.

First, the quality of the summarization data used in instruction tuning can play an important role. Our findings in Section 3 show that, currently, we are finetuning language models on low-quality training data, which can account for their ineffectiveness. At this point, we cannot rule out the possibility that finetuned LMs may perform much better when finetuned on higher-quality data.

Second, the learning algorithm used for instruction tuning can be important (Ouyang et al., 2022). While the exact training details are unknown, the success of Instruct Davinci might be credited to ''learning from human feedback'' (LHF; Stiennon et al., 2020; Ziegler et al., 2019). Contrary to supervised finetuning, which trains systems on written summaries, learning from human feedback trains systems from binary labels of human preferences. As we observe in Section 4.2, there is a discrepancy in how annotators write and rate summaries. While it is possible that LHF has merits over the supervised learning/finetuning approach in exploiting this discrepancy, more analysis is needed to validate this hypothesis.

Third, multi-task learning can be important. Instruct Davinci is trained on a diverse distribution of inputs, and many previous studies have confirmed the effectiveness of multi-task learning. We look forward to understanding how summarization benefits from learning on other tasks.
Figure 8: MTurk annotation guideline for summary quality evaluation.
Implication for Summarization Evaluation.  Our work also reveals the difficulties in evaluating high-performance LLMs. As LLMs become increasingly close to human-level performance, human evaluation requires a larger number of samples and less noisy measurements to evaluate the quality of LLMs. Recently, Liu et al. (2022a) also pointed out the difficulties in conducting human evaluation for summarization and advocated using fine-grained semantic units to match with reference summaries. However, as our evaluation points out, not only are the existing reference summaries unreliable, but the summaries written by well-paid freelance writers also may not outperform LLM summaries significantly. Therefore, defining reference summaries as the ground truth may be overly restrictive as LLMs are approaching or even exceeding average human-level performance.

We acknowledge that summarization evaluation is dependent on the application scenario and that the existing reference summaries could be suitable in another context. For example, the bullet-point style summaries in CNN/DM may suffice for being displayed on news websites.
The quality issues (such as coherence) we pointed out in this paper may not constitute a concern in specific application scenarios. However, we emphasize that research on single-document news summarization is often abstracted away from the downstream applications and used for judging generic summarization capability. Our findings in this paper are tied to this research context. This is the reason why the major results of our study rely on new summaries written by freelance writers.

Not only is human evaluation limited by the reference quality, but it is also affected by the subjectivity in evaluation. Individual variation shows that what makes a good summary will remain an open question, and the current benchmarks will provide limited value when used with reference-based evaluation. Even when we address the quality issue and conduct a human evaluation with high-quality references, we observe a significant amount of individual variation from our annotator pool. Due to these factors, evaluations for single-document news summarization may be reaching their limits.

Acknowledgments

This work is supported by an Open Philanthropy grant.
References

Regina Barzilay and Kathleen R. McKeown. 2005. Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3):297–328. https://doi.org/10.1162/089120105774321091

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 550–557, College Park, Maryland, USA. Association for Computational Linguistics. https://doi.org/10

Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei A. Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter,
Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Cohere. 2022. Introduction to large language models. https://docs.cohere.ai/docs/introduction-to-large-language-models

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008),

Esin Durmus, Faisal Ladhak, and Tatsunori Hashimoto. 2022. Spurious correlations in reference-free evaluation of text generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://doi.org/10.18653/v1/2022.acl-long.102

Güneş Erkan and Dragomir R. Radev. 2004. Lexrank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research. https://doi.org/10.1613/jair.1523

Alexander Fabbri, Chien-Sheng Wu, Wenhao
Tanya Goyal, Junyi Jessy Li, and Greg Durrett. 2022. News summarization and evaluation in the era of gpt-3. ArXiv, abs/2209.12356.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In North American Chapter of the Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1065

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend.

defined and leads to more preferred results than generic sentence fusion. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 193–196. https://doi.org/10.3115/1557690.1557745

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2021. Summac: Re-visiting nli-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10:163–177. https://doi.org/10.1162/tacl_a_00453
Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Annual Meeting of the Association for Computational Linguistics.

Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3730–3740, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19

Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 297–304.

Rada Mihalcea and Paul Tarau. 2005. Multi-document summarization with iterative graph-based algorithms. In Proceedings of the First International Conference on Intelligent Analysis Methods and Tools (IA 2005). McLean, VA.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization.
Sentence extraction, utility-based evaluation, and user studies. In NAACL-ANLP 2000 Workshop: Automatic Summarization.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. https://doi.org/10.18653/v1/D15-1044

Gerard Salton, Amit Singhal, Mandar Mitra, and Chris Buckley. 1997. Automatic text structuring and summarization. Information Processing and Management.

H. Gregory Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496. https://doi.org/10.1162/089120102762671954

Josef Steinberger, Massimo Poesio, Mijail A. Kabadjov, and Karel Ježek. 2007. Two uses of anaphora resolution in summarization. Information Processing and Management, 43(6):1663–1680. https://doi.org/10.1016/j.ipm.2007.01.010
Noah A. Smith, and Daniel Khashabi. 2022. Benchmarking generalization via in-context instructions on 1,600+ language tasks. arXiv preprint arXiv:2204.07705.

Mark E. Whiting, Grant Hugh, and Michael S. Bernstein. 2019. Fair work: Crowd work minimum wage with one line of code. In AAAI Conference on Human Computation & Crowdsourcing. https://doi.org/10.1609/hcomp.v7i1.5283

Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback.

with extracted gap-sentences for abstractive summarization. In ICML.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. Opt: Open pre-trained transformer language models. ArXiv, abs/2205.01068.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.