Human-like Summarization Evaluation with ChatGPT
Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan
Wangxuan Institute of Computer Technology, Peking University
{gaomingqi, xjyin, wanxiaojun}@pku.edu.cn
{ruanjie, sunrenliang}@stu.pku.edu.cn
[email protected]
Abstract

We explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation. Additionally, it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.

1 Introduction

Text summarization is a task that involves generating a condensed version of one or multiple documents. Thanks to the advancements in deep learning-based techniques, automatic summarization has made significant strides. Specifically, the emergence of large language models such as InstructGPT has resulted in generated summaries that are comparable to reference summaries written by humans, even in zero-shot settings (Zhang et al., 2023).

Evaluating text summarization, like other text generation tasks, is a challenging problem. While human evaluation is considered the gold standard, it is expensive and time-consuming. As a result, automatic evaluation metrics play a crucial role. ROUGE (Lin, 2004) and its variants, which are based on reference summaries and n-gram matching, are widely accepted and used in various types of summarization. However, surface-level word matching cannot accurately reflect the quality of a summary. Additionally, it is challenging to evaluate the factual accuracy of a summary without utilizing the source document. Recently, evaluation metrics based on pre-trained models, such as FactCC (Kryscinski et al., 2020), and question answering, such as FEQA (Durmus et al., 2020), have also been used to evaluate the factual consistency of summaries. Despite the existence of advanced automatic evaluation metrics, their performance, usability, and interpretability are still far from satisfactory.

Large language models (LLMs) offer completely different possibilities for the automatic evaluation of summarization. GPT-3 (Brown et al., 2020) has the ability of in-context learning, and instruction tuning allows LLMs to align with human evaluation (Ouyang et al., 2022). These two abilities make it possible for LLMs to mimic the behavior of human evaluators, who generally evaluate summaries by understanding examples and instructions. We refer to this automatic evaluation method, which views large models as human evaluators, as human-like automatic evaluation. The most prominent feature of this evaluation method is its flexibility: it unifies all types of automatic evaluation in form and can simulate many of the practices of human evaluators. Unlike previous automatic evaluation metrics, which give one or more numerical values as evaluation results, the results of human-like automatic evaluation are fully reflected in the generated responses, which may include scores, comparisons, labels, and explanations.

We conducted an evaluation of the evaluation ability of ChatGPT, a recently popular LLM, using four commonly used human evaluation methods for summarization: Likert scale scoring, pairwise comparison, Pyramid (Nenkova and Passonneau, 2004), and binary factuality evaluation. Our findings indicate that ChatGPT is capable of completing annotations relatively smoothly using these methods. In addition, our results demonstrate that ChatGPT outperforms commonly used automatic evaluation metrics on some datasets. Furthermore, we analyzed the impact of different prompts, compared the performance of ChatGPT with human evaluation, and examined the quality of the generated explanations and invalid responses.
2 Preliminary

2.1 Automatic Evaluation Metrics

We select several evaluation metrics that are commonly used in summarization:

ROUGE (Lin, 2004), the dominant automatic evaluation metric in summarization, is widely used by researchers. The most commonly used ROUGE measures are ROUGE-1, ROUGE-2, and ROUGE-L, which evaluate the similarity between two texts based on the overlap of unigrams, bigrams, and the longest common subsequence, respectively.
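To make the n-gram overlap concrete, here is a simplified, recall-oriented sketch of ROUGE-N against a single reference (the official package adds stemming, ROUGE-L, and other details):

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """Recall-oriented n-gram overlap between a candidate and one reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

print(rouge_n_recall("the cat sat on the mat", "the cat lay on the mat", n=2))  # 0.6
```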
BERTScore (Zhang et al., 2020) assesses the similarity between two texts at the token level by measuring the soft overlap using contextual embeddings from BERT. Similarly, MoverScore (Zhao et al., 2019) uses n-gram embeddings that are pooled from BERT to compute the semantic distance between two texts at the n-gram level.
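As an illustration, assuming the bert-score package is installed, BERTScore can be computed roughly as follows (a sketch, not necessarily the configuration used in the experiments reported here):

```python
# pip install bert-score
from bert_score import score

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

# Token-level soft overlap with contextual embeddings; returns precision, recall, F1 tensors.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```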
BARTScore (Yuan et al., 2021)[1] views evaluation as a natural language generation task and considers that when the quality of the generated text is higher, BART is more likely to generate it from the source text or the reference, or to generate the reference from it. BARTScore can be flexibly applied to evaluate text from various perspectives.
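The released BARTScore package provides its own scorer; the following is only a sketch of the underlying idea (scoring a summary by the likelihood that BART generates it from the source), written with Hugging Face Transformers rather than the official implementation:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# Sketch of the BARTScore idea: the (negated) mean token cross-entropy of generating
# the summary conditioned on the source. Not the official implementation.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def source_to_summary_score(source: str, summary: str) -> float:
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    labels = tokenizer(summary, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean token cross-entropy
    return -loss.item()  # higher is better

print(source_to_summary_score("The cat sat on the mat all day.", "A cat sat on a mat."))
```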
FactCC[2] and DAE[3] are two factuality metrics based on classification. When evaluating a summary, we use NLTK[4] to split it into individual sentences and classify each one as factually correct or not. The factual score of the summary is then calculated as the ratio of sentences that are factually correct.
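A minimal sketch of this sentence-level scoring procedure; the is_factual callable is a hypothetical stand-in for a classifier such as FactCC or DAE:

```python
import nltk

nltk.download("punkt", quiet=True)

def factual_score(summary: str, is_factual) -> float:
    """Split the summary into sentences and return the fraction judged factual.
    `is_factual` stands in for a classifier such as FactCC or DAE."""
    sentences = nltk.sent_tokenize(summary)
    if not sentences:
        return 0.0
    return sum(is_factual(s) for s in sentences) / len(sentences)

# Toy stand-in classifier, for illustration only.
print(factual_score("The cat sat. The dog flew.", lambda s: "flew" not in s))  # 0.5
```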
2.2 Human Evaluation Methods

There are several commonly used methods for human evaluation, including Likert scale scoring and pairwise comparison for general text generation, as well as Pyramid and binary factuality evaluation specifically designed for summarization. After introducing each method, we will list the datasets we used that were annotated in this way.

Likert scale scoring is the most common method for human evaluation. Specifically, given a source document and a generated summary, annotators rate the summary on several dimensions. Typically, this is an absolute evaluation, meaning each summary is evaluated individually without explicit comparison to other summaries. Dimensions usually include factual consistency, informativeness, fluency, etc. The rating scale is usually 1 (worst) to 5 (best). We used the SummEval (Fabbri et al., 2021) and Newsroom (Grusky et al., 2018) datasets.

Pairwise comparison is a relative human evaluation method. Given a source document and two generated summaries, annotators choose the one that is of higher quality. This method is used in reinforcement learning from human feedback for summarization. We used the TLDR dataset (Stiennon et al., 2022).

Pyramid (Nenkova and Passonneau, 2004) is a human evaluation method designed for summarization that is based on reference summaries. Prior to human annotation, several semantic content units (SCUs) are extracted from the reference summary. For each SCU, annotators judge whether it is present in the generated summary. For single-document summarization, the final score of the summary is the proportion of SCUs it contains. We used the REALSumm dataset (Bhandari et al., 2020).

Binary factuality evaluation is a method for evaluating the factual correctness of summaries. Given a source document and a sentence from the generated summary, annotators judge whether the sentence is faithful to the source document. We used the QAGS dataset (Wang et al., 2020).

3 Experiments

3.1 Model and Parameters

We used the ChatGPT API (gpt-3.5-turbo-0301) provided by OpenAI for our experiments. To reduce randomness, we set temperature to 0. In addition, we set max_tokens to 256. We kept the default values for other parameters.
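For reference, a minimal sketch of such an API call with these settings, using the openai Python library interface that was current at the time (pre-1.0); the helper function name is ours:

```python
import openai  # openai<1.0 interface, as used at the time of these experiments

openai.api_key = "YOUR_API_KEY"  # placeholder

def evaluate(prompt: str) -> str:
    """Send one evaluation prompt to gpt-3.5-turbo-0301 and return the response text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # reduce randomness
        max_tokens=256,  # cap the length of the generated evaluation
    )
    return response["choices"][0]["message"]["content"]
```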
3.2 Prompt Design

When designing prompts, we kept them as identical as possible to the original instructions of the human evaluations.

[1] https://github.com/neulab/BARTScore; also used for ROUGE, BERTScore, and MoverScore.
[2] https://github.com/salesforce/factCC
[3] https://github.com/tagoyal/factuality-datasets
[4] Version 3.7, https://www.nltk.org/
Evaluate the quality of summaries written for a news article. Rate each summary on four dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should rate on a scale from 1 (worst) to 5 (best).

Article: {Article}
Summary: {Summary}

Figure 1: The template for Likert scale scoring.

Is the sentence supported by the article? Answer "Yes" or "No".

Article: {Article}
Sentence: {Sentence}

Figure 4: The template for binary factuality evaluation. The sentences are from the generated summaries.
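As an illustration, the template in Figure 1 might be instantiated as follows (a sketch; the concrete dimension names and the helper function are our own illustrative choices):

```python
# A sketch of how the Figure 1 template might be filled in.
LIKERT_TEMPLATE = (
    "Evaluate the quality of summaries written for a news article. "
    "Rate each summary on four dimensions: {dim_1}, {dim_2}, {dim_3}, and {dim_4}. "
    "You should rate on a scale from 1 (worst) to 5 (best).\n\n"
    "Article: {article}\n"
    "Summary: {summary}"
)

def build_likert_prompt(article: str, summary: str) -> str:
    return LIKERT_TEMPLATE.format(
        dim_1="relevance", dim_2="consistency", dim_3="fluency", dim_4="coherence",
        article=article, summary=summary,
    )

print(build_likert_prompt("Some news article ...", "A candidate summary ..."))
```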
Table 1: Spearman’s ρ of sample level, system level, and dataset level on SummEval.
Table 2: Spearman’s ρ of sample level, system level, and dataset level on Newsroom.
Table 6: Spearman’s ρ of sample level, system level, and dataset level on SummEval. Annotator_0, Annotator_1,
Annotator_2 refer to the three expert annotators. We compute the correlation coefficient between the score given
by a particular annotator and the average score of the three. "+def" means adding dimension definitions in the
prompt. "+ins" means adding step instructions in the prompt. Please see the example in Figure 5 for dimension
definitions and step instructions. "+sys_prompt" denotes setting system prompt.
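For reference, the three correlation levels reported in these tables are commonly computed as sketched below (under the usual definitions of sample-, system-, and dataset-level correlation; the exact setup used for the reported numbers may differ):

```python
import numpy as np
from scipy.stats import spearmanr

# metric and human are score matrices of shape (num_documents, num_systems).
def sample_level(metric, human):
    # Correlate across systems within each document, then average over documents.
    return np.mean([spearmanr(m, h).correlation for m, h in zip(metric, human)])

def system_level(metric, human):
    # Correlate the per-system average scores.
    return spearmanr(metric.mean(axis=0), human.mean(axis=0)).correlation

def dataset_level(metric, human):
    # Pool all (document, system) pairs.
    return spearmanr(metric.ravel(), human.ravel()).correlation

rng = np.random.default_rng(0)
human = rng.random((100, 16))
metric = human + 0.3 * rng.random((100, 16))  # a toy metric loosely tracking human scores
print(sample_level(metric, human), system_level(metric, human), dataset_level(metric, human))
```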
Summary: Paul Merson was brought on with only seven minutes remaining in his team's 0-0 draw with Burnley. Andros Townsend scored the Tottenham midfielder in the 89th minute. Paul Merson had another dig at Andros Townsend after his appearance. The midfielder had been brought on to the England squad last week. Click here for all the latest arsenal news.

Expert annotator:
Relevance: 1.67 (2, 1, 2)
Consistency (Faithfulness): 1.00 (1, 1, 1)
Fluency: 3.00 (4, 2, 3)
Coherence: 1.33 (2, 1, 1)

ChatGPT:
Relevance: 2
Faithfulness: 2
Fluency: 2
Coherence: 2
Overall score: 2
This summary is not very relevant to the article as it only mentions a few details and leaves out important information. It is not very faithful as it misrepresents the article by suggesting that Andros Townsend scored in the Tottenham game, when in fact he was only brought on as a substitute. The fluency and coherence are also poor, with awkward phrasing and lack of logical flow.

ChatGPT+def:
Relevance: 2
Consistency: 1
Fluency: 2
Coherence: 1
Explanation:
The summary is not relevant as it does not capture the key points of the article. It is inconsistent with the facts in the original article, as it incorrectly states that Andros Townsend scored in the 89th minute, when in fact he was brought on in the 83rd minute. The fluency of the summary is poor, with grammatical errors and incomplete sentences. The coherence of the summary is also poor, as it does not flow well and does not make sense as a whole.

Table 7: Examples of ratings and explanations generated by ChatGPT. The red parts show the factual errors of the summary. The blue parts show the explanations for fluency and coherence. The long source document is not shown. "Expert Annotator" shows the average of scores given by three expert annotators, with each person's score in parentheses.
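Ratings in responses like those shown in Table 7 can be extracted with a simple pattern; the parser below is an illustrative sketch, not the script used for the reported results:

```python
import re

def parse_ratings(response: str) -> dict:
    """Extract lines such as 'Relevance: 2' from a generated evaluation."""
    pattern = re.compile(r"^\s*([A-Za-z ()]+):\s*([1-5])\s*$", re.MULTILINE)
    return {dim.strip().lower(): int(score) for dim, score in pattern.findall(response)}

example = "Relevance: 2\nConsistency: 1\nFluency: 2\nCoherence: 1\nExplanation:\nThe summary is not relevant..."
print(parse_ratings(example))  # {'relevance': 2, 'consistency': 1, 'fluency': 2, 'coherence': 1}
```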
Imagine you are a human annotator now. You will evaluate the quality of summaries written for a news
article. Please follow these steps:
1. Carefully read the news article, and be aware of the information it contains.
2. Read the proposed summary.
3. Rate the summary on four dimensions: relevance, consistency, fluency, and coherence. You should rate
on a scale from 1 (worst) to 5 (best).
Figure 5: The template for Likert scale scoring with step instructions (in red) and dimension definitions (in orange)
on SummEval.
Human-like evaluation with ChatGPT is reproducible by setting randomness parameters (e.g., temperature) at decoding time.

In terms of cost, it is cheaper to perform the human-like automatic evaluation. Taking SummEval as an example, in our experiments, the assessment of one summary consumed about 1000 tokens, and it took about 0.002 × 1600 = 3.2 USD[6] to finish the evaluation on the whole dataset. Assuming that a single annotator spends 5 hours annotating the whole dataset, it costs 12 × 5 = 60 USD. It is estimated that the cost of human evaluation is about 10 to 20 times higher than human-like automatic evaluation using ChatGPT.

[6] https://openai.com/pricing
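The arithmetic behind this estimate, spelled out (a sketch; the prices and times are those quoted above):

```python
# Back-of-the-envelope cost comparison for SummEval (1,600 summaries).
tokens_per_summary = 1_000
price_per_1k_tokens = 0.002                      # USD per 1K tokens at the time
num_summaries = 1_600

chatgpt_cost = num_summaries * tokens_per_summary / 1_000 * price_per_1k_tokens
human_cost = 12 * 5                              # 12 USD/hour * 5 hours for one annotator

print(chatgpt_cost, human_cost, human_cost / chatgpt_cost)  # 3.2 60 18.75
```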
4.3 The quality of generated explanations

We sampled and examined the responses generated by ChatGPT on SummEval, and found the following characteristics of the explanations given by ChatGPT:

ChatGPT sometimes provides scores or labels followed by an explanation, even if it is not explicitly asked to provide the explanation in the prompt. Of course, it is possible to add a request such as "You do not need to explain." to the prompt so that it does not generate an explanation, but the impact of this on the evaluation scores is unknown.

The explanations generated by ChatGPT are generally self-consistent but not necessarily correct. The generated explanations generally coincide with its scoring. For example, Table 7 shows that ChatGPT and ChatGPT+def both scored low for the faithfulness of the summary, and they both pointed out factual errors in the summary. However, the correctness of these explanations still needs further testing.

The combination of ChatGPT's explanations and scoring can better confirm whether it understands the requirements of the evaluation, for example, the dimension definitions. Without providing dimension definitions (see Figure 5), ChatGPT's understanding of fluency and coherence converged: after examining multiple samples, we found that its explanations of the scoring of these two dimensions are placed together, and the dataset-level correlation between the scoring of these two dimensions is 0.960. ChatGPT is better able to distinguish between these two dimensions when dimension definitions are provided: its explanations of the scoring of the two dimensions are separated, and the dataset-level correlation between the two dimensions drops to 0.843.
4.4 Invalid responses

ChatGPT sometimes generates invalid responses, but this fraction is only about 1% at most (see Table 9). As shown in Table 8, common invalid responses were refusing to evaluate, not evaluating as required, writing a new summary, and continuing to write the summary. The reason why invalid responses are generated needs to be further explored.

Refuse to evaluate: "I'm sorry, but the proposed summary is not acceptable. It contains inaccurate and irrelevant information, and it is not coherent or fluent. Please provide a new summary for evaluation."

Not evaluate as required: "As an AI language model, I cannot evaluate the quality of summaries as it requires subjective judgment. However, I can provide feedback on the summary's coherence, fluency, and relevance. The proposed summary is poorly written and [......]"

Write a new summary: "As an AI language model, I cannot evaluate the quality of summaries as it requires subjective human judgment. However, I can provide a summary of the article that is more accurate and relevant:"

Table 8: Examples of invalid responses generated by ChatGPT.
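One possible heuristic for flagging responses like those in Table 8 is sketched below; the paper does not specify how invalid responses were identified, so the markers are illustrative assumptions:

```python
# A heuristic filter for responses like those in Table 8 (illustrative only).
REFUSAL_MARKERS = (
    "as an ai language model, i cannot evaluate",
    "please provide a new summary",
)

def looks_invalid(response: str) -> bool:
    """Return True if the response resembles a refusal rather than an evaluation."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)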
                     Invalid responses
ChatGPT              0.0000
ChatGPT+def          0.0003
ChatGPT+def+ins      0.0106
ChatGPT+sys_prompt   0.0013

Table 9: Proportions of invalid responses generated by ChatGPT on SummEval.

5 Related Work

There are some concurrent studies using LLMs for human-like NLG evaluation. According to Kocmi and Federmann (2023), LLMs are currently the most advanced evaluators of translation quality. Wang et al. (2023) tested ChatGPT's ability to be an evaluator on three NLG meta-evaluation datasets. Ji et al. (2023) explored the effectiveness of ChatGPT in ranking model-generated content. Luo et al. (2023) investigated ChatGPT's ability to evaluate factual consistency in summarization. Liu et al. (2023) utilized ChatGPT and GPT-4 to assess the quality of NLG outputs with chain-of-thought prompting.
6 Conclusion

From the above experiments using ChatGPT for human-like summarization evaluation, the key findings are as follows:

• ChatGPT has the ability to perform summarization evaluation using various human evaluation methods. In some instances, it attains a higher correlation with human judgments than existing evaluation metrics.

• The performance of ChatGPT on summarization evaluation is highly dependent on prompt design.

• Human-like evaluation with ChatGPT is more cost-effective and reproducible than human evaluation.

• The explanations generated by ChatGPT are generally consistent with its scoring.

References

Anya Belz, Anastasia Shimorina, Shubham Agarwal, and Ehud Reiter. 2021. The ReproGen shared task on reproducibility of human evaluations in NLG: Overview and results. In Proceedings of the 14th International Conference on Natural Language Generation, pages 249–258, Aberdeen, Scotland, UK. Association for Computational Linguistics.

Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Computing Research Repository, arXiv:2005.14165. Version 4.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, Online. Association for Computational Linguistics.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719, New Orleans, Louisiana. Association for Computational Linguistics.

Yunjie Ji, Yan Gong, Yiping Peng, Chao Ni, Peiyan Sun, Dongyu Pan, Baochang Ma, and Xiangang Li. 2023. Exploring ChatGPT's ability to rank content: A preliminary study on consistency with human preferences. Computing Research Repository, arXiv:2303.07610.

Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. Computing Research Repository, arXiv:2302.14520.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. GPTEval: NLG evaluation using GPT-4 with better human alignment.

Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. 2023. ChatGPT as a factual inconsistency evaluator for abstractive text summarization. Computing Research Repository, arXiv:2303.15621.

Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Computing Research Repository, arXiv:2203.02155.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2022. Learning to summarize from human feedback. Computing Research Repository, arXiv:2009.01325. Version 3.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. Computing Research Repository, arXiv:2303.04048.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. BARTScore: Evaluating generated text as text generation. Computing Research Repository, arXiv:2106.11520. Version 2.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. Computing Research Repository, arXiv:1904.09675. Version 3.

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2023. Benchmarking large language models for news summarization. Computing Research Repository, arXiv:2301.13848.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.