
Human-like Summarization Evaluation with ChatGPT

Mingqi Gao, Jie Ruan, Renliang Sun, Xunjian Yin, Shiping Yang, Xiaojun Wan
Wangxuan Institute of Computer Technology, Peking University
{gaomingqi, xjyin, wanxiaojun}@pku.edu.cn
{ruanjie, sunrenliang}@stu.pku.edu.cn
[email protected]

arXiv:2304.02554v1 [cs.CL] 5 Apr 2023

Abstract

Evaluating text summarization is a challenging problem, and existing evaluation metrics are far from satisfactory. In this study, we explored ChatGPT's ability to perform human-like summarization evaluation using four human evaluation methods on five datasets. We found that ChatGPT was able to complete annotations relatively smoothly using Likert scale scoring, pairwise comparison, Pyramid, and binary factuality evaluation. Additionally, it outperformed commonly used automatic evaluation metrics on some datasets. Furthermore, we discussed the impact of different prompts, compared its performance with that of human evaluation, and analyzed the generated explanations and invalid responses.

1 Introduction

Text summarization is a task that involves generating a condensed version of one or multiple documents. Thanks to advancements in deep learning-based techniques, automatic summarization has made significant strides. Specifically, the emergence of large language models such as InstructGPT has resulted in performance comparable to reference summaries written by humans, even in zero-shot settings (Zhang et al., 2023).

Evaluating text summarization, like other text generation tasks, is a challenging problem. While human evaluation is considered the gold standard, it is expensive and time-consuming. As a result, automatic evaluation metrics play a crucial role. ROUGE (Lin, 2004) and its variants, which are based on reference summaries and n-gram matching, are widely accepted and used in various types of summarization. However, surface-level word matching cannot accurately reflect the quality of a summary, and it is challenging to evaluate the factual accuracy of a summary without utilizing the source document. Recently, evaluation metrics based on pre-trained models, such as BERTScore (Zhang et al., 2020) and BARTScore (Yuan et al., 2021), have achieved better correlation with human judgments. Factuality evaluation methods based on entailment classification, such as FactCC (Kryscinski et al., 2020), and on question answering, such as FEQA (Durmus et al., 2020), have also been used to evaluate the factual consistency of summaries. Despite the existence of advanced automatic evaluation metrics, their performance, usability, and interpretability are still far from satisfactory.

Large language models (LLMs) offer completely different possibilities for the automatic evaluation of summarization. GPT-3 (Brown et al., 2020) has the ability of in-context learning, and instruction tuning allows LLMs to align with human evaluation (Ouyang et al., 2022). These two abilities make it possible for LLMs to mimic the behavior of human evaluators, who generally evaluate summaries by understanding examples and instructions. We refer to this automatic evaluation method, which views large models as human evaluators, as human-like automatic evaluation. The most prominent feature of this evaluation method is its flexibility: it unifies all types of automatic evaluation in form and can simulate many of the practices of human evaluators. Unlike previous automatic evaluation metrics, which give one or more numerical values as evaluation results, the results of human-like automatic evaluation are fully reflected in the generated responses, which may include scores, comparisons, labels, and explanations.

We conducted an evaluation of the evaluation ability of ChatGPT, a recently popular LLM, using four commonly used human evaluation methods for summarization: Likert scale scoring, pairwise comparison, Pyramid (Nenkova and Passonneau, 2004), and binary factuality evaluation. Our findings indicate that ChatGPT is capable of completing annotations relatively smoothly using these methods. In addition, our results demonstrate that ChatGPT outperforms commonly used automatic evaluation metrics on some datasets. Furthermore, we analyzed the impact of different prompts, compared the performance of ChatGPT with that of human evaluation, and examined the quality of the generated explanations and the invalid responses.
2 Preliminary

2.1 Automatic Evaluation Metrics

We select several evaluation metrics that are commonly used in summarization:

ROUGE (Lin, 2004), the dominant automatic evaluation metric in summarization, is widely used by researchers. The most commonly used ROUGE measures are ROUGE-1, ROUGE-2, and ROUGE-L, which evaluate the similarity between two texts based on the overlap of unigrams, bigrams, and the longest common subsequence, respectively.

BERTScore (Zhang et al., 2020) assesses the similarity between two texts at the token level by measuring the soft overlap of contextual embeddings from BERT. Similarly, MoverScore (Zhao et al., 2019) uses n-gram embeddings pooled from BERT to compute the semantic distance between two texts at the n-gram level.
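As a concrete illustration of the reference-based metrics above, the sketch below scores a single candidate/reference pair. It assumes the rouge_score and bert_score Python packages purely for illustration; the experiments in this paper use the implementations bundled with the BARTScore repository (see footnote 1).

    # Illustrative only: ROUGE via n-gram/LCS overlap, BERTScore via soft
    # token overlap with contextual BERT embeddings.
    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    reference = "The cat sat on the mat."
    candidate = "A cat was sitting on the mat."

    # ROUGE-1 / ROUGE-2 / ROUGE-L against the reference summary.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    print({name: s.fmeasure for name, s in rouge.items()})

    # BERTScore: precision, recall, and F1 from contextual embeddings.
    P, R, F1 = bert_score([candidate], [reference], lang="en")
    print("BERTScore F1:", F1.item())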
BARTScore (Yuan et al., 2021)[1] views evaluation as a natural language generation task and considers that the higher the quality of the generated text, the more likely BART is to generate it from the source text or from the reference, or to generate the reference from it. BARTScore can be flexibly applied to evaluate text from various perspectives.

FactCC[2] and DAE[3] are two factuality metrics based on classification. When evaluating a summary, we use NLTK[4] to split it into individual sentences and classify each one as factually correct or not. The factual score of the summary is then calculated as the ratio of sentences that are factually correct.

[1] https://github.com/neulab/BARTScore, also for ROUGE, BERTScore, and MoverScore.
[2] https://github.com/salesforce/factCC
[3] https://github.com/tagoyal/factuality-datasets
[4] version 3.7, https://www.nltk.org/
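The sentence-level factual score described above reduces to a simple ratio. Below is a minimal sketch of that computation; is_factual is a hypothetical stand-in for a classifier such as FactCC or DAE, not an interface either tool actually exposes under that name.

    import nltk  # nltk.download("punkt") may be required once

    def factual_score(summary: str, document: str, is_factual) -> float:
        """Fraction of summary sentences judged faithful to the source document."""
        sentences = nltk.sent_tokenize(summary)  # split the summary with NLTK
        if not sentences:
            return 0.0
        correct = sum(1 for s in sentences if is_factual(s, document))
        return correct / len(sentences)  # ratio of factually correct sentences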
2.2 Human Evaluation Methods

There are several commonly used methods for human evaluation, including Likert scale scoring and pairwise comparison for general text generation, as well as Pyramid and binary factuality evaluation, which are specifically designed for summarization. After introducing each method, we list the datasets we used that were annotated in this way.

Likert scale scoring is the most common method for human evaluation. Given a source document and a generated summary, annotators rate the summary on several dimensions. Typically, this is an absolute evaluation: each summary is evaluated individually, without explicit comparison to other summaries. The dimensions usually include factual consistency, informativeness, fluency, etc., and the rating scale is usually 1 (worst) to 5 (best). We used the SummEval (Fabbri et al., 2021) and Newsroom (Grusky et al., 2018) datasets.

Pairwise comparison is a relative human evaluation method. Given a source document and two generated summaries, annotators choose the one of higher quality. This method is used in reinforcement learning from human feedback for summarization. We used the TLDR dataset (Stiennon et al., 2022).

Pyramid (Nenkova and Passonneau, 2004) is a human evaluation method designed for summarization that is based on reference summaries. Prior to human annotation, several semantic content units (SCUs) are extracted from the reference summary. For each SCU, annotators judge whether it is present in the generated summary. For single-document summarization, the final score of the summary is the proportion of SCUs it contains. We used the REALSumm dataset (Bhandari et al., 2020).

Binary factuality evaluation is a method for evaluating the factual correctness of summaries. Given a source document and a sentence from the generated summary, annotators judge whether the sentence is faithful to the source document. We used the QAGS dataset (Wang et al., 2020).
2
https://github.com/salesforce/factCC
3 When designing prompts, we made it as identical
https://github.com/tagoyal/
factuality-datasets as possible to the original instructions of human
4
version 3.7, https://www.nltk.org/ evaluations.
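A minimal sketch of a single evaluation request with these parameters follows. The model name, temperature, and max_tokens values are the ones reported above; the surrounding client code (the legacy openai<1.0 ChatCompletion interface that was current for gpt-3.5-turbo-0301) is an illustrative assumption, not the authors' script.

    import openai

    def query_chatgpt(prompt: str) -> str:
        """Send one evaluation prompt and return the generated annotation text."""
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo-0301",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,    # reduce randomness
            max_tokens=256,   # cap the length of the generated annotation
        )
        return response["choices"][0]["message"]["content"]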
3.2 Prompt Design

When designing the prompts, we kept them as close as possible to the original instructions of the human evaluations. Figure 1 shows the template for Likert scale scoring: ChatGPT is asked to rate four dimensions at a time. For SummEval, the four dimensions are relevance, faithfulness[5], fluency, and coherence; for Newsroom, they are relevance, informativeness, fluency, and coherence. Figure 2 shows the template for pairwise comparison. Figure 3 shows the template for Pyramid; the number of SCUs depends on the content of the reference summary, up to 16. Figure 4 shows the template for binary factuality evaluation; the sentences are taken from the generated summaries.

[5] The original term used in SummEval was "consistency". Since we did not add definitions of the dimensions in the prompt, we used "faithfulness", which is more representative of its actual meaning.

Evaluate the quality of summaries written for a news article. Rate each summary on four dimensions: {Dimension_1}, {Dimension_2}, {Dimension_3}, and {Dimension_4}. You should rate on a scale from 1 (worst) to 5 (best).

Article: {Article}
Summary: {Summary}

Figure 1: The template for Likert scale scoring.

Given a new article, which summary is better? Answer "Summary 0" or "Summary 1". You do not need to explain the reason.

Article: {Article}
Summary 0: {Summary_0}
Summary 1: {Summary_1}

Figure 2: The template for pairwise comparison.

You are given a summary and some semantic content units. For each semantic unit, mark "Yes" if it can be inferred from the summary, otherwise mark "No".

Summary: {Summary}
Semantic content units:
1. {SCU_1}
2. {SCU_2}
......
n. {SCU_n}

Figure 3: The template for Pyramid.

Is the sentence supported by the article? Answer "Yes" or "No".

Article: {Article}
Sentence: {Sentence}

Figure 4: The template for binary factuality evaluation.
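As an illustration, the snippet below instantiates the Figure 1 template for a SummEval example. The placeholder names mirror the figure; the article and summary strings are stand-ins.

    # Filling the Likert-scale template of Figure 1 with the SummEval dimensions.
    LIKERT_TEMPLATE = (
        "Evaluate the quality of summaries written for a news article. "
        "Rate each summary on four dimensions: {Dimension_1}, {Dimension_2}, "
        "{Dimension_3}, and {Dimension_4}. You should rate on a scale from "
        "1 (worst) to 5 (best).\n\n"
        "Article: {Article}\n"
        "Summary: {Summary}\n"
    )

    prompt = LIKERT_TEMPLATE.format(
        Dimension_1="relevance",
        Dimension_2="faithfulness",
        Dimension_3="fluency",
        Dimension_4="coherence",
        Article="<news article text>",
        Summary="<system-generated summary>",
    )

Passing this prompt to the query_chatgpt sketch from Section 3.1 would return the raw annotation text for one summary.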

3.3 Post-processing of Results

The vast majority of ChatGPT's responses contained the annotation results, which can be extracted by some simple rules. Invalid responses were considered as failing to complete the annotation successfully and were marked as NAN (not a number).
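The paper does not spell out these rules, so the following is only a hypothetical illustration of the idea: pull the four SummEval dimension scores out of a response with a regular expression and fall back to NAN when nothing can be parsed.

    import math
    import re

    DIMENSIONS = ("relevance", "faithfulness", "fluency", "coherence")

    def parse_likert_response(response: str) -> dict:
        """Extract 1-5 ratings per dimension; unparseable values become NaN."""
        scores = {}
        for dim in DIMENSIONS:
            match = re.search(rf"{dim}\s*[:\-]?\s*([1-5])", response, flags=re.IGNORECASE)
            scores[dim] = int(match.group(1)) if match else math.nan  # invalid -> NAN
        return scores

    print(parse_likert_response("Relevance: 2\nFaithfulness: 2\nFluency: 2\nCoherence: 2"))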
3.4 Evaluation

For Likert scale scoring, we computed the sample-level, system-level, and dataset-level correlations with human judgments. For the other human evaluation methods, we calculated the accuracy of the responses generated by ChatGPT, using the human annotations as the answer.
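The paper reports Spearman's ρ at these three levels without giving formulas; the sketch below follows one common formulation for a (documents × systems) score matrix and should be read as an assumption about the exact definitions.

    import numpy as np
    from scipy.stats import spearmanr

    def correlation_levels(auto: np.ndarray, human: np.ndarray):
        """auto, human: (n_documents, n_systems) matrices of metric and human scores."""
        # Sample level: correlate across systems within each document, then average.
        per_doc = [spearmanr(a, h)[0] for a, h in zip(auto, human)]
        sample = float(np.nanmean(per_doc))
        # System level: average each system over documents, then correlate the means.
        system = spearmanr(auto.mean(axis=0), human.mean(axis=0))[0]
        # Dataset level: pool all (document, system) pairs and correlate once.
        dataset = spearmanr(auto.ravel(), human.ravel())[0]
        return sample, system, dataset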
3.5 Results

Tables 1 and 2 show that ChatGPT has a good ability to evaluate summaries with Likert scale scoring. On SummEval, it performs substantially better than the existing evaluation metrics. On Newsroom, it is second only to BARTScore_s_h and BARTScore_cnn_s_h.

Tables 3, 4, and 5 illustrate that ChatGPT can also perform relatively smoothly on pairwise comparison, Pyramid, and binary factuality evaluation. Nevertheless, based on the current experimental results, ChatGPT has not yet shown a very large advantage except on QAGS_XSUM.

4 Analysis and Discussion

4.1 Impact of different prompts

We tried several different prompts on SummEval. As shown in Figure 5, more detailed step instructions and dimension definitions were added; these instructions and definitions come from the original human evaluations. In addition, we considered setting the system prompt to "You are a human annotator that rates the quality of summaries." when using the ChatGPT API.
consistency relevance fluency coherence
Metric Name sample system dataset sample system dataset sample system dataset sample system dataset
ROUGE-1 0.153 0.744 0.137 0.326 0.744 0.302 0.113 0.730 0.080 0.167 0.506 0.184
ROUGE-2 0.179 0.779 0.129 0.290 0.621 0.245 0.156 0.690 0.062 0.184 0.335 0.145
ROUGE-L 0.111 0.112 0.109 0.311 0.362 0.284 0.103 0.306 0.079 0.128 0.138 0.141
BERTScore 0.105 -0.077 0.118 0.312 0.324 0.362 0.189 0.246 0.150 0.284 0.477 0.317
MoverScore 0.151 0.679 0.150 0.318 0.724 0.294 0.126 0.687 0.119 0.159 0.474 0.178
BARTScore_s_h 0.299 0.800 0.269 0.264 0.524 0.363 0.243 0.614 0.187 0.322 0.477 0.335
BARTScore_h_r 0.097 0.606 0.101 0.178 0.147 0.246 0.002 0.261 0.000 0.017 -0.115 0.064
BARTScore_r_h -0.075 -0.556 -0.090 -0.081 -0.112 -0.136 0.013 -0.212 0.019 0.044 0.165 -0.010
BARTScore_cnn_s_h 0.367 0.435 0.334 0.356 0.765 0.394 0.349 0.746 0.285 0.448 0.700 0.408
BARTScore_cnn_h_r 0.171 0.771 0.106 0.320 0.456 0.244 0.111 0.561 0.066 0.153 0.174 0.130
BARTScore_cnn_r_h 0.001 -0.079 -0.004 0.146 0.312 0.221 0.107 0.297 0.145 0.228 0.506 0.236
ChatGPT 0.435 0.833 0.425 0.433 0.901 0.445 0.419 0.889 0.410 0.561 0.832 0.557

Table 1: Spearman’s ρ of sample level, system level, and dataset level on SummEval.

coherence fluency informativeness relevance


Metric Name sample system dataset sample system dataset sample system dataset sample system dataset
ROUGE-1 0.095 0.429 0.100 0.104 0.429 0.064 0.130 0.286 0.149 0.147 0.357 0.122
ROUGE-2 0.025 0.321 0.080 0.047 0.321 0.045 0.078 0.250 0.158 0.090 0.357 0.124
ROUGE-L 0.064 0.357 0.079 0.072 0.357 0.045 0.089 0.214 0.137 0.106 0.321 0.101
BERTScore 0.148 0.429 0.169 0.170 0.429 0.154 0.131 0.286 0.196 0.163 0.357 0.176
MoverScore 0.162 0.429 0.173 0.120 0.429 0.112 0.188 0.286 0.232 0.195 0.357 0.192
BARTScore_s_h 0.679 0.964 0.656 0.670 0.964 0.615 0.646 0.821 0.645 0.604 0.893 0.588
BARTScore_h_r 0.329 0.286 0.302 0.292 0.286 0.261 0.419 0.429 0.386 0.363 0.357 0.386
BARTScore_r_h -0.311 -0.571 -0.249 -0.215 -0.571 -0.232 -0.423 -0.750 -0.346 -0.334 -0.607 -0.305
BARTScore_cnn_s_h 0.653 0.893 0.623 0.640 0.893 0.596 0.616 0.750 0.592 0.567 0.786 0.557
BARTScore_cnn_h_r 0.239 0.429 0.215 0.235 0.429 0.165 0.284 0.429 0.239 0.267 0.464 0.221
BARTScore_cnn_r_h 0.316 0.429 0.333 0.353 0.429 0.330 0.242 0.286 0.289 0.245 0.357 0.292
ChatGPT 0.484 0.821 0.476 0.480 0.607 0.471 0.521 0.607 0.508 0.524 0.714 0.521

Table 2: Spearman’s ρ of sample level, system level, and dataset level on Newsroom.

Table 6 shows that changing the prompt results in significant changes in the performance of human-like automatic evaluation using ChatGPT, especially in terms of system-level correlations. From the current results, these changes do not lead to higher correlations with human judgments, except for a modest improvement in a few cases from adding the dimension definitions alone.

4.2 Comparison with human evaluation

In terms of accuracy, there is still an overall gap between the current human-like automatic evaluation using ChatGPT and human experts. Table 6 illustrates that, in most cases, the correlation between the scores given by a human expert and the average of the scores given by the human experts is substantially better than that of ChatGPT at all levels. However, for the system-level correlation of fluency, the correlation between ChatGPT and the human evaluations (0.889) is already higher than that of one particular human expert (0.843).

Metric Name Accuracy
ROUGE-1 0.5869
ROUGE-2_f 0.4997
ROUGE-L_f 0.5647
BARTScore 0.5674
MoverScore 0.5864
BARTScore_s_h 0.5858
BARTScore_h_r 0.6151
BARTScore_r_h 0.5317
BARTScore_cnn_s_h 0.5880
BARTScore_cnn_h_r 0.5934
BARTScore_cnn_r_h 0.5089
ChatGPT 0.6178

Table 3: Accuracy of pairwise comparison on TLDR.

Metric Name Accuracy
DAE 0.6304
FactCC 0.5362
ChatGPT 0.6436

Table 4: Accuracy of the binary determination of SCUs on REALSumm.

Metric Name QAGS_CNN QAGS_XSUM
DAE 0.8459 0.6360
FactCC 0.7731 0.4937
ChatGPT 0.8488 0.7573

Table 5: Accuracy of binary factuality evaluation on QAGS.
consistency relevance fluency coherence
sample system dataset sample system dataset sample system dataset sample system dataset
ChatGPT 0.435 0.833 0.425 0.433 0.901 0.445 0.419 0.889 0.410 0.561 0.832 0.557
ChatGPT+def 0.471 0.786 0.479 0.453 0.877 0.479 0.347 0.606 0.341 0.568 0.802 0.570
ChatGPT+def+ins 0.338 -0.149 0.302 0.396 -0.079 0.433 0.349 0.016 0.325 0.501 0.338 0.494
ChatGPT+sys_prompt 0.414 0.007 0.376 0.334 0.268 0.365 0.390 0.149 0.362 0.473 0.552 0.470
Annotator_0 0.843 0.990 0.902 0.748 0.968 0.816 0.740 0.960 0.775 0.845 0.929 0.884
Annotator_1 0.813 0.965 0.881 0.767 0.953 0.823 0.847 0.843 0.876 0.889 0.982 0.913
Annotator_2 0.712 0.973 0.797 0.743 0.944 0.747 0.613 0.923 0.700 0.790 0.932 0.820

Table 6: Spearman’s ρ of sample level, system level, and dataset level on SummEval. Annotator_0, Annotator_1,
Annotator_2 refer to the three expert annotators. We compute the correlation coefficient between the score given
by a particular annotator and the average score of the three. "+def" means adding dimension definitions in the
prompt. "+ins" means adding step instructions in the prompt. Please see the example in Figure 5 for dimension
definitions and step instructions. "+sys_prompt" denotes setting system prompt.

Summary: "Paul Merson was brought on with only seven minutes remaining in his team's 0-0 draw with Burnley. Andros Townsend scored the Tottenham midfielder in the 89th minute. Paul Merson had another dig at Andros Townsend after his appearance. The midfielder had been brought on to the England squad last week. Click here for all the latest arsenal news."

Expert annotator:
Relevance: 1.67 (2, 1, 2)
Consistency (Faithfulness): 1.00 (1, 1, 1)
Fluency: 3.00 (4, 2, 3)
Coherence: 1.33 (2, 1, 1)

ChatGPT:
Relevance: 2
Faithfulness: 2
Fluency: 2
Coherence: 2
Overall score: 2
This summary is not very relevant to the article as it only mentions a few details and leaves out important information. It is not very faithful as it misrepresents the article by suggesting that Andros Townsend scored in the Tottenham game, when in fact he was only brought on as a substitute. The fluency and coherence are also poor, with awkward phrasing and lack of logical flow.

ChatGPT+def:
Relevance: 2
Consistency: 1
Fluency: 2
Coherence: 1
Explanation:
The summary is not relevant as it does not capture the key points of the article. It is inconsistent with the facts in the original article, as it incorrectly states that Andros Townsend scored in the 89th minute, when in fact he was brought on in the 83rd minute. The fluency of the summary is poor, with grammatical errors and incomplete sentences. The coherence of the summary is also poor, as it does not flow well and does not make sense as a whole.

Table 7: Examples of ratings and explanations generated by ChatGPT. The red parts show the factual errors of the summary. The blue parts show the explanations for fluency and coherence. The long source document is not shown. "Expert annotator" shows the average of the scores given by three expert annotators, with each person's score in parentheses.
Imagine you are a human annotator now. You will evaluate the quality of summaries written for a news
article. Please follow these steps:

1. Carefully read the news article, and be aware of the information it contains.
2. Read the proposed summary.
3. Rate the summary on four dimensions: relevance, consistency, fluency, and coherence. You should rate
on a scale from 1 (worst) to 5 (best).

Definitions are as follows:


Relevance: The rating measures how well the summary captures the key points of the article. Consider
whether all and only the important aspects are contained in the summary.
Consistency: The rating measures whether the facts in the summary are consistent with the facts in the
original article. Consider whether the summary does reproduce all facts accurately and does not make up
untrue information.
Fluency: This rating measures the quality of individual sentences, whether they are well-written and
grammatically correct. Consider the quality of individual sentences.
Coherence: The rating measures the quality of all sentences collectively, to fit together and sound natural.
Consider the quality of the summary as a whole.

The article and the summary are given below:


Article: {Article}
Summary: {Summary}

Figure 5: The template for Likert scale scoring with step instructions (in red) and dimension definitions (in orange)
on SummEval.

For variance and reproducibility, human-like automatic evaluation using ChatGPT is more controllable. It is easy to see from Table 6 that the scores of the same samples are not identical across different human annotators, and Belz et al. (2021) pointed out that reproducing manual evaluations is difficult. In contrast, we can make human-like evaluation based on ChatGPT reproducible by setting the randomness parameters (e.g., temperature) at decoding time.

In terms of cost, it is cheaper to perform human-like automatic evaluation. Taking SummEval as an example, in our experiments the assessment of one summary consumed about 1,000 tokens, and it took about 0.002 × 1600 = 3.2 USD[6] to finish the evaluation on the whole dataset. Assuming that a single annotator spends 5 hours annotating the whole dataset, it costs 12 × 5 = 60 USD. We estimate that the cost of human evaluation is about 10 to 20 times higher than that of human-like automatic evaluation using ChatGPT.

[6] https://openai.com/pricing

4.3 The quality of generated explanations

We sampled and examined the responses generated by ChatGPT on SummEval and found the following characteristics of the explanations it gives:

ChatGPT sometimes provides scores or labels followed by an explanation, even if it is not explicitly asked to provide an explanation in the prompt. It is, of course, possible to add a request such as "You do not need to explain." to the prompt so that it does not generate an explanation, but the impact of this on the evaluation scores is unknown.

The explanations generated by ChatGPT are generally self-consistent but not necessarily correct. The generated explanations generally coincide with its scoring. For example, Table 7 shows that ChatGPT and ChatGPT+def both scored the faithfulness of the summary low, and both pointed out factual errors in the summary. However, the correctness of these explanations still needs further testing.

The combination of ChatGPT's explanations and scoring can better confirm whether it understands the requirements of the evaluation, for example, the dimension definitions. Without dimension definitions (see Figure 5), ChatGPT's understanding of fluency and coherence converged: after examining multiple samples, we found that its explanations of the scores for these two dimensions are placed together, and the dataset-level correlation between the scores for the two dimensions is 0.960. ChatGPT is better able to distinguish between the two dimensions when dimension definitions are provided: its explanations of the two dimensions are separated, and the dataset-level correlation between them drops to 0.843.
Refuse to evaluate:
"I'm sorry, but the proposed summary is not acceptable. It contains inaccurate and irrelevant information, and it is not coherent or fluent. Please provide a new summary for evaluation."

Not evaluate as required:
"As an AI language model, I cannot evaluate the quality of summaries as it requires subjective judgment. However, I can provide feedback on the summary's coherence, fluency, and relevance. The proposed summary is poorly written and [......]"

Write a new summary:
"As an AI language model, I cannot evaluate the quality of summaries as it requires subjective human judgment. However, I can provide a summary of the article that is more accurate and relevant: Referee Mark Clattenburg has been [......]"

Continue to write the summary:
"land movement overnight. The mass, which has risen some 50ft above sea level, measures roughly 1,000-1,640ft long, and 100ft wide. [......]"

Table 8: Examples of invalid responses generated by ChatGPT on SummEval.

4.4 Invalid responses

ChatGPT sometimes generates invalid responses, but this fraction is only about 1% at most (see Table 9). As shown in Table 8, common invalid responses were refusing to evaluate, not evaluating as required, writing a new summary, and continuing to write the summary. The reason why invalid responses are generated needs to be further explored.

Invalid responses
ChatGPT 0.0000
ChatGPT+def 0.0003
ChatGPT+def+ins 0.0106
ChatGPT+sys_prompt 0.0013

Table 9: Proportions of invalid responses generated by ChatGPT on SummEval.

5 Related Work

There are some concurrent studies using LLMs for human-like NLG evaluation. According to Kocmi and Federmann (2023), LLMs are currently the most advanced evaluators of translation quality. Wang et al. (2023) tested ChatGPT's ability to be an evaluator on three NLG meta-evaluation datasets. Ji et al. (2023) explored the effectiveness of ChatGPT in ranking model-generated content. Luo et al. (2023) investigated ChatGPT's ability to evaluate factual consistency in summarization. Liu et al. (2023) utilized ChatGPT and GPT-4 to assess the quality of NLG outputs with chain-of-thought prompting.

6 Conclusion

From the above experiments using ChatGPT for human-like summarization evaluation, the key findings are as follows:

• ChatGPT is able to perform summarization evaluation using various human evaluation methods. In some instances, it attains a higher correlation with human judgments than existing evaluation metrics.

• The performance of ChatGPT on summarization evaluation is highly dependent on prompt design.

• Human-like evaluation with ChatGPT is more cost-effective and reproducible than human evaluation.

• The explanations generated by ChatGPT are consistent with its scoring.

References
Anya Belz, Anastasia Shimorina, Shubham Agarwal, and Ehud Reiter. 2021. The ReproGen shared task on reproducibility of human evaluations in NLG: Overview and results. In Proceedings of the 14th International Conference on Natural Language Generation, pages 249-258, Aberdeen, Scotland, UK. Association for Computational Linguistics.

Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. 2020. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347-9359, Online. Association for Computational Linguistics.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. Computing Research Repository, arXiv:2005.14165. Version 4.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055-5070, Online. Association for Computational Linguistics.

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391-409.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708-719, New Orleans, Louisiana. Association for Computational Linguistics.

Yunjie Ji, Yan Gong, Yiping Peng, Chao Ni, Peiyan Sun, Dongyu Pan, Baochang Ma, and Xiangang Li. 2023. Exploring chatgpt's ability to rank content: A preliminary study on consistency with human preferences. Computing Research Repository, arXiv:2303.07610.

Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. Computing Research Repository, arXiv:2302.14520.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332-9346, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81, Barcelona, Spain. Association for Computational Linguistics.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. Gpteval: Nlg evaluation using gpt-4 with better human alignment.

Zheheng Luo, Qianqian Xie, and Sophia Ananiadou. 2023. Chatgpt as a factual inconsistency evaluator for abstractive text summarization. Computing Research Repository, arXiv:2303.15621.

Ani Nenkova and Rebecca Passonneau. 2004. Evaluating content selection in summarization: The pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, pages 145-152, Boston, Massachusetts, USA. Association for Computational Linguistics.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. Computing Research Repository, arXiv:2203.02155.

Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. 2022. Learning to summarize from human feedback. Computing Research Repository, arXiv:2009.01325. Version 3.

Alex Wang, Kyunghyun Cho, and Mike Lewis. 2020. Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008-5020, Online. Association for Computational Linguistics.

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is chatgpt a good nlg evaluator? a preliminary study. Computing Research Repository, arXiv:2303.04048.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. Computing Research Repository, arXiv:2106.11520. Version 2.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. Computing Research Repository, arXiv:1904.09675. Version 3.

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2023. Benchmarking large language models for news summarization. Computing Research Repository, arXiv:2301.13848.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563-578, Hong Kong, China. Association for Computational Linguistics.
