
On the Limitations of Fine-tuned Judge Models for LLM Evaluation

Hui Huang1†, Yingqi Qu2, Hongli Zhou3, Jing Liu2, Muyun Yang1‡, Bing Xu1, Tiejun Zhao1
1 Faculty of Computing, Harbin Institute of Technology, Harbin, China
2 Baidu Inc., Beijing, China
3 School of Architecture and Design, Harbin Institute of Technology, Harbin, China
{huanghui, hongli.joe}@stu.hit.edu.cn, {quyingqi, liujing46}@baidu.com
{yangmuyun, hitxb, tjzhao}@hit.edu.cn

Abstract

Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies employ proprietary close-sourced models, especially GPT-4, as the evaluator. Alternatively, other works fine-tune judge models based on open-source LLMs as the evaluator. While the fine-tuned judge models are claimed to achieve evaluation capability comparable with GPT-4, in this work we conduct an empirical study of judge models. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, which imposes these limitations. Finally, we propose an effective indicator to measure the reliability of fine-tuned judges, with the aim of maximizing their utility in LLM evaluation.1

† Contribution during internship at Baidu Inc.
‡ Corresponding Authors.
1 Codes are openly available at https://github.com/HuihuiChyan/UnlimitedJudge

arXiv:2403.02839v2 [cs.CL] 17 Jun 2024

Figure 1: The general training and inference procedure of fine-tuned judge models. The training data generally comprises triplets of instruction, response and evaluation result, which are fed to an open-sourced foundation model (such as LLaMA) to create a judge model. The figure illustrates two evaluation schemes: pairwise selection ("Which response is better?") and pointwise grading ("Score the response from 1 to 5.").

1 Introduction

Recently, the evaluation of Large Language Models (LLMs) has drawn considerable attention from the research community (Liang et al., 2022; Chang et al., 2023). As the capabilities of LLMs continue to develop across various tasks, it is essential to evaluate them from comprehensive perspectives (Qin et al., 2023). However, existing benchmarks, such as MMLU (Hendrycks et al., 2021) and BIG-bench (bench authors, 2023), cannot fully showcase the generative ability of LLMs.

Some research has proposed LLM-as-a-Judge (Li et al., 2023c; Zheng et al., 2023), namely utilizing proprietary LLMs, especially GPT-4 (Achiam et al., 2023), to evaluate an LLM's response. By defining evaluation schemes in the prompt template, LLMs can leverage their instruction-following ability to provide reliable evaluation, achieving a high agreement rate with human evaluators.

However, relying on external APIs for evaluation may raise concerns about privacy leakage, and the opacity of API models also challenges the reproducibility of the evaluation. To address these issues, several fine-tuned judge models have been proposed (Zhu et al., 2023b; Wang et al., 2024), relying on open-source foundation models and data constructed from either GPT-4 or human annotation, as shown in Figure 1. These models are validated on their respective meta-evaluation benchmarks, where the fine-tuned models exhibit performance on par with GPT-3.5 and GPT-4, leading to the affirmation of their evaluation capability.

In this paper, we conduct an empirical study of the evaluation capabilities of different judge models. Our study encompasses a comprehensive comparison across various benchmarks and dimensions. Experiment results indicate that while the fine-tuned judge models achieve superior accuracy on their respective in-domain test sets, they still exhibit limitations compared with close-sourced models in the following aspects:

• The fine-tuned judge model is constrained by a specific evaluation scheme;
• The fine-tuned judge model is biased towards superficial quality;
• The fine-tuned judge model is incapable of aspect-specific evaluation;
• The fine-tuned judge model cannot benefit from prompt engineering.

We argue that these limitations primarily stem from the fine-tuning process, where the foundation model is transformed into a task-specific classifier overfitted to the fine-tuning data.

Finally, we propose an indicator to quantify the reliability of the fine-tuned judge when applied for LLM evaluation. Our indicator is based on confidence estimation from the softmax probability distribution, and we apply calibration to better model the confidence instilled by the task-specific fine-tuning process. We validate our indicator on judge models applied to different meta-evaluation benchmarks, and manage to select the samples lying inside the scope of fine-tuning, on which the performance of the judge is more reliable.

Our contributions can be summarized as follows:

• We reveal several limitations of fine-tuned judge models for LLM evaluation by conducting a comprehensive empirical study;
• We propose an indicator to quantify the reliability of the fine-tuned judge, which can help to select reliable evaluations;
• To the best of our knowledge, this is the first systematic study of the limitations of fine-tuned judge models for LLM evaluation.

2 Related Work

As Large Language Models (LLMs) continue to excel across various tasks and domains, it is essential to design efficient and effective evaluation methods. However, traditional evaluation metrics for generative models, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), are not expressive enough to recognize good responses. On the other hand, after the emergence of BERT (Kenton and Toutanova, 2019), several benchmarks have been designed to evaluate the capabilities of language models, such as MMLU (Hendrycks et al., 2021) and BIG-bench (bench authors, 2023). But they are mostly formalized as multi-choice selection, which cannot fully showcase the models' generation capability.

Recent research has introduced the concept of LLM-as-a-Judge (Li et al., 2023c; Zheng et al., 2023), namely utilizing proprietary LLMs, especially GPT-4 (Achiam et al., 2023), to evaluate an LLM's response. For example, Li et al. (2023c) constructed a test set containing 805 questions and used the win rate against text-davinci-003, as determined by GPT-4, as the evaluation result. Zheng et al. (2023) developed 80 multi-round test questions covering eight common areas, and then automatically scored the models' answers using GPT-4. The GPT-4-based evaluator has proven to be highly accurate compared with professional human evaluators, and presents even better consistency and stability than humans.

However, relying on external APIs for evaluation may raise concerns about privacy leakage, and the opacity of API models also challenges the reproducibility of the evaluation. Therefore, follow-up works suggest fine-tuning language models specialized in evaluation. For instance, PandaLM (Wang et al., 2024) constructs data based on Alpaca instructions and GPT-3.5 annotation, and then fine-tunes LLaMA-7B (Touvron et al., 2023) as a judge model. JudgeLM (Zhu et al., 2023b) constructs data from diversified instruction sets and GPT-4 annotations, and fine-tunes Vicuna (Chiang et al., 2023) as a scalable judge model. Auto-J (Li et al., 2023a) constructs judgement data covering multiple scenarios to train a generative judge model, which can provide both a judgement and a critique. Prometheus (Kim et al., 2023) defines thousands of evaluation criteria, constructs a feedback dataset based on GPT-4, and fine-tunes a fine-grained judge model.

Despite relying on models with only 7B or 13B parameters, these fine-tuned judge models all achieve accuracy comparable with GPT-3.5 or GPT-4 on their respective test sets. However, the evaluation is mostly conducted on in-domain test sets constructed similarly to the training data. A thorough examination of the evaluation capability of the fine-tuned judges is in urgent demand.
Model | Foundation | Instruction | Response | Annotation | Evaluation Scheme | Testset
JudgeLM (Zhu et al., 2023b) | Vicuna | Instruct Datasets (Alpaca-GPT4, Dolly-15K...) | 11 models (Alpaca, Vicuna...) | GPT-4 | Pairwise Grading | GPT-4
PandaLM (Wang et al., 2024) | LLaMA | Alpaca 52K | 5 models (LLaMA, Bloom...) | GPT-3.5 | Pairwise Selection | Human
Auto-J (Li et al., 2023a) | LLaMA2-chat | Preference Datasets (Chatbot Arena, OpenAI WebGPT...) | Preference Datasets | Human | Pairwise Selection, Pointwise Grading | Human
Prometheus (Kim et al., 2023) | LLaMA2-chat | GPT-4 Generated | GPT-4 Generated | GPT-4 | Pointwise Grading | GPT-4

Table 1: Detailed statistics of the four fine-tuned judge models, which form the foundation of our empirical study. All four models are open-source, with their training and test data also publicly released.

3 How Far Can Fine-tuned Judges Go?

In this section, we conduct a comprehensive empirical study based on the four representative fine-tuned judge models listed in Table 1. Section 3.1 offers a brief introduction to the construction of fine-tuned judges, and the subsequent sections explain their limitations one by one.

3.1 Preliminary: Fine-tuned LLM Judge

The typical process for fine-tuning a judge model consists of the following three steps:

Step 1: Data Collection. The training data generally comprises three components: instructions, responses and evaluations. The instructions are typically obtained from instruction datasets, the responses are generated by various representative models, and the evaluations are derived from either GPT-4 or human annotation.

Step 2: Prompt Designing. The prompt template can be structured in various ways depending on the evaluation scheme, such as pairwise selection (which aims to select the better of a pair of responses to the instruction) and pointwise grading (which aims to assign a score to a single response based on the instruction).

Step 3: Model Fine-tuning. Using the designed prompt and the collected data, the training process of the judge model typically follows the instruction fine-tuning paradigm (Ouyang et al., 2022). The model is fed an instruction alongside response(s) to generate output, which includes evaluation results and possibly explanations.

After that, the fine-tuned judge can be adopted to evaluate the output of LLMs. While these models achieve superior performance on their self-designed test sets, we reveal that there exist several limitations to the evaluation capabilities of the fine-tuned judges.

3.2 Constrained by Evaluation Scheme

One of the most appealing attributes of LLMs is their generalization ability, enabling them to execute various tasks defined by various instructions (Zhu et al., 2023a). In the case of LLM evaluation, the instruction can also be formed under various schemes: pairwise selection, pointwise grading, etc. Since different judge models are fine-tuned on different schemes, we would like to verify their evaluation capability under the schemes defined by others. Specifically, we apply their publicly released checkpoints and cross-validate the judge models on each other's test sets.2

As shown in Tables 2 and 3, all four models perform best on their own training schemes, with results comparable with GPT-4. However, if we employ a model on an evaluation scheme it was not trained on, the evaluation performance drops by a large margin. For example, using a pairwise model (such as PandaLM or JudgeLM) for pointwise grading (such as the Prometheus test set), or using a pointwise model (such as Prometheus) for pairwise selection (such as the PandaLM or JudgeLM test sets), all lead to catastrophic performance degradation. On the contrary, close-sourced models such as GPT-3.5 and GPT-4 consistently exhibit superior performance across various evaluation schemes.

We also validate the judge models on MT-bench (Zheng et al., 2023), which is a multi-turn meta-evaluation dataset.2

2 We make minimal modifications to the predefined prompts to adapt the models to different schemes. For detailed prompts please refer to Appendix A.2.
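To make the two evaluation schemes concrete, the sketch below shows how a pairwise-selection prompt and a pointwise-grading prompt might be assembled. The wording is illustrative only and is not the exact template released with any of the four judges (those are reproduced in Appendix A.2).

```python
# Illustrative prompt builders for the two evaluation schemes; the exact
# wording of each judge's released template differs (see Appendix A.2).

def build_pairwise_prompt(instruction: str, response_1: str, response_2: str) -> str:
    """Pairwise selection: ask the judge which of two responses is better."""
    return (
        "You are a helpful assistant that evaluates the quality of responses.\n"
        f"Instruction: {instruction}\n"
        f"Response 1: {response_1}\n"
        f"Response 2: {response_2}\n"
        "Which response is better? Answer with '1' or '2'."
    )


def build_pointwise_prompt(instruction: str, response: str) -> str:
    """Pointwise grading: ask the judge to score a single response from 1 to 5."""
    return (
        "You are a helpful assistant that evaluates the quality of responses.\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Score the response from 1 to 5. Answer with a single number."
    )


if __name__ == "__main__":
    print(build_pairwise_prompt("What is 1 plus 1?",
                                "The result is 2.",
                                "The result is 3."))
```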
Model | JudgeLM-test accuracy | JudgeLM-test F1 | PandaLM-test accuracy | PandaLM-test F1 | Auto-J-test agreement | Average
JudgeLM-7B | 78.98 | 68.62 | 68.17 | 64.87 | 46.6 | 64.58
PandaLM-7B | 66.44 | 56.01 | 68.97 | 60.95 | 40.0 | 58.47
Auto-J-13B | 77.19 | 60.42 | 72.27 | 64.27 | 54.6 | 68.02
Prometheus-13B | 54.24 | 50.04 | 45.25 | 43.58 | 47.8 | 49.10
w/o trans | 24.58 | 23.39 | 29.03 | 27.92 | 16.2 | 23.26
GPT-3.5-0613 | 72.57 | 51.40 | 64.36 | 46.40 | 42.7 | 59.88
GPT-4-1106 | 85.28 | 76.87 | 74.07 | 68.09 | 56.3 | 71.88

Table 2: Results of evaluators on pairwise selection. Note that Prometheus can be transformed for pairwise selection by grading the two answers separately and comparing the scores; we therefore report results both with and without this transformation.
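The transformation mentioned in the caption can be sketched as follows: grade each answer independently with the pointwise judge and compare the two scores. The grade callable is a stand-in for whichever pointwise judge is being transformed; the tie-handling behavior shown here is an assumption.

```python
from typing import Callable

def pairwise_via_pointwise(
    grade: Callable[[str, str], float],  # stand-in: returns a score for (instruction, response)
    instruction: str,
    response_1: str,
    response_2: str,
) -> str:
    """Turn a pointwise grader into a pairwise selector by grading twice and comparing."""
    score_1 = grade(instruction, response_1)
    score_2 = grade(instruction, response_2)
    if score_1 > score_2:
        return "response 1"
    if score_2 > score_1:
        return "response 2"
    return "tie"
```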

Model | Prometheus-test-ind pearson | Prometheus-test-ind kendalltau | Prometheus-test-ind spearman | Prometheus-test-ood pearson | Prometheus-test-ood kendalltau | Prometheus-test-ood spearman | Average
Prometheus-13B | 0.864 | 0.788 | 0.863 | 0.869 | 0.789 | 0.869 | 0.867
JudgeLM-7B | 0.649 | 0.647 | 0.739 | 0.610 | 0.602 | 0.690 | 0.630
w/o trans | 0.398 | 0.371 | 0.416 | 0.384 | 0.371 | 0.419 | 0.391
PandaLM-7B | 0.417 | 0.368 | 0.423 | 0.386 | 0.333 | 0.383 | 0.402
Auto-J-13B | 0.614 | 0.526 | 0.608 | 0.591 | 0.504 | 0.580 | 0.603
GPT-3.5-0613 | 0.636 | 0.536 | 0.617 | 0.563 | 0.453 | 0.521 | 0.600
GPT-4-1106 | 0.742 | 0.659 | 0.747 | 0.743 | 0.660 | 0.747 | 0.743

Table 3: Results of evaluators on pointwise grading. Note that JudgeLM can be transformed for pointwise grading by adding the reference as the first answer; we therefore report results both with and without this transformation.

Model | accuracy | precision | recall | F1
JudgeLM-7B | 48.7 | 52.0 | 49.7 | 48.7
PandaLM-7B | 55.2 | 52.6 | 49.4 | 46.8
Auto-J-13B | 51.7 | 50.2 | 46.8 | 43.7
Prometheus-13B | 53.2 | 49.6 | 48.4 | 47.1
GPT-4-1106 | 66.9 | 63.8 | 62.2 | 61.9

Table 4: Results of evaluators on multi-turn evaluation (MT-bench).

Model | Natu. | Neig. | GPTI. | GPTO. | Manu.
JudgeLM-7B | 62.0 | 23.1 | 26.1 | 46.8 | 28.3
PandaLM-7B | 59.0 | 16.5 | 21.7 | 42.6 | 26.1
Auto-J-13B | 70.0 | 20.9 | 21.7 | 46.8 | 23.9
Prometheus-7B | 53.0 | 22.4 | 17.4 | 27.7 | 32.6
DeBERTa | 62.0 | 26.9 | 42.4 | 55.3 | 34.8
GPT-4-1106 | 93.5 | 64.2 | 76.6 | 76.6 | 75.0

Table 5: Accuracy of evaluators on bias evaluation (LLMBar). Neig., GPTI., GPTO., and Manu. are the four adversarial test sets designed to quantify the bias.

As shown in Table 4, although the four models are all trained for single-turn evaluation, they underperform GPT-4 on MT-bench by a large margin. This demonstrates that the fine-tuned judge models are overfitted to their respective evaluation schemes and have lost their generalizability.

3.3 Biased Towards Superficial Quality

Recently, there has been much research on the bias of LLM-based evaluators, namely that an evaluator may favor more verbose answers, or answers with a similar format (Wang et al., 2023b; Saito et al., 2023). To address this issue, Zeng et al. (2023) proposed LLMBar as a testbed for the fairness of evaluators. It comprises one natural test set (Natural) and four adversarial test sets (Neighbor, Manual, GPTOut, GPTInst); the adversarial test sets consist of paired outputs with a correct answer and an incorrect answer of better superficial quality (e.g., more fluent, more verbose, etc.). We evaluate the judge models on LLMBar,3 as shown in Table 5.

As can be seen, the fine-tuned judge models achieve poor results on the adversarial test sets, even much worse than random guessing. This indicates that they are severely biased towards superficial qualities such as formality or verbosity, while neglecting crucial properties such as instruction following, resulting in a preference for the incorrect answers. On the other hand, GPT-4 does not over-rely on superficial features and achieves decent accuracy on all the test sets. This suggests that the superior performance of fine-tuned judges on the in-domain test sets may rely on spurious statistical features (Niven and Kao, 2019), instead of really differentiating good and bad responses.

3 The detailed prompts are presented in Appendix A.2.
Model | HaluEval-QA accuracy | HaluEval-QA F1 | HaluEval-Sum accuracy | HaluEval-Sum F1 | HaluEval-Dial accuracy | HaluEval-Dial F1 | ToxicChat accuracy | ToxicChat F1 | SALAD-Bench accuracy | SALAD-Bench F1
JudgeLM-7B | - | - | - | - | - | - | - | - | 82.45 | 57.44
PandaLM-7B | - | - | - | - | - | - | - | - | 57.03 | 37.23
Auto-J-13B | 58.30 | 56.03 | 53.10 | 43.34 | 63.10 | 62.90 | 87.40 | 52.24 | 86.88 | 52.66
w/o adapt | 59.60 | 57.38 | 53.47 | 43.55 | 64.50 | 63.71 | 87.70 | 51.15 | 71.77 | 47.86
Prometheus-7B | 47.90 | 45.84 | 44.50 | 40.38 | 51.00 | 45.17 | 77.10 | 58.14 | - | -
w/o adapt | 48.90 | 45.10 | 46.60 | 36.43 | 53.40 | 50.24 | 81.20 | 61.87 | - | -
GPT-3.5-0613 | 57.50 | 57.10 | 62.60 | 60.27 | 72.10 | 72.08 | 95.10 | 80.80 | 98.75 | 97.54
GPT-4-1106 | 72.50 | 72.50 | 72.00 | 71.44 | 84.50 | 84.78 | 94.50 | 82.78 | 100 | 100

Table 6: Results of evaluators on aspect-specific evaluation. "w/o adapt" denotes using the original prompt without adaptation to the specific aspect. As HaluEval and ToxicChat are both binary classification tasks, we apply Auto-J and Prometheus with pointwise grading and conduct a grid search to determine the classification threshold. As SALAD-Bench is pairwise classification, we apply the pairwise selection models, namely JudgeLM, PandaLM and Auto-J, to select the better response.
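The grid search mentioned in the caption can be sketched as follows: sweep a decision threshold over the judge's pointwise scores on a labeled split and keep the value that maximizes accuracy. The score range, step size, and the direction of the decision rule are assumptions made for illustration.

```python
import numpy as np

def search_threshold(scores: np.ndarray, labels: np.ndarray, step: float = 0.1):
    """Grid-search a threshold mapping pointwise judge scores to binary labels.

    scores: judge-assigned scores (e.g., 1-5 grades), shape (N,)
    labels: gold binary labels, shape (N,)
    """
    best_threshold, best_accuracy = None, -1.0
    for threshold in np.arange(scores.min(), scores.max() + step, step):
        predictions = (scores >= threshold).astype(int)  # decision direction is an assumption
        accuracy = float((predictions == labels).mean())
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = float(threshold), accuracy
    return best_threshold, best_accuracy
```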

We also fine-tune a DeBERTa-based judge model in a classification style on LLMBar (please refer to Section 3.6 for details). Notably, the DeBERTa-based evaluator also outperforms the LLM-based evaluators by a large margin in terms of fairness. This suggests that the bias of LLM-based evaluators may come from the causal language modeling process: as the model is trained to generate fluent and verbose responses, it also tends to prefer fluent and verbose responses when employed for evaluation, even if they are not aligned with the instruction.

3.4 Incapable of Aspect-specific Evaluation

LLM evaluation covers various aspects such as usefulness, safety, and factuality, and sometimes we are particularly interested in a specific aspect. While previous work primarily assesses the evaluation capability of the judge models from a general perspective, we would like to assess them on fine-grained aspects. We select the following three datasets:

1. HaluEval (Li et al., 2023b): This dataset focuses on factuality evaluation. It contains generated and human-annotated hallucinated samples from three domains: question answering, summarization and dialogue. Given an instruction-response pair, the evaluator should decide whether the response is hallucinated.

2. ToxicChat (Lin et al., 2023): This dataset focuses on toxicity evaluation. It contains toxic and non-toxic conversations based on real-world user-AI interactions. Given an instruction-response pair, the evaluator should decide whether the response is toxic.

3. SALAD-Bench (Li et al., 2024): This dataset focuses on safety evaluation. It contains instructions and responses spanning different domains and tasks. Given an instruction and a pair of responses, the evaluator should decide which response is safer.

We validate both close-sourced and fine-tuned judges on the three datasets.4 As can be seen from Table 6, the fine-tuned judges fall far behind the close-sourced judges on all fine-grained aspects. Notably, while Prometheus is designed for fine-grained evaluation, it obtains inferior performance on both benchmarks, which indicates that it failed to learn the correlation between fine-grained aspects and evaluation results.

For comparison, we also apply Auto-J and Prometheus with their original prompts on aspect-specific evaluation. As can be seen in Table 6, to our surprise, their performance remains roughly the same as with the aspect-specific prompts, indicating that both models have lost their general instruction-understanding ability, so the aspect-specific prompt does not take effect.

4 We make minimal modifications to the prompts to adapt them to the specific aspects, as detailed in Appendix A.2.

3.5 No Benefits from Prompt Engineering

One of the most appealing features of LLMs is that they can benefit from careful prompt engineering. Various strategies have been proposed to improve LLMs' capability on various tasks, including text evaluation. In this section, we select two representative strategies, namely In-context Learning (ICL) (Dong et al., 2023) and Chain-of-Thought prompting (CoT) (Wei et al., 2022), to further improve the evaluation capability of the judge models:
1. In-context Learning (ICL): task demonstrations are integrated into the prompt as illustrations. In our work, we randomly select 2-4 ICL demonstrations from the training set, subject to the maximum context length.

2. Chain-of-Thought prompting (CoT): the input prompt is structured in a way that mimics human reasoning. In our work, the judge model is forced to generate an explanation first, then provide a final judgement.

We validate both close-sourced and fine-tuned judges with the two strategies.5 As shown in Table 7, the close-sourced models are improved by a large margin through both prompt engineering strategies. Conversely, the fine-tuned judges hardly benefit from these strategies, sometimes even experiencing severe performance declines. Specifically, in the case of CoT prompting, although we modified the prompts for JudgeLM and PandaLM to generate the CoT first, both models adhered to their original output format and failed to produce a CoT. While there exist more intricate prompting strategies such as ChatEval (Chan et al., 2024) or Branch-Solve-Merge (Saha et al., 2023), we posit that they cannot bring benefit to the fine-tuned judges either, as the judges have lost their general instruction-following ability and are constrained to a singular output pattern.

5 The detailed prompts are presented in Appendix A.2.

Model | JudgeLM-test accuracy | JudgeLM-test F1 | PandaLM-test accuracy | PandaLM-test F1
JudgeLM-7B | 78.98 | 68.62 | 68.17 | 64.87
+ CoT | 77.68 | 67.59 | 68.03 | 64.42
+ ICL | 68.57 | 58.52 | 41.14 | 40.39
PandaLM-7B | 66.44 | 56.01 | 68.97 | 60.95
+ CoT | 65.85 | 56.59 | 68.03 | 60.42
+ ICL | 66.16 | 55.94 | 68.97 | 59.40
Auto-J-13B | 77.19 | 60.42 | 72.27 | 64.27
+ ICL | 76.20 | 59.12 | 68.37 | 58.44
GPT-3.5-0613 | 72.57 | 51.40 | 64.36 | 46.40
+ CoT | 75.24 | 60.71 | 69.97 | 63.66
+ ICL | 69.38 | 57.46 | 70.67 | 56.12
GPT-3.5-0125 | 70.67 | 50.44 | 64.46 | 46.60
+ CoT | 69.24 | 56.51 | 73.37 | 65.89
+ ICL | 70.24 | 60.46 | 70.37 | 53.65
GPT-4-1106 | 85.28 | 76.87 | 74.07 | 68.09
+ CoT | - | - | 77.08 | 71.77
+ ICL | - | - | 64.86 | 56.20

Table 7: Results of evaluators with ICL and CoT. Increased results are in bold while decreased results are in grey. We did not apply GPT-4 on JudgeLM-test, as the annotation of JudgeLM-test is conducted with GPT-4 without ICL and CoT. We only apply ICL on Auto-J, as the original prompt of Auto-J already comprises CoT.

3.6 The Essence of the Fine-tuned Judge: A Task-specific Classifier

Combining all the limitations revealed in our experiments, we claim that after the fine-tuning process on a single task, the judge model has degenerated into a task-specific classifier overfitted to the training data. To support this, we fine-tune three groups of judges based on the four groups of data listed in Table 1:6

1. Vicuna-generation: formulates the evaluation task in a generation style; the prediction head reuses the pretrained language model head and is trained akin to the process of language modeling, based on the 7B version of Vicuna (Chiang et al., 2023);

2. Vicuna-classification: formulates the evaluation task as classification or regression; the prediction head is newly initialized as a linear projection layer and is decoupled from the language modeling process;

3. DeBERTa-classification: also formulates the task as classification, based on DeBERTaV3-large (He et al., 2023), which is 20 times smaller than the 7B version of Vicuna.

Note that for fine-tuning the Vicuna-generation and Vicuna-classification models, we adopt the same prompt and hyper-parameters, with the only difference lying in the prediction method.7

As shown in Table 8, the classification model performs equally well as the generation model on both pairwise selection and pointwise grading. The formidable generative capabilities of LLMs hardly bring any improvement to the evaluation, as both variants are fitting the same group of data. Moreover, the DeBERTa-based classifier achieves comparable performance with the LLM-based evaluators,8 which might be explained by the encoder-only architecture being more suitable for classification.

6 Please refer to Appendix A.1 for training details.
7 An illustration figure is presented in Appendix A.1.
8 The only exception is on Auto-J-test, which is possibly because a large proportion of the test data exceeds 512 tokens (the maximum context length of DeBERTa).
Model | JudgeLM-test accuracy | JudgeLM-test F1 | PandaLM-test accuracy | PandaLM-test F1 | Auto-J-test agreement | Prometheus-test pearson-ind | Prometheus-test pearson-ood
Released Models† | 78.98 | 68.62 | 68.97 | 60.95 | 54.6 | 0.864 | 0.869
Vicuna-generation‡ | 82.44 | 71.77 | 72.37 | 60.78 | 47.6 | 0.826 | 0.815
Vicuna-classification‡ | 82.16 | 70.07 | 70.87 | 60.34 | 46.8 | 0.846 | 0.831
DeBERTa-classification‡ | 81.30 | 68.34 | 72.27 | 51.75 | 31.7 | 0.835 | 0.813
GPT-3.5-0613 | 72.57 | 51.40 | 64.36 | 46.40 | 42.7 | 0.636 | 0.563
GPT-4-1106-preview | 85.28 | 76.87 | 74.07 | 68.09 | 56.3 | 0.742 | 0.743

Table 8: Comparison of generation- and classification-based evaluators. Results with † are from evaluating the four publicly released models on their respective test sets, and results with ‡ are from evaluating models trained by us.
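As a concrete illustration of the two judge variants compared in Table 8, the sketch below loads the same backbone once with its generation head and once with a newly initialized classification head via the Huggingface transformers API. The checkpoint name and the three-way label set are placeholders, and the actual fine-tuning loop is omitted.

```python
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification

BASE = "lmsys/vicuna-7b-v1.5"  # placeholder backbone checkpoint

# Generation-style judge: reuses the pretrained language-model head and is
# trained with ordinary next-token prediction on the evaluation output.
generation_judge = AutoModelForCausalLM.from_pretrained(BASE)

# Classification-style judge: a newly initialized linear head on top of the
# same backbone, decoupled from language modeling. Three labels here stand
# for "response 1 wins" / "response 2 wins" / "tie" (an assumed label set).
classification_judge = AutoModelForSequenceClassification.from_pretrained(
    BASE, num_labels=3
)
```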

We also analyze the correlation between the predictions made by different evaluators. As shown in Figures 2 and 3, the correlation among the different classification models is much closer than their correlation with GPT-4. Different as they are in architecture, all three models are inherently classifiers fitting the same set of supervision, leading to similar evaluation outcomes.

F1 score | Vicuna-generation | Vicuna-classification | DeBERTa-classification | GPT4
Vicuna-generation | 100 | 83.27 | 82.74 | 64.96
Vicuna-classification | 83.27 | 100 | 84.51 | 64.29
DeBERTa-classification | 82.74 | 84.51 | 100 | 65.03
GPT4 | 64.96 | 64.29 | 65.03 | 100

Figure 2: The F1 score between the predictions of different evaluators on the JudgeLM test set.

pearson | Vicuna-generation | Vicuna-classification | DeBERTa-classification | GPT4
Vicuna-generation | 1.0 | 0.961 | 0.954 | 0.630
Vicuna-classification | 0.961 | 1.0 | 0.977 | 0.627
DeBERTa-classification | 0.954 | 0.977 | 1.0 | 0.623
GPT4 | 0.630 | 0.627 | 0.623 | 1.0

Figure 3: The Pearson coefficient between the predictions of different evaluators on the Prometheus test set.

Although prior research on instruction tuning emphasizes the importance of data diversity (Zhou et al., 2023; Lu et al., 2024), the fine-tuning of LLM judges does the opposite. Therefore, after fine-tuning for a single task with a fixed prompt template, the model loses its generalization ability and degenerates into a task-specific classifier, which exhibits several limitations due to overfitting.

4 When to Trust Fine-tuned Judges?

Despite the limitations revealed in our study, we do not aim to disregard the significance of fine-tuned judges entirely. While LLMs exhibit excellent performance on various tasks, task-specific models are still widely used throughout natural language processing. Therefore, the reliability of the fine-tuned judges deserves discussion. For this purpose, we propose a quantitative indicator to estimate whether a sample can be reliably evaluated by a judge model.

Borrowing the idea of confidence estimation (Huang et al., 2024), we propose to quantify the reliability of the judge based on softmax entropy. Given an instruction x and a fine-tuned judge model with parameters θ, the reliability of generating response y can be factorized as:

SoftEnt(y \mid x, \theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{v=1}^{V} p_\theta(y_t^v) \log p_\theta(y_t^v)

where p(y_t) represents the conditional distribution p(y_t \mid x, y_{<t}, \theta) at each decoding step, T is the response length, and V is the vocabulary size.

As we would like to quantify whether the sample lies within the task-specific fine-tuning scope, we further calibrate the SoftEnt as follows:

SE\text{-}Cali(y \mid x, \theta_j) = SE(y \mid x, \theta_j) - SE(y \mid x, \theta_b)

where θ_j denotes the fine-tuned judge model, and θ_b denotes its corresponding foundation model. By calibration, we aim to exclude the influence of the foundation model, thus modeling solely the confidence instilled by the task-specific fine-tuning process.

To verify the effectiveness of the reliability score, we re-conduct the cross-validation in Section 3, wherein each test set is split into halves based on different indicators. As shown in Table 9, our proposed SoftEnt-Cali manages to select the samples with better reliability, and therefore achieves higher accuracy than a random split.
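A minimal sketch of the SoftEnt and SE-Cali computations defined above, assuming Huggingface causal-LM judges; how the response span is located after tokenization is an implementation detail that may differ from the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def soft_entropy(model, input_ids: torch.Tensor, response_start: int) -> float:
    """Average per-token softmax entropy over the response span (SoftEnt).

    input_ids: token ids of prompt + generated evaluation, shape (1, L)
    response_start: index where the generated evaluation begins
    """
    logits = model(input_ids).logits                                  # (1, L, V)
    # The distribution for token t is produced at position t - 1.
    probs = F.softmax(logits[0, response_start - 1:-1], dim=-1)       # (T, V)
    token_entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)   # (T,)
    return token_entropy.mean().item()

def se_cali(judge_model, base_model, input_ids: torch.Tensor, response_start: int) -> float:
    """Calibrated entropy (SE-Cali): judge entropy minus foundation-model entropy."""
    return (soft_entropy(judge_model, input_ids, response_start)
            - soft_entropy(base_model, input_ids, response_start))
```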
Model | Method | SALAD-Bench accuracy | SALAD-Bench F1 | JudgeLM-test accuracy | JudgeLM-test F1 | PandaLM-test accuracy | PandaLM-test F1 | Auto-J-test agreement | Average
JudgeLM-7B | random† | 82.50 | 57.36 | 80.40 | 71.87 | 67.54 | 65.20 | 43.25 | 59.42
JudgeLM-7B | perplexity | 88.09 | 59.57 | 85.48 | 74.38 | 72.34 | 63.25 | 49.57 | 61.69
JudgeLM-7B | SoftEnt | 95.10 | 63.62 | 89.40 | 78.09 | 77.96 | 64.01 | 55.89 | 65.40
JudgeLM-7B | SoftEnt-Cali | 91.56 | 61.47 | 88.70 | 76.30 | 79.76 | 67.05 | 56.32 | 65.29
PandaLM-7B | random† | 56.67 | 36.67 | 68.09 | 48.26 | 66.73 | 58.86 | 39.37 | 45.79
PandaLM-7B | perplexity | 59.37 | 55.03 | 70.15 | 52.62 | 68.53 | 57.73 | 42.53 | 51.98
PandaLM-7B | SoftEnt | 63.13 | 62.07 | 72.96 | 53.67 | 75.55 | 62.53 | 43.53 | 55.45
PandaLM-7B | SoftEnt-Cali | 66.77 | 63.77 | 73.91 | 55.97 | 76.35 | 67.30 | 45.40 | 58.11
Auto-J-13B | random† | 72.29 | 48.10 | 77.27 | 60.7 | 71.54 | 63.51 | 46.12 | 54.61
Auto-J-13B | perplexity | 77.47 | 52.56 | 80.28 | 61.85 | 75.55 | 65.84 | 50.35 | 57.65
Auto-J-13B | SoftEnt | 81.98 | 54.68 | 79.13 | 63.58 | 80.16 | 66.61 | 53.01 | 59.47
Auto-J-13B | SoftEnt-Cali | 82.08 | 54.75 | 80.19 | 63.58 | 81.36 | 70.00 | 52.30 | 60.16

Table 9: Comparison of different reliability indicators for the judge models. We split the test sets into halves based on different indicators, and report the performance of the judge on the half with higher scores. The † on the random-split baseline denotes that the results are averaged over three runs.
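For concreteness, a minimal sketch of the split-by-indicator protocol reported in Table 9: each test sample receives a reliability score (with the convention that higher means more reliable, which may require negating the raw entropy), the higher-scoring half is kept, and accuracy is reported on it. Names are illustrative.

```python
import numpy as np

def accuracy_on_reliable_half(reliability: np.ndarray,
                              predictions: np.ndarray,
                              labels: np.ndarray) -> float:
    """Split a test set into halves by a reliability score and report
    accuracy on the half the indicator deems more reliable."""
    order = np.argsort(-reliability)     # most reliable samples first
    keep = order[: len(order) // 2]      # higher-scoring half
    return float((predictions[keep] == labels[keep]).mean())
```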

On the other hand, splits based solely on the SoftEnt or the perplexity of the judge model underperform, as they do not exclude the influence of the foundation model.

We conducted an additional experiment wherein samples were grouped into buckets based on their SoftEnt-Cali scores, allowing us to assess the judge model across different buckets. As shown in Figure 4, the accuracy of the judge models exhibits a strong correlation with the SoftEnt-Cali score. Notably, for both judges on SALAD-Bench, the buckets with higher scores demonstrated an accuracy of 90%, whereas those with lower scores performed akin to random guessing. This underscores the efficacy of SoftEnt-Cali as a reliability indicator of the judge model's performance for evaluation.

Figure 4: Accuracy of judge models when applied to different buckets of data grouped by reliability scores. (a) JudgeLM-7B applied on SALAD-Bench; (b) Auto-J-13B applied on SALAD-Bench.

The reliability indicator serves as a useful tool to compensate for the limitations and improve LLM evaluation. For example, it can be used to decide whether an evaluation sample requires human intervention, or to select the most reliable judgement when multiple judge models are available.

5 Conclusion

In this work, we conduct an empirical study of judge models for LLM evaluation. As revealed in our experiments, despite achieving superior evaluation performance on their in-domain test sets, the fine-tuned judge models underperform GPT-4 in several aspects by a large margin, which we believe originates from the task-specific fine-tuning process. We also propose an indicator to quantify the reliability of fine-tuned judges.

Although it is possible to incorporate more diverse fine-tuning data to amend these limitations, as the potential of LLMs extends beyond boundaries, there will always be new domains and tasks not covered by the fine-tuning process. Therefore, to draw a conclusion, the fine-tuned judge model cannot serve as a general substitute for GPT-4 in LLM evaluation. It is advisable to exercise caution when leveraging fine-tuned judge models for evaluation in real applications, watching for the overlap between the evaluation scenario and the fine-tuning process.
Limitations

Our work still has some limitations. 1) The reliability score proposed in our paper can be combined with human inspection or model ensembling to further improve the LLM evaluation pipeline. For example, when applying a judge model for LLM evaluation, we can select the least reliable samples to be evaluated by human evaluators, thereby improving the evaluation accuracy at minimal expense. Due to time and resource constraints, we have to leave this as future work. 2) The work of Zeng et al. (2023) is only a general assessment of evaluator bias, and we did not include fine-grained assessments of different biases, such as position bias (Wang et al., 2023a), verbosity bias (Saito et al., 2023), etc. 3) Due to time constraints, we did not incorporate manual inspection into the meta-evaluation process. Including human evaluators would enhance the credibility of our claims.

References

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

BIG-bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2023. A survey on in-context learning.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. 2023. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. In The Eleventh International Conference on Learning Representations.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.

Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. Self-evaluation of large language model based on glass-box features.

Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.

Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, et al. 2023. Prometheus: Inducing fine-grained evaluation capability in language models. arXiv preprint arXiv:2310.08491.

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. 2023a. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470.

Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2023b. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, Singapore. Association for Computational Linguistics.

Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, Wangmeng Zuo, Dahua Lin, Yu Qiao, and Jing Shao. 2024. SALAD-Bench: A hierarchical and comprehensive safety benchmark for large language models.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023c. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. 2023. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4694–4702, Singapore. Association for Computational Linguistics.

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2024. #InsTag: Instruction tagging for analyzing supervised fine-tuning of large language models. In The Twelfth International Conference on Learning Representations.

Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4658–4664, Florence, Italy. Association for Computational Linguistics.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. 2023. Is ChatGPT a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.

Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.

Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. 2023. Branch-Solve-Merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123.

Keita Saito, Akifumi Wachi, Koki Wataoka, and Youhei Akimoto. 2023. Verbosity bias in preference labeling by large language models. arXiv preprint arXiv:2310.10076.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Peiyi Wang, Lei Li, Liang Chen, Zefan Cai, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023a. Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.

Peiyi Wang, Lei Li, Liang Chen, Dawei Zhu, Binghuai Lin, Yunbo Cao, Qi Liu, Tianyu Liu, and Zhifang Sui. 2023b. Large language models are not fair evaluators. ArXiv, abs/2305.17926.

Yidong Wang, Zhuohao Yu, Zhengran Zeng, Linyi Yang, Cunxiang Wang, Hao Chen, Chaoya Jiang, Rui Xie, Jindong Wang, Xing Xie, Wei Ye, Shikun Zhang, and Yue Zhang. 2024. PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. 2022. Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. 2023. Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging LLM-as-a-judge with MT-bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.

Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, Susan Zhang, Gargi Ghosh, Mike Lewis, Luke Zettlemoyer, and Omer Levy. 2023. LIMA: Less is more for alignment. In Thirty-seventh Conference on Neural Information Processing Systems.

Kaijie Zhu, Qinlin Zhao, Hao Chen, Jindong Wang, and Xing Xie. 2023a. PromptBench: A unified library for evaluation of large language models. arXiv preprint arXiv:2312.07910.

Lianghui Zhu, Xinggang Wang, and Xinlong Wang. 2023b. JudgeLM: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.
A Appendix

A.1 Training Settings

As mentioned in Section 3, we fine-tune our own judge models based on the four groups of data (JudgeLM (Zhu et al., 2023b), PandaLM (Wang et al., 2024), Auto-J (Li et al., 2023a), Prometheus (Kim et al., 2023)), both in generation style and in classification style, for the purpose of comparison. We train all the models on NVIDIA A100-80GB GPUs with Huggingface Transformers (Wolf et al., 2020) and DeepSpeed (Rasley et al., 2020). Detailed hyper-parameters are presented in Table 10. Note that when comparing generation and classification models, we adopt the same prompt template and the same hyper-parameters, with the only difference lying in the prediction method, as illustrated in Figure 5. For the generation model, the prediction head reuses the pretrained language model head and is trained akin to the process of language modeling. For the classification (regression) model, the prediction head is newly initialized as a linear projection layer and is decoupled from the language modeling process.9

9 Please refer to the class AutoModelForSequenceClassification in the Huggingface library for more details.

Configuration | Vicuna | DeBERTa
max length | 2048 | 512
learning rate | 2e-5 | 2e-5
scheduler | cosine decay | cosine decay
optimizer | AdamW | AdamW
AdamW beta1 | 0.9 | 0.9
AdamW beta2 | 0.999 | 0.98
weight decay | 0.0 | 0.0
training epochs | 3 | 3
batch size | 128 | 128
warmup ratio | 0.003 | 0.003
numerical precision | bf16 | fp16
ZeRO optimizer stage | 2 | None

Table 10: Configurations of the fine-tuned judge models. Both classification and generation models use the same group of configurations based on their foundation model.

A.2 Prompt Templates

As mentioned in Section 3, we take the publicly released checkpoints of the four fine-tuned judge models and validate their performance. To make a fair comparison, we make minimal modifications to their pre-defined prompts, to adapt them to different scenarios. The specific prompts designed for the different sections are listed as follows:

1. For Section 3.2, we adopt the prompts presented in Figures 6 to 13 for cross validation. Note that for JudgeLM and PandaLM, the pre-defined prompts are in the form of pairwise selection, and we make slight modifications to apply them to pointwise grading. For Prometheus, the pre-defined prompt is in the form of pointwise grading, and we make slight modifications to apply it to pairwise selection. For Auto-J, prompts are pre-defined for both pairwise selection and pointwise grading. We also adopt the prompts presented in Figures 14 to 17 on MT-Bench, which are all adapted to multi-turn evaluation.

2. For Section 3.3, we adopt the prompts presented in Figures 6, 8, 10 and 12, as LLMBar is pairwise selection.

3. For Section 3.4, we adopt the prompts presented in Figures 18 to 21 for JudgeLM, PandaLM and Auto-J, respectively. For Prometheus, as its original prompt comprises scoring rubrics, we simply define the corresponding rubrics for the different benchmarks.

4. For Section 3.5, we adopt the prompts presented in Figures 22 and 23 for chain-of-thought prompting.
Figure 5: The architecture of the classification-based judge model. The major difference lies in the prediction head, where a new classification (regression) head is initialized for predicting the result.

Figure 6: Prompt template for JudgeLM applied for pairwise selection.

Figure 7: Prompt template for JudgeLM applied for pointwise grading.


Figure 8: Prompt template for PandaLM applied for pairwise selection.

Figure 9: Prompt template for PandaLM applied for pointwise grading.

Figure 10: Prompt template for Auto-J applied for pairwise selection.

Figure 11: Prompt template for Auto-J applied for pointwise grading.
Figure 12: Prompt template for Prometheus applied for pairwise selection.

Figure 13: Prompt template for Prometheus applied for pointwise grading.
Figure 14: Prompt template for JudgeLM applied for multi-turn grading.

Figure 15: Prompt template for PandaLM applied for multi-turn grading.
Figure 16: Prompt template for Auto-J applied for multi-turn grading.

Figure 17: Prompt template for Prometheus applied for multi-turn grading.
Figure 18: Prompt template for JudgeLM applied on SALAD-Bench.

Figure 19: Prompt template for Auto-J applied on HaluEval.

Figure 20: Prompt template for Auto-J applied on ToxicChat.


Figure 21: Prompt template for Auto-J applied on SALAD-Bench.

Figure 22: Prompt template for JudgeLM applied with chain-of-thought prompting.
Figure 23: Prompt template for PandaLM applied with chain-of-thought prompting.
