Llm-Adapters: An Adapter Family For Parameter-Efficient Fine-Tuning of Large Language Models
Hui Huang1†, Yingqi Qu2 , Hongli Zhou3 , Jing Liu2 , Muyun Yang1‡
Bing Xu1 , Tiejun Zhao1
1 Faculty of Computing, Harbin Institute of Technology, Harbin, China
2 Baidu Inc., Beijing, China
3 School of Architecture and Design, Harbin Institute of Technology, Harbin, China
{huanghui, hongli.joe}@stu.hit.edu.cn, {quyingqi, liujing46}@baidu.com
{yangmuyun, hitxb, tjzhao}@hit.edu.cn
Abstract

Recently, there has been a growing trend of utilizing Large Language Models (LLMs) to evaluate the quality of other LLMs. Many studies …

[Figure: Pairwise Selection — Which response is better? Instruction: What is 1 plus 1. Response 1: The result is 2. Response 2: The result is 3. Response 1 is better. (Vicuna, Belle, Alpaca)]

Table 1: Detailed statistics of the four fine-tuned judge models, which form the foundation of our empirical study. All four models are open-source, and their training and test data are also publicly released.
3 How Far can Fine-tuned Judges Go?

In this section, we conduct a comprehensive empirical study based on the four representative fine-tuned judge models listed in Table 1. Section 3.1 offers a brief introduction to the construction of fine-tuned judges, and the subsequent sections explain their limitations one by one.

3.1 Preliminary: Fine-tuned LLM Judge

The typical process for fine-tuning a judge model consists of the following three steps:

Step 1: Data Collection. The training data generally comprises three components: instructions, responses, and evaluations. The instructions are typically obtained from instruction datasets, the responses are generated by various representative models, and the evaluations can be derived from either GPT-4 or human annotation.

Step 2: Prompt Designing. The prompt template can be structured in various ways depending on the evaluation scheme, such as pairwise selection (which aims to select the better of a pair of responses to the instruction) and pointwise grading (which aims to assign a score to a single response based on the instruction).

Step 3: Model Fine-tuning. Using the designed prompt and the collected data, the judge model is typically trained following the instruction fine-tuning paradigm (Ouyang et al., 2022). The model is fed an instruction alongside the response(s) and generates an output that includes the evaluation result and possibly an explanation.

After that, the fine-tuned judge can be adopted to evaluate the output of LLMs. While these models are able to achieve superior performance on their self-designed testsets, we reveal that there exist several limitations in the evaluation capabilities of the fine-tuned judges.
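To make Steps 2 and 3 concrete, the sketch below assembles a single pairwise-selection training instance into a (prompt, target) pair for instruction fine-tuning. It is a minimal illustration under our own assumptions: the template wording and field names are not the exact prompts of any released judge model.

# A minimal sketch of how a pairwise-selection training instance could be
# serialized for instruction fine-tuning. The template wording and field
# names are illustrative assumptions, not the prompt of any released judge.

PAIRWISE_TEMPLATE = (
    "You are a helpful assistant that evaluates the quality of responses.\n"
    "Instruction: {instruction}\n"
    "Response 1: {response_1}\n"
    "Response 2: {response_2}\n"
    "Which response is better? Answer with 'Response 1' or 'Response 2', "
    "then briefly explain your judgement.\n"
)

def build_training_example(instruction, response_1, response_2, label, explanation=""):
    """Return a (prompt, target) pair; the target is what the judge learns to generate."""
    prompt = PAIRWISE_TEMPLATE.format(
        instruction=instruction, response_1=response_1, response_2=response_2
    )
    target = f"Response {label} is better. {explanation}".strip()
    return prompt, target

prompt, target = build_training_example(
    "What is 1 plus 1.", "The result is 2.", "The result is 3.",
    label=1, explanation="Response 1 is factually correct.",
)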
3.2 Constrained by Evaluation Scheme

One of the most appealing attributes of LLMs is their generalization ability, which enables them to execute various tasks defined by various instructions (Zhu et al., 2023a). In the case of LLM evaluation, the instruction can also be formed under various schemes: pairwise selection, pointwise grading, etc. Since different judge models are fine-tuned on different schemes, we would like to verify their evaluation capability under the schemes defined by others. Specifically, we take their publicly released checkpoints and cross-validate the judge models on each other's testsets.2

2 We make minimal modifications to the predefined prompts to adapt each model to the different schemes. For detailed prompts please refer to Appendix A.2.

As shown in Tables 2 and 3, each of the four models performs best on its own training scheme, with results comparable to GPT-4. However, if we employ a model on an evaluation scheme on which it was not trained, its performance drops by a large margin. For example, using a pairwise model (such as PandaLM or JudgeLM) for pointwise grading (such as the Prometheus testset), or using a pointwise model (such as Prometheus) for pairwise selection (such as the PandaLM or JudgeLM testsets), leads to catastrophic performance degradation. On the contrary, closed-source models such as GPT-3.5 and GPT-4 consistently exhibit superior performance across various evaluation schemes.
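The "w/o trans" rows in Tables 2 and 3 refer to simple scheme transformations: a pointwise grader such as Prometheus can be used for pairwise selection by grading the two answers separately and comparing the scores, and a pairwise judge such as JudgeLM can be used for pointwise grading by pairing the response with the reference answer. A minimal sketch of the first transformation is given below; score_response is a hypothetical wrapper around the judge's pointwise prompt, not an actual API of these models.

# A minimal sketch of adapting a pointwise grader to pairwise selection by
# grading both answers and comparing the scores (the "with transformation"
# setting of Table 2). score_response is a hypothetical wrapper that prompts
# the judge model and parses a numeric score from its output.

def score_response(judge_model, instruction: str, response: str) -> float:
    prompt = (
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Rate the response on a scale of 1 to 5. Output only the score."
    )
    return float(judge_model.generate(prompt).strip())

def pairwise_via_pointwise(judge_model, instruction, response_1, response_2):
    """Return 1 if response_1 is judged better, 2 if response_2, 0 for a tie."""
    s1 = score_response(judge_model, instruction, response_1)
    s2 = score_response(judge_model, instruction, response_2)
    if s1 == s2:
        return 0
    return 1 if s1 > s2 else 2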
Model               JudgeLM-test        PandaLM-test        Auto-J-test    Average
                    accuracy   F1       accuracy   F1       agreement
JudgeLM-7B          78.98      68.62    68.17      64.87    46.6           64.58
PandaLM-7B          66.44      56.01    68.97      60.95    40.0           58.47
Auto-J-13B          77.19      60.42    72.27      64.27    54.6           68.02
Prometheus-13B      54.24      50.04    45.25      43.58    47.8           49.10
  w/o trans         24.58      23.39    29.03      27.92    16.2           23.26
GPT-3.5-0613        72.57      51.40    64.36      46.40    42.7           59.88
GPT-4-1106          85.28      76.87    74.07      68.09    56.3           71.88

Table 2: Results of evaluators on pairwise selection. Note that Prometheus can be transformed for pairwise selection by grading the two answers separately and comparing the scores; we therefore report results both with and without this transformation.
Model               Prometheus-test-ind              Prometheus-test-ood              Average
                    pearson  kendalltau  spearman    pearson  kendalltau  spearman
Prometheus-13B      0.864    0.788       0.863       0.869    0.789       0.869       0.867
JudgeLM-7B          0.649    0.647       0.739       0.610    0.602       0.690       0.630
  w/o trans         0.398    0.371       0.416       0.384    0.371       0.419       0.391
PandaLM-7B          0.417    0.368       0.423       0.386    0.333       0.383       0.402
Auto-J-13B          0.614    0.526       0.608       0.591    0.504       0.580       0.603
GPT-3.5-0613        0.636    0.536       0.617       0.563    0.453       0.521       0.600
GPT-4-1106          0.742    0.659       0.747       0.743    0.660       0.747       0.743

Table 3: Results of evaluators on pointwise grading. Note that JudgeLM can be transformed for pointwise grading by adding the reference as the first answer; we therefore report results both with and without this transformation.
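The meta-evaluation metrics in Tables 2 and 3 can be computed with standard libraries. The sketch below shows one way to obtain accuracy/F1 for pairwise selection and Pearson/Kendall/Spearman correlations for pointwise grading, assuming the predictions and gold annotations have already been collected as lists; the macro-averaging of F1 is our own assumption.

# A sketch of the meta-evaluation metrics used in Tables 2 and 3, assuming
# predictions and gold labels/scores have already been collected as lists.
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import pearsonr, kendalltau, spearmanr

def pairwise_metrics(pred_labels, gold_labels):
    """Accuracy and macro F1 for pairwise selection (labels: 1, 2, or 0 for tie)."""
    return {
        "accuracy": accuracy_score(gold_labels, pred_labels),
        "f1": f1_score(gold_labels, pred_labels, average="macro"),
    }

def pointwise_metrics(pred_scores, gold_scores):
    """Correlation coefficients for pointwise grading."""
    return {
        "pearson": pearsonr(pred_scores, gold_scores)[0],
        "kendalltau": kendalltau(pred_scores, gold_scores)[0],
        "spearman": spearmanr(pred_scores, gold_scores)[0],
    }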
Model            accuracy  precision  recall  F1
JudgeLM-7B       48.7      52.0       49.7    48.7
PandaLM-7B       55.2      52.6       49.4    46.8
Auto-J-13B       51.7      50.2       46.8    43.7
Prometheus-13B   53.2      49.6       48.4    47.1
GPT-4-1106       66.9      63.8       62.2    61.9

Table 4: Results of evaluators on multi-turn evaluation (MT-bench).

Model            Natu.  Neig.  GPTI.  GPTO.  Manu.
JudgeLM-7B       62.0   23.1   26.1   46.8   28.3
PandaLM-7B       59.0   16.5   21.7   42.6   26.1
Auto-J-13B       70.0   20.9   21.7   46.8   23.9
Prometheus-7B    53.0   22.4   17.4   27.7   32.6
DeBERTa          62.0   26.9   42.4   55.3   34.8
GPT-4-1106       93.5   64.2   76.6   76.6   75.0

Table 5: Accuracy of evaluators on bias evaluation (LLMBar). Natu. is the natural test set; Neig., GPTI., GPTO., and Manu. are the four adversarial test sets designed to quantify the bias.

We also validate the judge models on MT-bench (Zheng et al., 2023), a multi-turn meta-evaluation dataset. As shown in Table 4, while the four models are all trained for single-turn evaluation, they underperform GPT-4 on MT-bench by a large margin. This demonstrates that the fine-tuned judge models are overfitted to their respective evaluation schemes and have lost their generalizability.

3.3 Biased Towards Superficial Quality

Recently, there has been much research on the bias of LLM-based evaluators, namely that the evaluator tends to favor more verbose answers, or answers with a similar format (Wang et al., 2023b; Saito et al., 2023). To address this issue, Zeng et al. (2023) proposed LLMBar as a testbed for the fairness of evaluators. It comprises one natural testset (Natural) and four adversarial testsets (Neighbor, Manual, GPTOut, GPTInst); the adversarial testsets consist of paired outputs with a correct answer and an incorrect answer of better superficial quality (e.g., more fluent, more verbose). We evaluate the judge models on LLMBar,3 as shown in Table 5. As can be seen, the fine-tuned judge models achieve poor results on the adversarial testsets, even much worse than random guessing. This indicates that they are severely biased towards superficial qualities such as formality or verbosity, while neglecting crucial properties such as instruction following, which results in a preference for the incorrect answers. On the other hand, GPT-4 does not over-rely on superficial features and achieves decent accuracy on all the testsets. This suggests that the superior performance of fine-tuned judges on in-domain testsets may rely on spurious statistical features (Niven and Kao, 2019), instead of really differentiating good from bad responses.

3 The detailed prompts are presented in Appendix A.2.
Model            HaluEval-QA      HaluEval-Sum     HaluEval-Dial    ToxicChat        SALAD-Bench
                 accuracy  F1     accuracy  F1     accuracy  F1     accuracy  F1     accuracy  F1
JudgeLM-7B       -         -      -         -      -         -      -         -      82.45     57.44
PandaLM-7B       -         -      -         -      -         -      -         -      57.03     37.23
Auto-J-13B       58.30     56.03  53.10     43.34  63.10     62.90  87.40     52.24  86.88     52.66
  w/o adapt      59.60     57.38  53.47     43.55  64.50     63.71  87.70     51.15  71.77     47.86
Prometheus-7B    47.90     45.84  44.50     40.38  51.00     45.17  77.10     58.14  -         -
  w/o adapt      48.90     45.10  46.60     36.43  53.40     50.24  81.20     61.87  -         -
GPT-3.5-0613     57.50     57.10  62.60     60.27  72.10     72.08  95.10     80.80  98.75     97.54
GPT-4-1106       72.50     72.50  72.00     71.44  84.50     84.78  94.50     82.78  100       100

Table 6: Results of evaluators on aspect-specific evaluation. "w/o adapt" denotes using the original prompt without adaptation to the specific aspect. As HaluEval and ToxicChat are binary classification tasks, we apply Auto-J and Prometheus with pointwise grading and conduct a grid search to determine the classification threshold. As SALAD-Bench is a pairwise task, we apply the pairwise selection models, namely JudgeLM, PandaLM and Auto-J, to select the better response.
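The threshold grid search mentioned in Table 6 can be implemented straightforwardly: sweep candidate thresholds over the pointwise scores and keep the one that maximizes the target metric on a labelled development set. The sketch below is a minimal illustration of this procedure; the candidate grid and the use of F1 as the selection criterion are our own assumptions.

# A minimal sketch of the threshold grid search used to turn pointwise grades
# into binary labels (e.g., hallucinated / not hallucinated). The candidate
# grid and the choice of F1 as the selection criterion are assumptions.
from sklearn.metrics import f1_score

def grid_search_threshold(scores, labels, candidates=None):
    """Pick the score threshold that maximizes F1 on a labelled development set."""
    if candidates is None:
        candidates = [c / 10 for c in range(10, 51)]  # 1.0, 1.1, ..., 5.0
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(labels, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Example: scores produced by a pointwise judge, gold binary labels.
threshold, dev_f1 = grid_search_threshold([4.2, 1.5, 3.8, 2.0], [1, 0, 1, 0])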
We also fine-tune a DeBERTa-based judge model in a classification style on LLMBar (please refer to Section 3.6 for details). It deserves noticing that the DeBERTa-based evaluator also outperforms the LLM-based evaluators by a large margin in terms of fairness. This suggests that the bias of LLM-based evaluators may come from the causal language modeling process: while the model is trained to generate fluent and verbose responses, it also tends to prefer fluent and verbose responses when employed for evaluation, even if they are not aligned with the instruction.

3.4 Incapable of Aspect-specific Evaluation

LLM evaluation covers various aspects such as usefulness, safety, and factuality, and sometimes we are particularly interested in a specific aspect. While previous work primarily assesses the evaluation capability of judge models from a general perspective, we would like to assess them on fine-grained aspects. We select the following three datasets:

1. HaluEval (Li et al., 2023b): This dataset focuses on factuality evaluation. It contains generated and human-annotated hallucinated samples, which lie in three domains: Question-Answering, Summarization and Dialogue. Given an instruction-response pair, the evaluator should decide whether the response is hallucinated.

2. ToxicChat (Lin et al., 2023): This dataset focuses on toxicity evaluation. It contains toxic and non-toxic conversations based on real-world user-AI interactions. Given an instruction-response pair, the evaluator should decide whether the response is toxic.

3. SALAD-Bench (Li et al., 2024): This dataset focuses on safety evaluation. It contains instructions and responses spanning different domains and tasks. Given an instruction and a pair of responses, the evaluator should decide which response is safer.

We validate both closed-source and fine-tuned judges on the three datasets.4 As can be seen from Table 6, the fine-tuned judges fall far behind the closed-source judges on all fine-grained aspects. It deserves noticing that while Prometheus is designed for fine-grained evaluation, it obtains inferior performance on both benchmarks, which indicates that it fails to learn the correlation between fine-grained aspects and evaluation results.

4 We make minimal modifications to the prompts to adapt them to the specific aspects, as detailed in Appendix A.2.

For the purpose of comparison, we also apply Auto-J and Prometheus with their original prompts on aspect-specific evaluation. As can be seen in Table 6, to our surprise, their performance remains roughly the same as with the aspect-specific prompts, indicating that both models have lost the general instruction-understanding ability, and therefore the aspect-specific prompt does not take effect.

3.5 No Benefits from Prompt Engineering

One of the most appealing features of LLMs is that they can benefit from delicate prompt engineering. Various strategies have been proposed to improve LLMs' capability on various tasks, including text evaluation. In this section, we select two representative strategies, namely In-context Learning (ICL) (Dong et al., 2023) and Chain-of-Thought prompting (CoT) (Wei et al., 2022), to further improve the evaluation capability of the judge models:
1. In-context Learning (ICL): task demonstrations are integrated into the prompt as illustrations. In our work, we randomly select 2-4 ICL demonstrations from the training set, subject to the maximum context length.

2. Chain-of-Thought (CoT): the input prompt is structured in a way that mimics human reasoning. In our work, the judge model is forced to generate the explanation first and then provide a final judgement.

We validate both closed-source and fine-tuned judges with the two strategies.5 As shown in Table 7, the closed-source models are improved by a large margin through both prompt engineering strategies. Conversely, the fine-tuned judges hardly benefit from these strategies, sometimes even experiencing severe performance declines. Specifically, in the case of CoT prompting, although we modified the prompts for JudgeLM and PandaLM to generate the CoT first, both models adhered to their original output format and failed to produce a CoT. While there exist more intricate prompting strategies such as ChatEval (Chan et al., 2024) or Branch-Solve-Merge (Saha et al., 2023), we posit that they cannot bring benefits to the fine-tuned judges either, as they have lost their general instruction-following ability and are constrained to a singular output pattern.

5 The detailed prompts are presented in Appendix A.2.

Model            JudgeLM-test        PandaLM-test
                 accuracy   F1       accuracy   F1
JudgeLM-7B       78.98      68.62    68.17      64.87
  + CoT          77.68      67.59    68.03      64.42
  + ICL          68.57      58.52    41.14      40.39
PandaLM-7B       66.44      56.01    68.97      60.95
  + CoT          65.85      56.59    68.03      60.42
  + ICL          66.16      55.94    68.97      59.40
Auto-J-13B       77.19      60.42    72.27      64.27
  + ICL          76.20      59.12    68.37      58.44
GPT-3.5-0613     72.57      51.40    64.36      46.40
  + CoT          75.24      60.71    69.97      63.66
  + ICL          69.38      57.46    70.67      56.12
GPT-3.5-0125     70.67      50.44    64.46      46.60
  + CoT          69.24      56.51    73.37      65.89
  + ICL          70.24      60.46    70.37      53.65
GPT-4-1106       85.28      76.87    74.07      68.09
  + CoT          -          -        77.08      71.77
  + ICL          -          -        64.86      56.20

Table 7: Results of evaluators with ICL and CoT. Increased results are in bold while decreased results are in grey. We did not apply GPT-4 on JudgeLM-test as the annotation of JudgeLM-test is conducted with GPT-4 without ICL and CoT. We only apply ICL on Auto-J as the original prompt of Auto-J already comprises a CoT.
3.6 The Essence of Fine-tuned Judge: A Task-specific Classifier

Combining all the limitations revealed in our experiments, we would like to claim that after the fine-tuning process on a single task, the judge model has degenerated into a task-specific classifier, which is overfitted to the training data. To support this, we fine-tune three groups of judges based on the four groups of data listed in Table 1:6

1. Vicuna-generation: It formulates the evaluation task in a generation style; the prediction head reuses the pretrained language model head and is trained akin to the process of language modeling, based on the 7B version of Vicuna (Chiang et al., 2023);

2. Vicuna-classification: It formulates the evaluation task as classification or regression; the prediction head is newly initialized as a linear projection layer and is decoupled from the language modeling process;

3. DeBERTa-classification: It also formulates the evaluation task as classification, based on DeBERTaV3-large (He et al., 2023), which is 20 times smaller than the 7B version of Vicuna.

Notice that for fine-tuning the Vicuna-generation and Vicuna-classification models, we adopt the same prompt and hyper-parameters, with the only difference lying in the prediction method.7

6 Please refer to Appendix A.1 for training details.
7 An illustration figure is presented in Appendix A.1.

As shown in Table 8, the classification model performs equally well as the generation model on both pairwise selection and pointwise grading. The formidable generative capabilities of LLMs hardly bring any improvement to the evaluation, as the models are fitting the same group of data. Moreover, the DeBERTa-based classifier achieves comparable performance with the LLM-based evaluators,8 which might be attributed to the encoder-only architecture being more suitable for classification.

8 The only exception is on Auto-J-test, which is possibly because a large proportion of the test data exceeds 512 tokens (the maximum context length of DeBERTa).
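A minimal sketch of the generation- and classification-style formulations described above is given below, using the Hugging Face transformers interface: the generation variant keeps the pretrained language-model head, while the classification variant attaches a freshly initialized linear projection over the backbone's hidden states. The checkpoint name, label count, and last-token pooling are our own assumptions rather than the exact training setup of Appendix A.1.

# A sketch of the generation- vs classification-style judge heads, assuming a
# Hugging Face causal LM checkpoint. The checkpoint name, number of labels and
# the use of the last token's hidden state for pooling are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoModel

MODEL_NAME = "lmsys/vicuna-7b-v1.5"  # placeholder checkpoint name

# Generation-style judge: reuse the pretrained LM head and train with the
# usual next-token objective on the verdict text.
gen_judge = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Classification-style judge: a newly initialized linear projection on top of
# the backbone, decoupled from the language-modeling head.
class ClassificationJudge(torch.nn.Module):
    def __init__(self, backbone_name: str, num_labels: int = 3):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.head = torch.nn.Linear(self.backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        pooled = hidden[:, -1, :]  # hidden state of the last token
        return self.head(pooled)   # logits over {response 1 better, response 2 better, tie}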
Model                       JudgeLM-test        PandaLM-test        Auto-J-test    Prometheus-test
                            accuracy   F1       accuracy   F1       agreement      pearson-ind   pearson-ood
Released Models†            78.98      68.62    68.97      60.95    54.6           0.864         0.869
Vicuna-generation‡          82.44      71.77    72.37      60.78    47.6           0.826         0.815
Vicuna-classification‡      82.16      70.07    70.87      60.34    46.8           0.846         0.831
DeBERTa-classification‡     81.30      68.34    72.27      51.75    31.7           0.835         0.813
GPT-3.5-0613                72.57      51.40    64.36      46.40    42.7           0.636         0.563
GPT-4-1106-preview          85.28      76.87    74.07      68.09    56.3           0.742         0.743

Table 8: Comparison of generation- and classification-based evaluators. Results with † are from evaluating the four publicly released models on their respective testsets, and results with ‡ are from evaluating models trained by us.
We also analyze the correlation between the predictions made by different evaluators. As shown in Figures 2 and 3, the correlation among …

Figure 2: The F1 score between the predictions of different evaluators on the JudgeLM testset.

F1 score                  Vicuna-gen.  Vicuna-cls.  DeBERTa-cls.  GPT4
Vicuna-generation         100          83.27        82.74         64.96
Vicuna-classification     83.27        100          84.51         64.29
DeBERTa-classification    82.74        84.51        100           65.03
GPT4                      64.96        64.29        65.03         100

Figure 3: The Pearson correlation between the predictions of different evaluators.

pearson                   Vicuna-gen.  Vicuna-cls.  DeBERTa-cls.  GPT4
Vicuna-generation         1.0          0.961        0.954         0.630
Vicuna-classification     0.961        1.0          0.977         0.627
DeBERTa-classification    0.954        0.977        1.0           0.623
GPT4                      0.630        0.627        0.623         1.0

… judges entirely. While LLMs exhibit excellent performance on various tasks, task-specific models are still widely used in natural language processing. Therefore, the reliability of the fine-tuned judges deserves discussion. For this purpose, we propose a quantitative indicator to estimate whether a sample can be reliably evaluated by a judge model.

Borrowing the idea from confidence estimation (Huang et al., 2024), we propose to quantify the reliability of the judge based on softmax entropy. Given an instruction x and a fine-tuned judge model with parameters θ, the reliability of generating response y is computed as:

\mathrm{SoftEnt}(y \mid x, \theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{v=1}^{V} p_\theta(y_t^v) \log p_\theta(y_t^v)

where T is the length of the generated output and V is the vocabulary size.
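A minimal sketch of this indicator is shown below: for a generation-style judge, we average the entropy of the token-level softmax distributions over the T generated positions. It assumes access to the per-step logits (e.g., the scores returned by a transformers generate call with output_scores=True); the variable names are our own.

# A minimal sketch of the softmax-entropy reliability indicator: average the
# entropy of the token-level distributions over the T generated positions.
# step_logits is assumed to be a list of [batch, vocab_size] tensors, e.g. the
# scores returned by generate(..., output_scores=True, return_dict_in_generate=True).
import torch

def softmax_entropy(step_logits):
    """Higher entropy means the judge is less certain, i.e. less reliable."""
    entropies = []
    for logits in step_logits:                      # one tensor per generated token
        probs = torch.softmax(logits.float(), dim=-1)
        ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
        entropies.append(ent)
    return torch.stack(entropies).mean()            # average over the T positions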
Table 9: Comparison of different reliability indicators for the judge models. We split the test sets into halves based on each indicator and report the performance of the judge on the half with higher scores. The † on the random-split baseline denotes that the results are averaged over three runs.
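The evaluation protocol behind Table 9 can be sketched as follows: rank the test samples by the reliability indicator (oriented so that higher means more reliable, e.g., negated entropy), keep the higher-scoring half, and compute the judge's accuracy on that half. This is an illustrative reconstruction of the described procedure, not the exact evaluation script.

# A sketch of the Table 9 protocol: split the test set into halves by a
# reliability indicator (oriented so that higher = more reliable) and report
# accuracy on the more reliable half.
def accuracy_on_reliable_half(indicator_scores, predictions, gold_labels):
    """All arguments are parallel lists over the test samples."""
    order = sorted(range(len(indicator_scores)),
                   key=lambda i: indicator_scores[i], reverse=True)
    half = order[: len(order) // 2]                 # higher-scoring (more reliable) half
    correct = sum(predictions[i] == gold_labels[i] for i in half)
    return correct / max(len(half), 1)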
5 Conclusion
Figure 10: Prompt template for Auto-J applied for pairwise selection.
Figure 11: Prompt template for Auto-J applied for pointwise grading.
Figure 12: Prompt template for Prometheus applied for pairwise selection.
Figure 13: Prompt template for Prometheus applied for pointwise grading.
Figure 14: Prompt template for JudgeLM applied for multi-turn grading.
Figure 15: Prompt template for PandaLM applied for multi-turn grading.
Figure 16: Prompt template for Auto-J applied for multi-turn grading.
Figure 17: Prompt template for Prometheus applied for multi-turn grading.
Figure 18: Prompt template for JudgeLM applied on SALAD-Bench.
Figure 22: Prompt template for JudgeLM applied with chain-of-thought prompting.
Figure 23: Prompt template for PandaLM applied with chain-of-thought prompting.