
Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
Keming Lu, Hongyi Yuan∗, Runji Lin∗
Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou
Alibaba Inc.
{lukeming.lkm,yuanhongyi.yhy,linrunji.lrj}@alibaba-inc.com
{junyang.ljy,yuanzheng.yuanzhen}@alibaba-inc.com
{ericzhou.zc,jingren.zhou}@alibaba-inc.com

∗ Work done during internship at Alibaba Inc.
1 Work in progress.

arXiv:2311.08692v1 [cs.CL] 15 Nov 2023

Abstract

The complementary potential of Large Language Models (LLMs) assumes that off-the-shelf LLMs have heterogeneous expertise in a wide range of domains and tasks, so that an ensemble of LLMs can achieve consistently better performance. Existing ensemble methods for LLMs mainly focus on reward model ranking of outputs, leading to significant computation overhead. To combat this issue, we revisit the complementary potential of LLMs and further elaborate on it by mining latent expertise with off-the-shelf reward models. We propose ZOOTER, a reward-guided routing method that distills rewards on training queries to train a routing function, which can precisely distribute each query to the LLM with expertise about it. We also integrate a tag-based label enhancement to mitigate noise from uncertainty when using rewards as silver supervision. ZOOTER is computationally efficient at inference, as it only introduces the minor overhead of a routing function compared with reward model ranking methods. We evaluate ZOOTER on a comprehensive benchmark collection with 26 subsets across different domains and tasks. ZOOTER outperforms the best single model on average and ranks first on 44% of tasks, even surpassing multiple reward model ranking methods.

[Figure 1: An example of the large language model ensemble on the MT-Bench query "Share ideas for adapting art masterpieces into interactive experiences for children. List 5 specific artworks and associated ideas." Responses from Llama-2, WizardCoder, WizardMath, WizardLM, Vicuna, and OpenChat are shown with their reward, oracle judgement, and ZOOTER routing probability; OpenChat receives the highest reward (4.22) and the highest routing probability (0.63). Reward model ranking, marked in blue, needs to generate responses from all models, while ZOOTER routes the given query to the best model and only infers that one model.]

1 Introduction

Large Language Models (LLMs) aligned with human preference emerge rapidly and are released almost daily (Touvron et al., 2023a,b; Anil et al., 2023; Bai et al., 2023). These off-the-shelf LLMs are further finetuned or aligned with human preference to be generalists (Xu et al., 2023; Touvron et al., 2023b,a) or specialists (Yuan et al., 2023a; Luo et al., 2023a,b; Roziere et al., 2023) for solving versatile tasks. It is worth noticing that LLMs are pretrained and aligned with various data, leading to diverse strengths and weaknesses on versatile downstream tasks (Jiang et al., 2023). Therefore, an ensemble of LLMs harnesses the complementary potential among them and may achieve better performance than a single best-on-average model across diverse tasks.

One of the key challenges in the LLM ensemble is computation efficiency due to the large parameter size of existing LLMs. Previous research (Jiang et al., 2023; Shnitzer et al., 2023) provides solid methods to merge the generation outputs of LLMs as an ensemble. Such methods require tremendous inference cost, which makes them unscalable and thus not competitive with the best-on-average model under low-resource scenarios. To efficiently assemble off-the-shelf LLMs, we first dive deeper into a considerably straightforward but still understudied
assumption: off-the-shelf aligned LLMs, even those aligned as "generalists", have heterogeneous expertise in a wide range of domains and topics. However, analyzing the expertise of an LLM is also challenging, as the latent expertise of an LLM is highly related to its pretraining and alignment data, which are vague and inaccessible even for popular open-source LLMs such as LLAMA-2-CHAT (Touvron et al., 2023b) and WIZARDLM (Xu et al., 2023).

If this assumption strongly holds, off-the-shelf LLMs can be assembled efficiently by assigning each query to the model that is proficient in it, without additional inference costs on each model. Such an efficient routing strategy only requires the inference cost of a single model for each query plus the overhead of a much smaller query router. However, probing the detailed expertise of off-the-shelf LLMs and generating supervision for training routers also requires annotations, and developing a data-efficient training method for routing queries is significantly understudied.

To combat these issues, we propose ZOOTER, a reward-guided query routing method for efficiently assembling off-the-shelf LLMs. ZOOTER obtains and enhances silver supervision from existing reward models (RMs) for query router training and distributes queries in advance to "expertise". As shown in Fig. 1, the reward distribution implies the oracle judgements and reveals latent expertise among LLMs, and ZOOTER captures this expertise from reward distributions to route queries during inference. Specifically, we first conduct a comprehensive study involving four groups of benchmarks across 26 subsets in various domains and tasks. We investigate six widely used open-source LLMs and show their complementary potential on this wide range of downstream tasks by aggregating them via reward model ranking. We then collect a diverse training query set, distill rewards of model expertise as indirect supervision for training an LLM router, and further develop tag-based label enhancement to overcome the shortcomings of such silver labels from reward models. With comprehensive experiments, we show that ZOOTER can benefit from RM silver supervision to learn the latent expertise among LLMs and conduct efficient routing for the model ensemble.

Our contributions are mainly three-fold:

• We revisit the complementary potential of open-source LLMs, which proves the effectiveness of the LLM ensemble, and show that rewards from off-the-shelf RMs can serve as silver supervision for model expertise.

• We propose ZOOTER, an efficient reward-guided routing method that distills rewards from an off-the-shelf reward model to probe model expertise. We then develop a tag-based label enhancement to mitigate noise from the uncertainty of reward models.

• We comprehensively evaluate ensemble methods, including reward model ranking and ZOOTER, on four groups of benchmarks with 26 subsets on different tasks and domains. Our evaluation shows that ZOOTER can effectively assemble LLMs and even outperforms reward model ranking methods with significantly less computation overhead.

2 Related Works

Instruction Tuning and Alignment. Instruction tuning (Longpre et al., 2023) helps LLMs follow versatile instructions and is widely adopted to align LLMs with human preference (Chiang et al., 2023; Xu et al., 2023; Bai et al., 2023). In this work, we focus on assembling aligned LLMs, such as Llama-2-Chat (Touvron et al., 2023b), WizardLM (Xu et al., 2023), and Vicuna (Chiang et al., 2023), and we evaluate them on a wide range of alignment evaluation tasks.

Large Language Model Ensemble. The ensemble of LLMs is an emerging topic due to the explosion of open-source LLMs. LLM ensembling aims to merge off-the-shelf LLMs to achieve consistently better performance across diverse downstream tasks. Few works explore the complementary potential assumption of LLMs and how to assemble LLMs with it. Jiang et al. (2023) present an ensembling framework consisting of a pair ranker and a generation fuser. Chen et al. (2023) sequentially infer off-the-shelf LLMs and stop once a response meets a sufficient quality. Wang et al. (2023b) propose a fusing-of-experts problem that fuses outputs of expert models with complementary knowledge of the data distribution and formulate it as supervised learning. Shnitzer et al. (2023) show the utility and limitations of learning model routers from various benchmark datasets. Although these works all focus on reward ranking or routing strategies to assemble LLMs, ZOOTER is distinguished from these concurrent works in two
aspects. First, our concurrent works require output generations or a forward pass to obtain prompt representations from all candidates, leading to significant computation overhead. ZOOTER infers model expertise by distilling rewards on a predefined training query set to avoid such inference overhead. Second, all these works are developed and evaluated on a set of benchmarks, while ZOOTER can be developed with only queries, without golden responses, and targets more diverse alignment tasks. Therefore, ZOOTER stands out for its efficiency in data and computation. We also evaluate ZOOTER on more diverse alignment tasks to comprehensively examine the complementary potential of LLMs.

Reward Model Guided Generation. Reward models in the context of large language models are commonly used to improve alignment performance via reinforcement learning (Schulman et al., 2017; Ouyang et al., 2022) or preference learning (Yuan et al., 2023b; Rafailov et al., 2023; Song et al., 2023). Reward models can also improve performance during the generation phase. The math reasoning ability of language models can be improved by using reward models to rank multiple generated reasoning paths (Cobbe et al., 2021; Uesato et al., 2022; Lightman et al., 2023). Liu et al. (2023) use reward models to formulate reward-guided decoding. Inspired by these successful applications of reward models in alignment, ZOOTER also takes advantage of off-the-shelf reward models to investigate the latent expertise of LLMs.

3 Methods

We first revisit the complementary potential of LLMs (§3.1) and then introduce ZOOTER as an efficient LLM ensemble method (§3.2).

3.1 Complementary Potential of LLMs

In this section, we present the preliminaries of the assumption: off-the-shelf aligned LLMs have heterogeneous expertise in a wide range of domains and topics. We also briefly introduce two LLM ensemble strategies, reward model ranking and query routing.

Complementary Potential Assumption. Considering a set of LLMs denoted as M = {m_i | i ∈ Z+} and a set of downstream queries denoted as Q = {q_i | i ∈ Z+}, we assume that for each LLM m_i in M there exists a non-empty query subset Q_{m_i} such that the LLM achieves uniformly better performance than the other LLMs in M for any query q_j ∈ Q_{m_i}, i.e., m_i = argmax_{m ∈ M} P(q_j, m(q_j)), where P can be any preference or metric for performance assessment. In this work, we further strengthen this assumption and aim to show that the complementarity between LLMs reveals their expertise in different domains and tasks, so that we can categorize queries and choose the best LLM for each category.

Reward Model Ranking. Reward model ranking (RMR) leverages the complementary potential to ensemble LLMs and achieve surpassing performance. RMR tries to find a reward function P̂ that estimates the oracle preference P, so that we can obtain the best model for each query (Jiang et al., 2023). However, RMR infers all candidate models to get outputs and then ranks them with a reward function, introducing a large computation overhead.

Query Routing. Query routing mitigates efficiency concerns in the LLM ensemble, especially compared with existing RMR methods. In general, query routing tries to find a routing function Z(q, m) such that, for each query q_j ∈ Q, m_i = argmax_{m ∈ M} Z(q_j, m). The routing function distributes queries based on the queries themselves, without generating outputs. If the complementary potential of LLMs holds, the routing function predicts the probability that a query q belongs to the expertise set Q_m of an LLM.
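To make the computational difference between the two strategies concrete, the following is a minimal Python sketch (illustrative only, not the paper's implementation); it assumes each candidate exposes a generate-style callable, the reward model a `score(query, response)` callable, and the router a `route_probs(query)` callable returning a distribution over candidates.

```python
from typing import Callable, Dict

def reward_model_ranking(query: str,
                         candidates: Dict[str, Callable[[str], str]],
                         score: Callable[[str, str], float]) -> str:
    """RMR: generate with every candidate, score all outputs, return the highest-reward one.

    Cost: |M| generations plus |M| reward-model calls per query.
    """
    responses = {name: model(query) for name, model in candidates.items()}
    best = max(responses, key=lambda name: score(query, responses[name]))
    return responses[best]

def query_routing(query: str,
                  candidates: Dict[str, Callable[[str], str]],
                  route_probs: Callable[[str], Dict[str, float]]) -> str:
    """Routing: pick argmax_m Z(query, m) first, then generate with only that model.

    Cost: one small router forward pass plus a single generation per query.
    """
    probs = route_probs(query)
    best = max(probs, key=probs.get)
    return candidates[best](query)
```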
3.2 Zooter

In this section, we propose ZOOTER, a reward-guided query routing method for efficiently assembling large language models. ZOOTER learns from reward model ranking to interpret the latent expertise of each model. As shown in Fig. 2, ZOOTER first infers all candidate LLMs on a training set containing diverse queries to generate responses. Then, all responses are scored by an off-the-shelf reward model providing scalar rewards, marked with blue dashed lines in Fig. 2. The rewards are first enhanced by a tag-based prior for smoothing and denoising. The normalized reward distribution is then used as supervision in the knowledge distillation training of the routing function, shown with green dashed lines in Fig. 2. During inference, the routing function routes the input query to the LLM with the strongest expertise potential for this query, and that LLM generates an expert response. By training such a routing function, ZOOTER achieves a much more efficient ensemble, as it only needs to infer one expert LLM plus the small computation overhead of the routing function.
[Figure 2: Reward-guided Query Routing (light) vs. Reward Model Ranking (heavy).]
Figure 2: Overview of ZOOTER. ZOOTER aims to assemble a set of off-the-shelf LLMs by first conducting a reward model ranking on a diverse training set to obtain supervision of model expertise, highlighted in blue in the figure. Instruction tags are then used to mitigate the uncertainty in reward estimation. ZOOTER uses the normalized rewards as supervision to train a routing function by knowledge distillation. The training circle is marked in green, and the inference is marked in orange. ZOOTER is much lighter in computation as it routes the query to the corresponding expert LLM during inference time, while reward model ranking has to generate outputs for all candidates.

In this section, we introduce its two key components along with their design motivations.

Reward Distillation. As discussed in §3.1, query routing aims to find a routing function predicting the probability that a query q belongs to the expertise set Q_m of an LLM, where Q_m is the set of queries on which an LLM m consistently achieves maximum preference among all candidates. Recalling reward model ranking, we notice that the estimated preference P̂(q, m_i(q)), i.e., the reward, can be interpreted as the relative advantage of an LLM m_i among all candidates on the query q. Therefore, the normalized reward can be used as silver supervision for the routing function:

Z(q)_i = P(q ∈ Q_{m_i}) := exp(P̂(q, m_i(q))) / Σ_{m_j ∈ M} exp(P̂(q, m_j(q))),

as a higher advantage inherently indicates the expertise of an LLM on a query compared with its counterparts.

To estimate the expertise of each model and train the routing function, we apply reward preference ranking on a diverse training set Q̂. We first infer all candidate models on each query q̂ ∈ Q̂, and then assign rewards with an off-the-shelf reward model to obtain a scalar reward for each query and model:

r_i = {P̂(q̂_i, m_j(q̂_i))}_{j=1}^{|M|},  i = 1, ..., |Q̂|.

Then, we train the routing function Z on the training set by knowledge distillation, with the Kullback-Leibler divergence as the loss function:

L(q_i, r_i) = KL(Z(q_i), softmax(r_i)).

ZOOTER is a data-efficient and low-resource method, as the training set Q̂ only contains queries without annotated responses. However, queries in the training set are expected to be as diverse as possible to maximize the generalization ability of the routing function. The distillation process helps ZOOTER learn the latent expertise of each model, so we can mitigate the computation cost by only judging whether a query belongs to the expertise set with our routing function during inference.
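As a concrete illustration of this objective, here is a minimal PyTorch-style sketch of the distillation loss; it assumes a `router_logits` tensor produced by the routing function and a precomputed reward matrix, and it is a sketch rather than the authors' released code.

```python
import torch
import torch.nn.functional as F

def reward_distillation_loss(router_logits: torch.Tensor,
                             rewards: torch.Tensor) -> torch.Tensor:
    """Knowledge-distillation loss between the router and the normalized rewards.

    router_logits: (batch, |M|) unnormalized routing scores Z(q).
    rewards:       (batch, |M|) scalar rewards P_hat(q, m_j(q)) from the reward model.
    """
    log_router = F.log_softmax(router_logits, dim=-1)   # log of the routing distribution
    target = F.softmax(rewards, dim=-1)                 # normalized reward distribution (silver labels)
    # F.kl_div(input=log-probs, target=probs) is the standard distillation formulation:
    # it pulls the routing distribution toward the reward distribution.
    return F.kl_div(log_router, target, reduction="batchmean")
```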
Tag-based Label Enhancement. Although reward distillation provides a feasible way for routing functions to leverage silver supervision from reward model ranking, language reward models provide rewards with uncertainty, which introduces a certain amount of noise (Gleave and Irving, 2022). We empirically analyze this uncertainty in §4.3. Since existing off-the-shelf reward models all involve such noise in terms of uncertainty, as shown in Fig. 3, we leverage instruction tagging to further enhance the rewards on the training queries. The tag-based label enhancement we propose is similar to the widely used label smoothing technique and is proven effective in knowledge distillation (Yuan et al., 2020). Specifically, we first tag each query q̂_i ∈ Q̂ with a local tagger T(·) to obtain a set of tags T(q̂_i). Then, we aggregate the rewards of all queries sharing a tag to obtain tag-wise rewards:

Q_t = {q̂_i | t ∈ T(q̂_i), i = 1, ..., |Q̂|},    r_t = (1 / |Q_t|) Σ_{i ∈ Q_t} r_i.

Then, we enhance the rewards of each query with the tag-wise rewards by a linear combination:

r*_i = β r_i + (1 − β) r_t,    t = T(q̂_i),  i = 1, ..., |Q̂|,

where β is a hyper-parameter trading off coarse-grained tag-wise rewards against fine-grained sample-level rewards. We then replace the original rewards in the KL divergence training loss with the tag-enhanced rewards r* during routing function training.
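A small numpy sketch of this enhancement is given below; it is illustrative only, and the handling of queries carrying several tags (here, averaging their tag-wise rewards) is an assumption rather than a detail stated in the paper.

```python
import numpy as np
from collections import defaultdict

def tag_enhanced_rewards(rewards: np.ndarray, tags: list, beta: float = 0.3) -> np.ndarray:
    """Blend sample-level and tag-level rewards: r*_i = beta * r_i + (1 - beta) * r_t.

    rewards: (num_queries, num_models) rewards from the reward model.
    tags:    tags[i] is the list of tags assigned to query i by the tagger.
    """
    num_models = rewards.shape[1]
    tag_sum = defaultdict(lambda: np.zeros(num_models))
    tag_cnt = defaultdict(int)
    for r_i, query_tags in zip(rewards, tags):
        for t in query_tags:
            tag_sum[t] += r_i
            tag_cnt[t] += 1
    tag_mean = {t: tag_sum[t] / tag_cnt[t] for t in tag_sum}   # tag-wise rewards r_t

    enhanced = np.empty_like(rewards)
    for i, (r_i, query_tags) in enumerate(zip(rewards, tags)):
        if query_tags:
            r_t = np.mean([tag_mean[t] for t in query_tags], axis=0)
        else:
            r_t = r_i   # untagged query: fall back to its own reward
        enhanced[i] = beta * r_i + (1.0 - beta) * r_t
    return enhanced
```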
4 Experiments

In this section, we report the experimental setup (§4.1), main results (§4.2), and analysis of ZOOTER (§4.3).

4.1 Experimental Setup

Candidate LLMs. We select six LLaMA-based LLMs of the same 13B size as the candidate LLMs for query routing: (a) WizardLM (Xu et al., 2023) is aligned with queries and responses augmented by Evol-Instruct; (b) WizardCoder (Luo et al., 2023b) is a coding expert LLM using the same techniques as WizardLM; (c) WizardMath (Luo et al., 2023a) is a math expert LLM aligned with query augmentation, ChatGPT rewards, and PPO optimization; (d) Vicuna (Chiang et al., 2023) is aligned on a large number of conversations between users and proprietary chatbots; (e) OpenChat (Wang et al., 2023a) is aligned with a selected subset of ShareGPT and additional training strategies; (f) Llama-2-Chat (Touvron et al., 2023b) is first aligned by supervised fine-tuning and then by multi-turn rejection sampling. Both the baselines and ZOOTER are run and evaluated on these six candidates.

Training Datasets. We create a diverse mixed instruction dataset from open-source data to maximize the generalization ability of ZOOTER. We first collect and tag open-source data from 13 datasets with a local tagger developed by Lu et al. (2023). For trustworthy evaluation results, we decontaminate all samples whose queries have a 6-gram overlap with any sample in the benchmarks described below to avoid data leakage. Then, we randomly select ten samples for each unique tag to form a diverse mixed instruction dataset, DIVINSTRUCT, with 47,986 instructions across 6,270 different tags. Detailed statistics of DIVINSTRUCT are in Appx. §A.
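The dataset construction described above can be sketched roughly as follows; the helper names and bookkeeping are hypothetical, and the actual tagger and decontamination details used for DIVINSTRUCT may differ.

```python
import random
from collections import defaultdict

def ngrams(text: str, n: int = 6):
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

def build_div_instruct(samples, benchmark_queries, per_tag: int = 10, seed: int = 0):
    """samples: dicts with 'query' and 'tags'; drop 6-gram overlaps with benchmarks, keep up to 10 per tag."""
    bench_grams = set()
    for q in benchmark_queries:
        bench_grams |= ngrams(q)
    clean = [s for s in samples if not (ngrams(s["query"]) & bench_grams)]

    by_tag = defaultdict(list)
    for s in clean:
        for t in s["tags"]:
            by_tag[t].append(s)

    rng = random.Random(seed)
    picked = []
    for group in by_tag.values():
        picked.extend(rng.sample(group, min(per_tag, len(group))))
    return picked
```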
Benchmarks. We involve four sets of benchmarks to comprehensively evaluate ZOOTER on various downstream tasks. We first include three widely used alignment benchmarks with a GPT-4 judge:

• AlpacaEval (Li et al., 2023b) consists of 5 subsets from the koala, vicuna, and other evaluation sets. It contains 805 samples in total.

• FLASK (Ye et al., 2023) is a fine-grained evaluation for alignment. We evaluate 10 domains in FLASK and report the average score across all domains as the final score.

• MT-Bench (Chiang et al., 2023) is a multi-turn evaluation across eight aspects, including mathematics and coding. We only train and route with the first-turn query but evaluate in the multi-turn manner of the original recipe.

However, as reported by Wang et al. (2023c), GPT-4 judgments may be biased and disagree significantly with humans. Therefore, we also include a group of benchmarks consisting of MMLU (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and HumanEval (Chen et al., 2021).

Metrics. Comparing ensemble models on various benchmarks is challenging, as the scale of scores is different on each benchmark.
Model                   Ranker   Infer    AlpacaEval (5)   FLASK (10)    MT-Bench (8)   Benchmarks (3)   All (26)
                        #Param   #Param   Avg.    MTR      Avg.   MTR    Avg.   MTR     Avg.   MTR       MTR    %Uplift
Routing Candidates
WizardCoder             --       13B      0.42    5.6      3.12   5.2    4.44   5.38    30.9   4.33      5.3    0.06
WizardLM                --       13B      0.89    2.0      3.89   1.8    7.15   2.0     44.2   2.0       1.83   0.25
WizardMath              --       13B      0.47    5.0      3.28   5.0    5.73   4.38    34.8   4.0       4.6    0.03
Llama-2-Chat            --       13B      0.91    1.6      3.88   1.5    6.72   2.88    32.3   3.67      2.23   0.31
OpenChat                --       13B      0.89    2.2      3.79   3.1    7.12   2.0     31.2   3.33      2.67   0.19
Vicuna                  --       13B      0.8     3.8      3.7    3.5    6.58   3.25    33.6   2.67      3.4    0.06
BMA                     --       13B      0.91    1.6      3.88   1.5    6.72   2.88    32.3   3.67      2.23   0.31
Zooter
Ours                    86M      13B      0.93    1.17     3.89   1.82   7.11   2.33    34.2   3.0       1.94   0.44
Reward Model Ranking (RMR)
w/ OAssistRM            300M     6x13B    0.79    4.0      3.75   3.73   6.59   3.22    35.1   3.25      3.42   0.19
w/ LLM-Blender          300M     6x13B    0.83    3.67     3.77   3.36   6.21   4.0     36.4   2.75      3.39   0.17
w/ Auto-J               13B      6x13B    0.89    2.67     3.92   1.64   7.03   2.22    32.2   3.5       2.25   0.42
w/ UltraRM              13B      6x13B    0.92    1.17     4.06   1.0    7.18   1.89    40.1   3.25      1.53   0.72
w/ QwenRM               7B       6x13B    0.92    1.33     4.04   1.0    7.26   2.11    38.6   3.0       1.58   0.67
w/ Oracle               --       6x13B    0.98    1.0      4.56   1.0    8.25   1.0     75.3   1.0       1.0    1.0
Proprietary Models
GPT-3.5-turbo           --       --       0.89    2.67     4.06   1.91   7.94   1.78    73.0   1.0       1.78   0.61
GPT-4                   --       --       0.94    1.0      4.37   1.0    8.99   1.0     88.3   1.0       1.0    1.0

Table 1: Main results of both ZOOTER and reward model ranking. We report performance across four groups of benchmarks, with the number of subsets given beside each benchmark name, and we report the parameter counts of the ranker and of the total inference models for both candidates and ensemble methods. MTR denotes the mean task rank, and %Uplift denotes the uplift rate. Higher average scores and uplift rates are better, while a lower MTR is better. We mark better scores in darker blue for better visualization and easier interpretation.

To combat this issue, we report not only the scores on each benchmark but also the mean task rank (MTR). All benchmarks we evaluate have multiple subsets, and we define MTR as the rank of the evaluated model among all baselines, averaged over all subsets. MTR only depends on the rank among baselines, so it can easily be adopted across benchmarks with different score scales. Similarly, we also propose an uplift rate, denoting the rate of subsets on which the evaluated model achieves the best performance. We report these two metrics over a total of 26 evaluation subsets across all benchmarks. A lower MTR and a higher uplift rate show that the evaluated model has consistently higher performance across versatile downstream tasks.
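For clarity, the two metrics can be computed as in the following numpy sketch (illustrative, not the evaluation code used for the paper):

```python
import numpy as np

def mtr_and_uplift(scores: np.ndarray, model_idx: int):
    """scores: (num_models, num_subsets) benchmark scores, higher is better.

    Returns (MTR, uplift rate) for the model at row `model_idx`:
    MTR is its rank (1 = best) averaged over subsets, and the uplift rate
    is the fraction of subsets on which it ranks first.
    """
    # Double argsort turns scores into per-subset ranks (1 = highest score).
    ranks = (-scores).argsort(axis=0).argsort(axis=0) + 1
    model_ranks = ranks[model_idx]
    return float(model_ranks.mean()), float((model_ranks == 1).mean())
```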
Baselines. We compare ZOOTER with existing reward model ranking (RMR) methods. We set up RMR baselines with the latest reward models, including OAssistRM, Auto-J (Li et al., 2023a), UltraRM (Cui et al., 2023), QwenRM (Bai et al., 2023), and an Oracle ranking for reference. We also consider the pair ranking in LLM-Blender (Jiang et al., 2023) as one of the RMR methods. Besides, we report the performance of proprietary models across our benchmark collection for reference, including GPT-3.5-turbo and GPT-4.

Configurations. We train our routing function from mdeberta-v3-base, and we use QwenRM to generate rewards on training queries as supervision for the routing function, as it achieves the best performance in reward model ranking with a considerably smaller parameter count, as described in §4.2. We run all training and inference on 8 A100 GPUs. We infer and evaluate all benchmarks with their corresponding configurations and GPT-4 settings, and we use greedy decoding for MMLU, GSM8K, and HumanEval.
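One plausible way to instantiate such a routing function is as a six-way sequence classifier initialized from the mdeberta-v3-base checkpoint, as in the sketch below; the Hugging Face model ID `microsoft/mdeberta-v3-base` and the function names are assumptions, not details confirmed by the paper.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CANDIDATES = 6  # one routing logit per candidate LLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/mdeberta-v3-base")
router = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/mdeberta-v3-base", num_labels=NUM_CANDIDATES
)

def route(query: str) -> int:
    """Return the index of the candidate LLM with the highest routing score for a query."""
    inputs = tokenizer(query, return_tensors="pt", truncation=True)
    logits = router(**inputs).logits          # shape (1, NUM_CANDIDATES)
    return int(logits.argmax(dim=-1).item())
```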
4.2 Results

We present the main results in Tab. 1. We report the performance of the six routing candidates across our benchmarks; the best model on average (BMA) is Llama-2-Chat. We report ZOOTER with β = 0.3 in tag-based label enhancement. We further analyze the results in the following two aspects:
Complementary Potential. We evaluate the ensemble with reward model ranking (RMR) using five different off-the-shelf reward models. RMR with UltraRM achieves the best MTR and uplift rate on the aggregation of all benchmarks, ranking at 1.53 and providing the best model on 72% of subtasks. RMR with QwenRM achieves the second best and has performance similar to UltraRM with a smaller parameter size, followed by RMR with Auto-J, LLM-Blender, and OAssistRM. RMR with QwenRM, UltraRM, and Auto-J outperforms BMA, showing the effectiveness of RMR. Furthermore, we also calculate the score of RMR with an Oracle ranker, which consistently outperforms all candidates and even outperforms GPT-4 on AlpacaEval and FLASK. Such results provide solid evidence for the complementary potential of off-the-shelf LLMs and also support the key motivation behind ZOOTER, i.e., using rewards from off-the-shelf reward models as silver supervision for routing function training. However, we notice that RMR fails on benchmarks such as MMLU, GSM8K, and HumanEval, showing that precisely judging knowledge, mathematics, and coding problems is still challenging for existing RMs.

[Figure 3: Analysis between reward entropy and scores of reward preference ranking on MT-Bench; reward entropy (y-axis) is plotted against MT-Bench score (x-axis).]

β     AlpacaEval   FLASK   MT-Bench   Benchmarks   All
0     1.4          2.2     2.25       3.67         2.06
0.1   1.2          2.1     2.38       3.67         2.00
0.3   1.2          1.9     2.50       3.67         1.97
0.5   1.2          2.2     3.12       3.67         2.23
0.7   1.2          2.2     3.38       4.00         2.31
0.9   1.2          2.3     3.12       4.00         2.31
1.0   1.2          2.3     3.25       4.00         2.34

Table 2: Mean task rank (MTR) for different values of β in tag-based label enhancement across all benchmarks. The best value of β is marked in blue.

Zooter Performance. We then compare the performance of ZOOTER with that of BMA and RMR. ZOOTER outperforms BMA on AlpacaEval, MT-Bench, and Benchmarks, and achieves similar performance on FLASK. The most significant improvement is on MT-Bench, where ZOOTER scores 0.39 higher than BMA. In general, ZOOTER achieves top-1 on 44% of subtasks while BMA does so on only 31%. With the evidence above, ZOOTER successfully utilizes the complementary potential between LLMs to achieve the best performance more consistently over our benchmarks, with the computation overhead of only an 86M-parameter ranker. At the same time, ZOOTER outperforms RMR with OAssistRM, LLM-Blender, and Auto-J with significantly less computation overhead. However, though ZOOTER outperforms RMR with QwenRM on AlpacaEval, there are still obvious gaps between ZOOTER and RMR with QwenRM in general.

4.3 Analysis

We provide further analysis of how RM uncertainty may influence the training of ZOOTER.

RM Uncertainty. As presented in previous research, an RM may have uncertainty in its scalar rewards, which may introduce noise into routing training since we use RM scores as silver supervision. In this subsection, we first present the existence of this uncertainty to explain the motivation behind tag-based label enhancement, the method we propose to mitigate such uncertainty in routing function training. We calculate the entropy of rewards from QwenRM among all candidate LLMs for each query in MT-Bench and plot it against the MT-Bench score of each sample under reward preference ranking with QwenRM. As shown in Fig. 3, samples with lower reward entropy tend to have higher MT-Bench scores. We interpret this observation as higher reward entropy revealing more uncertainty in the reward. Therefore, we propose tag-based label enhancement, which leverages a tag-based prior to adjust reward entropy.
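The reward entropy used in this analysis is simply the entropy of the softmax-normalized rewards; a minimal sketch (illustrative only) is:

```python
import numpy as np

def reward_entropy(rewards: np.ndarray) -> np.ndarray:
    """Entropy of the softmax-normalized rewards per query.

    rewards: (num_queries, num_models). Higher entropy means the reward model
    is less certain about which candidate it prefers.
    """
    shifted = rewards - rewards.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)
```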
Label Enhancement. The tag-based label enhancement proposed in §3.2 contains a hyper-parameter β, which represents the trade-off between fine-grained sample-level rewards and coarse-grained tag-level rewards. We conduct experiments to tune this hyperparameter and analyze how rewards at different granularities influence the training of our routing function.
As shown in Tab. 2, ZOOTER achieves the best performance when β equals 0.3, proving that a combination of sample-level and tag-level rewards benefits reward distillation. The ablation also shows the necessity of tag-based label enhancement. Furthermore, distilling tag-level rewards (β = 0) shows significantly better performance than distilling sample-level rewards (β = 1), supporting the analysis that noise from the uncertainty of RMs in sample-level rewards damages reward distillation.

5 Conclusion

In this work, we revisit the complementary potential of open-source LLMs and the reward model ranking of multiple off-the-shelf reward models, providing evidence for the effectiveness of the LLM ensemble. We propose ZOOTER, an efficient reward-guided routing method for ensembling off-the-shelf LLMs. Comprehensive evaluation shows that ZOOTER can outperform the best single model on average, and even models ensembled by reward model ranking, with significantly less computation overhead. Valuable future work includes diving deeper into the interpretation of the latent expertise of each LLM.

References

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. Palm 2 technical report. arXiv preprint arXiv:2305.10403.

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. Frugalgpt: How to use large language models while reducing cost and improving performance.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. 2023. Ultrafeedback: Boosting language models with high-quality feedback.

Adam Gleave and Geoffrey Irving. 2022. Uncertainty estimation for language reward models. arXiv preprint arXiv:2203.07472.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. In International Conference on Learning Representations.

Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. 2023. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561.

Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. 2023a. Generative judge for evaluating alignment. arXiv preprint arXiv:2310.05470.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023b. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2023. Don't throw away your value model! Making PPO even better via value-guided monte-carlo tree search decoding.
Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The flan collection: Designing data and methods for effective instruction tuning.

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, and Chang Zhou. 2023. #InsTag: Instruction tagging for diversity and complexity analysis. arXiv preprint arXiv:2308.07074.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023a. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023b. Wizardcoder: Empowering code large language models with evol-instruct.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. 2023. Large language model routing with benchmark datasets.

Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. 2023. Preference ranking optimization for human alignment.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

Guan Wang, Sijie Cheng, Qiying Yu, and Changling Liu. 2023a. OpenChat: Advancing open-source language models with imperfect data.

Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. 2023b. Fusing models with complementary expertise.

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023c. How far can camels go? Exploring the state of instruction tuning on open resources.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions.

Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2023. Flask: Fine-grained language model evaluation based on alignment skill sets.

Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2020. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911.

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023a. Scaling relationship on learning mathematical reasoning with large language models.

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023b. Rrhf: Rank responses to align language models with human feedback without tears.

A Datasets

DIVINSTRUCT is a diverse mixed instruction set built from multiple open-source datasets with careful decontamination against all benchmarks evaluated in this work. The detailed composition of DIVINSTRUCT is reported in Tab. 3.

Dataset                  Amount
ultrachat                18,588
sharedgpt                10,432
wizardlm(sharedgpt)       5,325
wizardlm(alpaca)          5,145
alpaca                    2,186
repair                    1,034
openchat                  1,033
flan                        862
math                        849
unnatural                   582
dmcc                        573
dolly                       560
oasst                       183
lima                         70
mbpp                         43

Table 3: Composition of DIVINSTRUCT.
