Routing to the Expert: Efficient Reward-Guided Ensemble of Large Language Models
Keming Lu, Hongyi Yuan∗, Runji Lin∗
Junyang Lin, Zheng Yuan, Chang Zhou, Jingren Zhou
Alibaba Inc.
{lukeming.lkm,yuanhongyi.yhy,linrunji.lrj}@alibaba-inc.com
{junyang.ljy,yuanzheng.yuanzhen}@alibaba-inc.com
{ericzhou.zc,jingren.zhou}@alibaba-inc.com
Abstract

[Figure 2 diagram: an example query ("Share ideas for adapting art masterpieces into interactive experiences for children. List 5 specific artworks and associated ideas.") is distributed to candidate LLMs 1 through N; a reward model scores their responses (e.g., 0.33, ..., -1.73); the Zooter routing function is trained by distilling this reward distribution with tag-based label enhancement.]
Figure 2: Overview of Zooter. Zooter aims to assemble a set of off-the-shelf LLMs by first conducting reward model ranking on a diverse training set to obtain supervision of model expertise, highlighted in blue in the figure. Instruction tags are then used to mitigate the uncertainty in reward estimation. Zooter uses the normalized rewards as supervision to train a routing function by knowledge distillation. The training flow is marked in green, and the inference flow in orange. Zooter is much lighter in computation, as it routes each query to the corresponding expert LLM at inference time, whereas reward model ranking has to generate outputs from all candidates.
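To make the computational contrast in Figure 2 concrete, here is a minimal sketch of the two inference strategies; the callables standing in for candidate LLMs, the reward model, and the router are hypothetical placeholders rather than the released implementation.

```python
from typing import Callable, List

# Minimal sketch (not the released implementation): contrast between reward
# model ranking (RMR) and Zooter-style routing at inference time. Candidate
# LLMs, the reward model, and the router are passed in as plain callables.

def rmr_inference(query: str,
                  candidates: List[Callable[[str], str]],
                  reward_fn: Callable[[str, str], float]) -> str:
    """RMR: every candidate generates, then the reward model picks the best."""
    responses = [generate(query) for generate in candidates]   # N generations
    rewards = [reward_fn(query, resp) for resp in responses]   # N reward calls
    return responses[max(range(len(responses)), key=rewards.__getitem__)]

def zooter_inference(query: str,
                     candidates: List[Callable[[str], str]],
                     route_fn: Callable[[str], List[float]]) -> str:
    """Zooter: a small router scores the query, so only one expert generates."""
    probs = route_fn(query)                           # e.g., an 86M-parameter ranker
    expert = max(range(len(candidates)), key=probs.__getitem__)
    return candidates[expert](query)                  # single generation
```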
Table 1: Main results of both Zooter and reward model ranking. We report performance across four groups of benchmarks, with the number of subsets given beside each benchmark name. We also report the ranker parameters and the total inference parameters for both candidate models and ensemble methods. MTR denotes the mean task rank, and %Uplift denotes the uplift rate. Higher average scores and uplift rates are better, while a lower MTR is better. Better scores are marked in darker blue for easier interpretation.
is different on each benchmark. To combat this issue, we report not only the scores on each benchmark but also the mean task rank (MTR). All benchmarks we evaluate have multiple subsets, and we define MTR as the rank of the evaluated model among all baselines, averaged over all subsets. Since MTR depends only on the relative rank among baselines, it can easily be adopted across benchmarks with different score scales. Similarly, we also propose an uplift rate, denoting the fraction of subsets on which the evaluated model achieves the best performance. We report these two metrics over a total of 26 evaluation subsets across all benchmarks. A lower MTR and a higher uplift rate indicate that the evaluated model performs consistently well across versatile downstream tasks.
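As a concrete illustration of these two metrics, the sketch below computes MTR and the uplift rate from a table of per-subset scores; the tie-breaking and averaging conventions here are assumptions for illustration.

```python
from typing import Dict, List

# Sketch of the two aggregate metrics defined above, computed from per-subset
# scores. scores[model][subset] is the benchmark score of `model` on `subset`.
# Tie handling and the exact averaging convention are assumptions here.

def mean_task_rank(scores: Dict[str, Dict[str, float]], model: str) -> float:
    """Rank of `model` among all models on each subset, averaged over subsets."""
    ranks: List[int] = []
    for subset in scores[model]:
        ordered = sorted(scores, key=lambda m: scores[m][subset], reverse=True)
        ranks.append(ordered.index(model) + 1)   # 1 = best on this subset
    return sum(ranks) / len(ranks)

def uplift_rate(scores: Dict[str, Dict[str, float]], model: str) -> float:
    """Fraction of subsets on which `model` achieves the best score."""
    subsets = list(scores[model])
    wins = sum(
        1 for s in subsets
        if scores[model][s] >= max(scores[m][s] for m in scores)
    )
    return wins / len(subsets)
```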
Baselines. We also compare Zooter with existing reward model ranking (RMR) methods. We set up RMR baselines with recent reward models, including OAssistRM, Auto-J (Li et al., 2023a), UltraRM (Cui et al., 2023), QwenRM (Bai et al., 2023), and an Oracle ranking for reference. We also consider the pair ranking in LLM-Blender (Jiang et al., 2023) as one of the RMR methods. Besides, we also report the performance of proprietary models, including GPT-3.5-turbo and GPT-4, across our benchmark collection for reference.

Configurations. We train our routing function from mdeberta-v3-base. We use QwenRM to generate rewards on the training queries as supervision for our routing function, as it achieves the best performance in reward model ranking with considerably smaller model parameters, as described in §4.2. We run all training and inference on 8 A100 GPUs. We infer and evaluate all benchmarks with their corresponding configurations and GPT-4 evaluation settings, and use greedy decoding for MMLU, GSM8K, and HumanEval.

4.2 Results

We present the main results in Tab. 1. We report the performance of six routing candidates across our benchmarks; the best model on average (BMA) is LLaMA-2-Chat. We report Zooter with β = 0.3 in tag-based label enhancement. We further analyze the results in the following two aspects:
Complementary Potential. We evaluate the ensemble with reward model ranking (RMR) on five different off-the-shelf reward models. RMR with UltraRM achieves the best MTR and uplift rate on the aggregation of all benchmarks: it ranks at 1.53 and achieves the best model on 72% of subtasks. RMR with QwenRM achieves the second best and has performance similar to UltraRM with a smaller parameter size, followed by RMR with Auto-J, LLM-Blender, and OAssistRM. RMR with QwenRM, UltraRM, and Auto-J outperforms the BMA, showing the effectiveness of RMR. Furthermore, we also calculate the score of RMR with an Oracle ranker, which consistently outperforms all candidates and even outperforms GPT-4 on AlpacaEval and FLASK. Such results provide solid evidence for the complementary potential of off-the-shelf LLMs and also support the key motivation behind Zooter, i.e., using rewards from off-the-shelf reward models as silver supervision for training the routing function. However, we notice that RMR fails on benchmarks such as MMLU, GSM8K, and HumanEval, showing that precisely judging knowledge, mathematics, and coding problems is still challenging for existing RMs.

[Figure 3 plot: MT-Bench score (x-axis, roughly 2 to 10) versus reward entropy (y-axis, roughly 0.8 to 1.8).]

Figure 3: Relation between reward entropy and the scores of reward preference ranking on MT-Bench.

β      AlpacaEval   FLASK   MT-Bench   Benchmarks   All
0      1.4          2.2     2.25       3.67         2.06
0.1    1.2          2.1     2.38       3.67         2.00
0.3    1.2          1.9     2.50       3.67         1.97
0.5    1.2          2.2     3.12       3.67         2.23
0.7    1.2          2.2     3.38       4.00         2.31
0.9    1.2          2.3     3.12       4.00         2.31
1.0    1.2          2.3     3.25       4.00         2.34

Table 2: Mean task rank (MTR) for different values of β in tag-based label enhancement across all benchmarks. The best value of β is 0.3.
Zooter Performance. We then compare the performance of Zooter with that of the BMA and RMR. Zooter outperforms the BMA on AlpacaEval, MT-Bench, and Benchmarks, and achieves similar performance on FLASK. The most significant improvement is on MT-Bench, where Zooter scores 0.39 higher than the BMA. In general, Zooter ranks top-1 on 44% of subtasks, while the BMA does so on only 31%. With the evidence above, Zooter successfully utilizes the complementary potential between LLMs to achieve the best performance more consistently across our benchmarks, with the computation overhead of only an 86M-parameter ranker. At the same time, Zooter outperforms RMR with OAssistRM, LLM-Blender, and Auto-J, with significantly less computation overhead. However, though Zooter outperforms RMR with QwenRM on AlpacaEval, there are still obvious gaps between Zooter and RMR with QwenRM in general.

4.3 Analysis

We provide further analysis on how RM uncertainty may influence the training of Zooter.

RM Uncertainty. As presented in previous research, an RM may have uncertainty in its scalar rewards, which may introduce noise into routing training, since we use RM scores as silver supervision. In this subsection, we first present the existence of this uncertainty to explain the motivation behind tag-based label enhancement, the method we propose to mitigate such uncertainty in routing function training. We calculate the entropy of rewards from QwenRM among all candidate LLMs for each query in MT-Bench and plot it against the MT-Bench scores obtained by reward preference ranking with QwenRM. As shown in Fig. 3, samples with lower reward entropy tend to have higher MT-Bench scores. We interpret this observation as higher reward entropy revealing more uncertainty in the reward. Therefore, we propose tag-based label enhancement, which leverages a tag-based prior to adjust reward entropy.

Label Enhancement. The tag-based label enhancement proposed in §3.2 contains a hyperparameter β, which represents the trade-off between fine-grained sample-level rewards and coarse-grained tag-level rewards. We conduct experiments to tune this hyperparameter and analyze how rewards at different granularities may influence the training of our routing function. As shown in Tab. 2, Zooter achieves the best performance when β equals 0.3, proving that a combination of sample-level and tag-level rewards benefits the reward distillation. The ablation also shows the necessity of tag-based label enhancement. Furthermore, distilling tag-level rewards (β = 0) shows significantly better performance than distilling sample-level rewards (β = 1), supporting the analysis that noise from the uncertainty of RMs in sample-level rewards damages reward distillation.
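Since §3.2 is not reproduced here, the following is only a plausible sketch of the quantities discussed above: per-query rewards are softmax-normalized into a distribution over candidate LLMs, the entropy of that distribution measures RM uncertainty, and tag-based label enhancement is modeled as a β-weighted blend of the sample-level distribution with a coarser tag-level distribution, which then supervises the router through a KL (knowledge-distillation) loss. All function names and the exact normalization are assumptions.

```python
import math
from typing import List

# Hedged sketch (formulas assumed, not taken from §3.2): softmax-normalize
# per-query rewards over candidates, measure their entropy, and blend the
# sample-level distribution with a coarser tag-level distribution using beta.

def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def reward_entropy(rewards: List[float]) -> float:
    """Entropy of the normalized reward distribution; higher = more RM uncertainty."""
    p = softmax(rewards)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def enhanced_label(sample_rewards: List[float],
                   tag_distribution: List[float],
                   beta: float) -> List[float]:
    """Blend fine-grained sample-level rewards with a coarse tag-level prior.
    beta = 1 keeps only sample-level rewards; beta = 0 keeps only tag-level ones."""
    p_sample = softmax(sample_rewards)
    return [beta * ps + (1.0 - beta) * pt
            for ps, pt in zip(p_sample, tag_distribution)]

def kl_distillation_loss(router_probs: List[float], target: List[float]) -> float:
    """KL(target || router), used here as the knowledge-distillation signal;
    assumes router_probs are strictly positive."""
    return sum(t * math.log(t / q) for t, q in zip(target, router_probs) if t > 0)
```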
References

Lingjiao Chen, Matei Zaharia, and James Zou. 2023. Frugalgpt: How to use large language models while reducing cost and improving performance.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code.

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023b. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let's verify step by step. arXiv preprint arXiv:2305.20050.

Jiacheng Liu, Andrew Cohen, Ramakanth Pasunuru, Yejin Choi, Hannaneh Hajishirzi, and Asli Celikyilmaz. 2023. Don't throw away your value model! Making ppo even better via value-guided monte-carlo tree search decoding.

Shayne Longpre, Le Hou, Tu Vu, Albert Webson, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V. Le, Barret Zoph, Jason Wei, and Adam Roberts. 2023. The flan collection: Designing data and methods for effective instruction tuning.

Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, and Chang Zhou. 2023. #InsTag: Instruction tagging for diversity and complexity analysis. arXiv preprint arXiv:2308.07074.

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. 2023a. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.

Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023b. Wizardcoder: Empowering code large language models with evol-instruct.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback.

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model.

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Tal Shnitzer, Anthony Ou, Mírian Silva, Kate Soule, Yuekai Sun, Justin Solomon, Neil Thompson, and Mikhail Yurochkin. 2023. Large language model routing with benchmark datasets.

Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, and Houfeng Wang. 2023. Preference ranking optimization for human alignment.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023a. Llama: Open and efficient foundation language models.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023b. Llama 2: Open foundation and fine-tuned chat models.

Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. 2022. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.

Guan Wang, Sijie Cheng, Qiying Yu, and Changling Liu. 2023a. OpenChat: Advancing open-source language models with imperfect data.

Hongyi Wang, Felipe Maia Polo, Yuekai Sun, Souvik Kundu, Eric Xing, and Mikhail Yurochkin. 2023b. Fusing models with complementary expertise.

Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, and Hannaneh Hajishirzi. 2023c. How far can camels go? Exploring the state of instruction tuning on open resources.

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions.

Seonghyeon Ye, Doyoung Kim, Sungdong Kim, Hyeonbin Hwang, Seungone Kim, Yongrae Jo, James Thorne, Juho Kim, and Minjoon Seo. 2023. Flask: Fine-grained language model evaluation based on alignment skill sets.

Li Yuan, Francis EH Tay, Guilin Li, Tao Wang, and Jiashi Feng. 2020. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3903–3911.

Zheng Yuan, Hongyi Yuan, Chengpeng Li, Guanting Dong, Keming Lu, Chuanqi Tan, Chang Zhou, and Jingren Zhou. 2023a. Scaling relationship on learning mathematical reasoning with large language models.

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023b. Rrhf: Rank responses to align language models with human feedback without tears.
A Datasets

DivInstruct is a diverse instruction mix drawn from multiple open-source datasets, with careful decontamination against all benchmarks evaluated in this work. The detailed composition of DivInstruct is reported in Tab. 3.

Dataset                Amount
ultrachat              18,588
sharedgpt              10,432
wizardlm(sharedgpt)     5,325
wizardlm(alpaca)        5,145
alpaca                  2,186
repair                  1,034
openchat                1,033
flan                      862
math                      849
unnatural                 582
dmcc                      573
dolly                     560
oasst                     183
lima                       70
mbpp                       43

Table 3: Composition of DivInstruct.
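As a quick cross-check of Tab. 3, the snippet below reloads the listed counts and derives each source's share of DivInstruct; the counts are copied from the table, and the helper itself is only an illustrative convenience.

```python
# Counts copied from Tab. 3; this helper only derives totals and shares.
DIVINSTRUCT_COUNTS = {
    "ultrachat": 18_588, "sharedgpt": 10_432, "wizardlm(sharedgpt)": 5_325,
    "wizardlm(alpaca)": 5_145, "alpaca": 2_186, "repair": 1_034,
    "openchat": 1_033, "flan": 862, "math": 849, "unnatural": 582,
    "dmcc": 573, "dolly": 560, "oasst": 183, "lima": 70, "mbpp": 43,
}

def composition_shares(counts: dict) -> dict:
    """Return each dataset's fraction of the full DivInstruct mix."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

if __name__ == "__main__":
    shares = composition_shares(DIVINSTRUCT_COUNTS)
    for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
        print(f"{name:>20s}: {share:6.2%}")
    print(f"{'total':>20s}: {sum(DIVINSTRUCT_COUNTS.values())}")
```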