Automatic Model Selection With Large Language Models For Reasoning
Table 1: Results comparison (Accuracy %) on 7 arithmetic datasets and 1 symbolic dataset with greedy decoding.
We evaluate our methods on three backbone LLMs.
Table 3: We define the following terms: ∆UpperBound = Acc_UpperBound − Acc_m1, where Acc_UpperBound is the upper bound accuracy obtained by assuming perfect model selection, and m1 is the stronger of the two base models. ∆UpperBound reflects the expected performance difference between the two base models. Success Rate is the rate of correct selections on questions where exactly one of CoT and PAL is correct; i.e., we ignore the cases where both methods are correct or both are wrong. Improvement is the performance improvement achieved over the stronger base model.
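For concreteness, the short sketch below computes these three quantities from per-question correctness indicators of the two base methods; the toy arrays, variable names, and selector decisions are our own illustration, not the released evaluation code.

import numpy as np

# Toy per-question correctness (1 = correct) for the two base methods and the
# selector's choices; in practice these come from evaluating CoT (m1) and
# PAL (m2) on the same benchmark questions.
cot_correct = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # m1, the stronger base model
pal_correct = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # m2
selected_pal = np.array([0, 0, 1, 1, 0, 0, 1, 0])  # 1 = selector picked PAL

acc_m1 = cot_correct.mean()
acc_upper = (cot_correct | pal_correct).mean()      # perfect-selection upper bound
delta_upper_bound = acc_upper - acc_m1

# Success Rate: among questions where exactly one method is correct,
# how often the selector picks the correct one.
disagree = cot_correct != pal_correct
picked_correct = np.where(selected_pal == 1, pal_correct, cot_correct)
success_rate = picked_correct[disagree].mean()

improvement = picked_correct.mean() - acc_m1        # gain over the stronger base model
print(delta_upper_bound, success_rate, improvement)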
As demonstrated in Table 2, we can achieve substantial improvements over self-consistency alone. With GPT-4, even though both base models already score around 95%, combining them leads to a new state-of-the-art performance of 96.5% on GSM8K. The performance with 5 samples and with 10 samples turns out to be the same for GPT-4. Additionally, using ChatGPT, we attain an accuracy of 88.9% when sampling 10 times.
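To illustrate how the selection step can be layered on top of sampling, the following is a minimal sketch under assumed details of the sampling and voting procedure: the selector is queried once per sampled CoT/PAL pair and the selected answers are then majority-voted. The exact aggregation used in the experiments may differ from this sketch.

from collections import Counter

def select_with_self_consistency(question, cot_samples, pal_samples, selector):
    """Pair up sampled CoT and PAL solutions, let the LLM selector pick one
    answer per pair, then majority-vote over the selected answers.
    `selector(question, cot, pal)` is assumed to return the chosen final answer."""
    chosen = [selector(question, cot, pal)
              for cot, pal in zip(cot_samples, pal_samples)]
    answer, _count = Counter(chosen).most_common(1)[0]
    return answer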
3.3 Influencing Factors

We have shown that our method performs well across different datasets. To better understand the reasons for the performance improvement on different datasets and backbone LLMs, we present the performance improvement and the associated influencing factors in Table 3. We find that there is a high expected performance difference between the two base methods, reflected in the high ∆UpperBound, which captures how differently the two base models behave across questions. A larger ∆UpperBound indicates more room for potential improvement. Specifically, notice that on GSM8K with ChatGPT, ∆UpperBound is 8.6% although the accuracies of CoT and PAL are very similar (80.8% and 79.2%, respectively). Similarly, with GPT-4, ∆UpperBound is 2.5% while the accuracies of the two base models are close (94.6% vs. 94.0%). In addition, we find that the success rate of model selection is relatively high with Codex and GPT-4. For example, with GPT-4, the success rate is 72.6% on GSM8K. In contrast, ChatGPT has a lower success rate, which explains its relatively weaker performance improvement. This echoes the theoretical analysis in Section 2.2: to achieve better performance, the two base models need to behave differently on individual questions, i.e., we need a high |R(x)| together with a high success rate ρ, which jointly contribute to more substantial improvements. Indeed, the study of ∆UpperBound and success rate explains the significant performance improvement on these datasets.

4 Analysis

In this section, we provide several analyses of when and how the method works.

4.1 Combination between Similar Methods

Backbone        ChatGPT            GPT-4
m1              CoT      CoT       CoT      CoT
m2              PAL      CoT′      PAL      CCoT
Acc_m1          80.8     80.8      94.6     94.6
Acc_m2          79.2     79.2      94.0     95.1
Ours            82.6     80.8      95.6     95.2
Improvement     (+1.8)   (+0)      (+1.0)   (+0.1)
∆UpperBound     8.6      7.5       2.5      1.7
Success Rate    60.4     52.2      72.6     58.8

Table 4: Results of other model combinations on GSM8K. CoT′ denotes the base CoT model with the temperature set to 0.1. CCoT denotes ComplexCoT (Fu et al., 2022).

We choose CoT and PAL as our two base models with the motivation of combining the different strengths of distinct models. We conduct experiments to examine whether the performance still improves when we combine two similar base models. We use two variants of CoT in this experiment: CoT′, where we set the temperature to 0.1, and ComplexCoT (Fu et al., 2022), where we use more complex examples in the prompt. The accuracies of both variants are similar to or higher than the accuracy of PAL. The results are shown in Table 4. We have the following observations:
• In our experiments, model selection between CoT and CoT′, or between CoT and ComplexCoT, does not lead to substantial performance gains, even though the accuracies of CoT′ and ComplexCoT are on par with PAL. On the other hand, model selection between CoT and PAL results in consistent performance improvements. To understand the reasons behind these outcomes, we further investigate ∆UpperBound and the success selection rate.

• ∆UpperBound of CoT-PAL exceeds that of the other combinations, CoT-CoT′ and ComplexCoT-CoT, even though those combinations employ two stronger or comparable base models. This observation suggests a larger absolute accuracy difference per question for CoT-PAL. It indicates that CoT and PAL perform more dissimilarly than the other model combinations. Theoretically, it represents a larger |R(x)|. As Proposition 1 highlights, without a substantial |R(x)|, it is unlikely to achieve a significant performance gain, since the improvement term is scaled by |R(x)|.

• The success selection rate of CoT-PAL surpasses that of the other model combinations. This means that the selection model is more likely to select the correct choice when one solution derives from CoT and the other from PAL. In theory, this higher success rate implies that when |R(x)| is high for a given question x, the success selection probability ρx for CoT-PAL is higher than for the other combinations.

These findings support our initial motivation and hypothesis. We choose CoT and PAL as our two base models because they represent distinct reasoning approaches, one using natural language and the other a programming language. We expect these models to exhibit significant differences in errors and accuracies, indicated by a high ∆UpperBound. Moreover, the considerable disparity in errors on a particular question makes it easier for large language models (LLMs) to select the correct option, leading to a higher success rate than selecting between two similar base models such as CoT-CoT′. This holds true even when different prompts or temperature settings are used.

Backbone   Explanation   Acc     Success Rate
Codex      w/o exp       74.7    74.9
           w/ exp        74.6    74.2
ChatGPT    w/o exp       81.8    55.9
           w/ exp        82.6    60.4
GPT-4      w/o exp       95.5    69.9
           w/ exp        95.6    72.6

Table 5: Accuracy and success rate with and without explanation on GSM8K.

4.2 Ablation Study on Explanation

To perform model selection, we provide explanations in the prompt and also ask the model to generate an explanation after making its choice, as we expect pointing out why the other choice is incorrect to improve the model's selection ability. To investigate the role of explanations in enhancing the model's selection capability, we conduct experiments on GSM8K in which explanations are excluded from the answers.

The results in Table 5 reveal that for backbone LLMs with more powerful in-context learning abilities, such as Codex and GPT-4, including explanations in the answer has negligible influence on performance. In other words, Codex and GPT-4 achieve a relatively high success selection rate even when no explanation is given. However, for ChatGPT, which possesses relatively weak in-context learning capability, the integration of explanations yields a larger improvement in the success selection rate. This observation suggests that for backbones with limited in-context learning abilities, explanations can serve as a valuable tool to improve the success selection rate.
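For reference, the two conditions in Table 5 differ only in whether each few-shot exemplar's answer includes a reason. The helper below sketches both formats, loosely following the answer-only style of Table 7 and the answer-plus-explanation style of Table 9; the function name and exact field layout are our own.

def selection_exemplar(problem, choice_a, choice_b, picked, reason=None):
    """Format one few-shot exemplar for the selection prompt.
    With reason=None the answer is just the chosen letter (w/o exp);
    otherwise the answer also explains the other choice's error (w/ exp)."""
    prompt = (f"Math Problem: {problem}\n"
              f"(A) {choice_a}\n"
              f"(B) {choice_b}\n"
              "Which of the above two choices can correctly answer the math problem?\n")
    if reason is None:
        return prompt + f"Answer: ({picked})"
    return prompt + (f"({picked}) can correctly answer the math problem. "
                     f"Because {reason}")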
4.3 Qualitative Studies

We are also interested in gaining an intuitive understanding of the algorithm, so we conduct some qualitative analysis. As displayed in Table 6, in the first example, both ChatGPT and GPT-4 accurately identify that the PAL solution is correct. However, ChatGPT provides an incorrect explanation for its choice, while GPT-4 explains the error correctly.
Question: Courtney attended a concert and reported that the audience was 48 in number. However, Kelly went to the
same concert and said that Courtney had made the mistake of overstating the number of people in attendance by 20%.
If Kelly was right, how many people really attended the concert?
(A)
Answer:
Courtney reported 48 people in attendance.
If this number is overstated by 20%, then we need to find the actual number of people.
First, we need to find 20% of 48.
20% of 48 is (20/100) * 48 = 9.6.
Now, we subtract this number from Courtney's reported number to find the actual number of people.
48 - 9.6 = 38.4.
Since we cannot have a fraction of a person, we round the number to the nearest whole number.
So, the actual number of people in attendance was approximately 38.
So the answer is 38.

(B)
def solution():
    courtney_report = 48
    overstatement_percentage = 0.20
    actual_attendance = courtney_report / (1 + overstatement_percentage)
    result = int(actual_attendance)
    return result
Which of the above two choices can correctly answer the math problem?
ChatGPT: (B) can correctly answer the math problem. Because (A) rounds up the result to the nearest whole number,
which is not necessary.
GPT-4: (B) can correctly answer the math problem. Because (A) incorrectly calculated the actual number of people
by subtracting the overstatement percentage from the reported number instead of dividing it.
Question: How much does it cost you for lunch today at Subway if you pay $40 for a foot-long fish sub and thrice as
much for a six-inch cold-cut combo sub?
(A) (B)
Which of the above two choices can correctly answer the math problem?
ChatGPT: (A) can correctly answer the math problem. Because (B) calculates the cost of the six-inch cold-cut combo
sub as one-third of the cost of the foot-long fish sub instead of three times the cost of the foot-long fish sub.
GPT-4: (A) can correctly answer the math problem. Because (B) incorrectly calculates the cost of the six-inch cold-cut
combo sub as one-third of the foot-long fish sub instead of thrice as much.
Table 6: In the first example, ChatGPT performs model selection correctly but gives a wrong explanation, while GPT-4 gives a correct explanation. The second example shows a case where both ChatGPT and GPT-4 select correctly and give correct explanations.
A A detailed version of Theorem 1
In this appendix, we provide a detailed version of Theorem 1. Whereas Theorem 1 only states the existence of problem instances, Theorem 2 constructs such instances concretely; i.e., Theorem 2 implies Theorem 1. Define µx[X] to be the distribution for the expected errors: i.e., an expected error can be written as E_{x∼µx[X],y,f}[1[y ≠ f(x)]] for some function f. Define S[X] = {x ∈ X : R(x) < 0}. Let us denote by U[X] the uniform distribution over X. Given any X, we write n = |X|, T = |S[X]|, and α = T/n. Assume that 1 ≤ T < n.
Theorem 2. Let µx[X] = U[X] and let X be given such that |X| < ∞. Let ϵ, δ ∈ (0, 1) and λ ∈ (0, 1] be such that β = ϵT/(n − T) ∈ (0, 1) and λ ≥ 1 − (β/(ϵT))(n − T − δ). Let R and ρx be set such that R(x) = −ϵ for x ∈ S[X], R(x) = β for x ∈ X \ S[X], (1/T) Σ_{x∈S[X]} ρx = λ, and (1/(n − T)) Σ_{x∈X\S[X]} ρx = ϵ(T/(n − T))(1 − λ)β⁻¹ + δ/(n − T). Then, we have that err < err1, err1 ≤ err2, and

ρ = 1 − α + λ[2α − 1] + δ/n.

In particular, when α ≥ 0.5, we have ρ → 0 as α → 1 and δ/(n − T) → 0 (with λ = 1 − (β/(Tϵ))(n − T − δ)); when α < 0.5, we have ρ → 0 as α → 0 and δ/n → 0 (with λ = 1).
The proof of Theorem 2 is presented in Appendix B. Theorem 2 shows that the overall success probability of the selection process can be much worse than a random guess while still achieving an improvement over the base methods m1 and m2; i.e., err < err1 and err1 ≤ err2 can happen with ρ < 0.5. Indeed, it is possible to have ρ → 0 together with the improvement (err < err1 and err1 ≤ err2) when the size of X is large: when α ≥ 0.5, we can choose λ = 1 − (β/(Tϵ))(n − T − δ), with which err < err1, err1 ≤ err2, and ρ → 0 as α → 1 and δ/(n − T) → 0. When α < 0.5, we can choose λ = 1, with which err < err1, err1 ≤ err2, and ρ → 0 as α → 0 and δ/n → 0. This supports our proposition that, despite not training a new model for the selection process and with the in-context learning limited to a few-shot prompt, it is possible to achieve an improvement even if we do not achieve ρx > 0.5 on some instances.

Theorem 2 also suggests that if the overall performance of the two base methods is similar, as captured by ϵ, the selection process can be relatively weak and still achieve some improvement, as long as the success selection probability is relatively high on questions where the two methods have very different expected errors (or accuracies). In essence, Theorem 2 suggests a trade-off: we want |R(x)| to be large when deciding which two base methods m1 and m2 to choose, implying that we prefer base methods that perform dissimilarly on X. On the other hand, if the two base methods exhibit a substantial expected accuracy difference, then the selection process needs to be stronger (i.e., ρ needs to be larger) to improve the overall performance. If the expected accuracy difference between the two base methods is relatively small, however, increasing the power of the selection process is less necessary for boosting performance.
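To make the construction concrete, the sketch below instantiates Theorem 2 numerically; the particular values of n, T, ϵ, and δ are illustrative choices of ours, and the per-question errors are chosen arbitrarily subject to err1(x) − err2(x) = R(x), since only the difference matters for the improvement. The check confirms err < err1 and err1 ≤ err2 while the overall success rate ρ stays far below 0.5.

import numpy as np

# Illustrative instantiation of the Theorem 2 construction.
n, T = 1000, 950                 # alpha = T / n = 0.95
eps, delta = 0.05, 0.5
alpha = T / n
beta = eps * T / (n - T)                          # R(x) = +beta outside S[X]
lam = 1 - (beta / (T * eps)) * (n - T - delta)    # equals delta / (n - T)

# R(x) = err1(x) - err2(x): -eps on S[X] (m1 better there), +beta elsewhere.
R = np.concatenate([np.full(T, -eps), np.full(n - T, beta)])
rho_x = np.concatenate([
    np.full(T, lam),
    np.full(n - T, eps * T * (1 - lam) / ((n - T) * beta) + delta / (n - T)),
])

# Success means picking the per-question better method, so the probability of
# picking m2 is rho_x where m2 is better (R > 0) and 1 - rho_x where m1 is better.
p_m2 = np.where(R < 0, 1 - rho_x, rho_x)

# Per-question errors consistent with err1(x) - err2(x) = R(x), kept in [0, 1].
err1_x = (1 + R) / 2
err2_x = (1 - R) / 2

err1, err2 = err1_x.mean(), err2_x.mean()
err = ((1 - p_m2) * err1_x + p_m2 * err2_x).mean()
rho = rho_x.mean()

print(f"err={err:.6f}  err1={err1:.6f}  err2={err2:.6f}  rho={rho:.4f}")
assert err < err1 and err1 <= err2 + 1e-12 and rho < 0.5

# Matches the closed form from Theorem 2.
rho_closed = 1 - alpha + lam * (2 * alpha - 1) + delta / n
assert abs(rho - rho_closed) < 1e-9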
B Proofs
B.1 Proof of Proposition 1
Proof. Define acc = 1 − err and acci = 1 − erri for i ∈ {1, 2}. Since an expected error can be written as E[1[incorrect prediction]] = P(incorrect prediction) = 1 − P(correct prediction), we can expand acc in terms of p(mi | x), the probability of selecting method mi given x via the proposed method. Thus, we have err < err1 if Ex[|R(x)| (ρx − 1{R(x) < 0})] > 0.
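The displayed expansion behind this condition was lost in extraction; the sketch below reconstructs it in our own notation, writing erri(x) for the expected error of method mi on question x and taking R(x) = err1(x) − err2(x), the sign convention implied by the condition above.

\begin{align*}
\mathrm{err}
  &= \mathbb{E}_x\!\left[p(m_1 \mid x)\,\mathrm{err}_1(x) + p(m_2 \mid x)\,\mathrm{err}_2(x)\right] \\
  &= \mathrm{err}_1 - \mathbb{E}_x\!\left[p(m_2 \mid x)\,R(x)\right] \\
  &= \mathrm{err}_1 - \mathbb{E}_x\!\left[\,|R(x)|\,\bigl(\rho_x - \mathbf{1}\{R(x) < 0\}\bigr)\right],
\end{align*}

where the last step uses that ρx = p(m1 | x) when R(x) < 0 and ρx = p(m2 | x) when R(x) > 0.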
Under the construction of Theorem 2, this condition can be rewritten as Σ_{x∈X\S[X]} ρx > ϵT(1 − λ)/β, which is satisfied by the condition on ρ that (1/(n − T)) Σ_{x∈X\S[X]} ρx = ϵ(T/(n − T))(1 − λ)β⁻¹ + δ/(n − T) for some δ > 0, i.e.,

Σ_{x∈X\S[X]} ρx = ϵT(1 − λ)/β + δ.

With this choice,

ρ = α − [2α − 1]Q + δ/n.

Here,

Q = (β/(Tϵ))(n − T − δ) = (ϵT/(n − T)) · (1/(Tϵ)) · (n − T − δ) = (n − T − δ)/(n − T) = 1 − δ/(n − T).

Thus,

ρ = α − [2α − 1](1 − δ/(n − T)) + δ/n
  = α − 2α + 1 + δ(2α − 1)/(n − T) + δ/n
  = 1 − α + δ[(2α − 1)/(n − T) + 1/n] → 0

as α → 1 and δ/(n − T) → 0.
Math Problem: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Question: Which of the following two choices can correctly answer the math problem?
(A)
def solution():
    money_initial = 23
    bagels = 5
    bagel_cost = 3
    money_spent = bagels * bagel_cost
    money_left = money_initial - money_spent
    result = money_left
    return result

(B)
Answer:
Olivia had 23 dollars.
5 bagels for 3 dollars each will be 5 * 3 = 15 dollars.
So she has 23 - 5 = 18 dollars left.
The answer is 18.
Answer: (A)
Table 7: An example of 8-shot model selection prompts used on 7 arithmetic datasets with Codex.
Date Understanding Problem: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
Question: Which of the following two choices can correctly answer the date understanding problem?
(A)
def solution():
    # If 2015 is coming in 36 hours, then today is 36 hours before.
    today = datetime(2015, 1, 1) - relativedelta(hours=36)
    # One week from today,
    one_week_from_today = today + relativedelta(weeks=1)
    # The answer formatted with %m/%d/%Y is
    result = one_week_from_today.strftime('%m/%d/%Y')
    return result

(B)
A:
If 2015 is coming in 36 hours, then it is coming in 2 days.
2 days before 01/01/2015 is 12/30/2014, so today is 12/30/2014.
So one week from today will be 01/06/2015.
So the answer is 01/06/2015.
Answer: (A)
Table 8: An example of 6-shot model selection prompts used on Date Understanding task with Codex.
There are two choices to the same math problem. One uses natural language to answer the question, while the other
uses Python program to answer it. Either of them can correctly answer the math problem. You need to identify which
choice can correctly answer the math problem. Here is one example how to do it,
Math problem: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
(A) (B)
Which of the above two choices can correctly answer the math problem?
(A) can correctly answer the math problem. Because (B) adds the number of bagels to the cost of each bagel instead of
multiplying them.
Now it’s your turn. Here is another math problem and two choices.
Math Problem: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How
many golf balls did he have at the end of wednesday?
(A) (B)
Which of the above two choices can correctly answer the math problem?
(B) can correctly answer the math problem. Because (A) adds 2 more balls after losing 2 more on Wednesday instead
of subtracting them.
Table 9: Two examples of 5-shot model selection prompts used on 7 arithmetic datasets with ChatGPT.
There are two choices to the same date understanding problem. One uses natural language to answer the question,
while the other uses Python program to answer it. Either of them can correctly answer the date understanding problem.
You need to identify which choice can correctly answer the problem. Here is one example how to do it,
Date Understanding Problem: 2015 is coming in 36 hours. What is the date one week from today in MM/DD/YYYY?
(A) (B)
Which of the above two choices can correctly answer the date understanding problem?
(A) can correctly answer the date understanding problem. Because (B) incorrectly calculates the date 36 hours later
instead of 36 hours before.
Now it’s your turn. Here is another date understanding problem and two choices.
Date Understanding Problem: The first day of 2019 is a Tuesday, and today is the first Monday of 2019. What is the
date today in MM/DD/YYYY?
(A) (B)
Which of the above two choices can correctly answer the date understanding problem?
(B) can correctly answer the problem. Because (A) missed the fact that there are 6 days between the first day of 2019
and the first Monday of 2019.
Table 10: Two examples of 6-shot model selection prompts used on Date Understanding task with ChatGPT and
GPT-4.
There are two choices to the same math problem. One uses natural language to answer the question, while the other
uses Python code to answer it. Either of them can correctly answer the math problem. You need to identify which
choice can correctly answer the math problem. Here is one example how to do it,
Math problem: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
(A) (B)
Which of the above two choices can correctly answer the math problem?
(A) can correctly answer the math problem. Because (B) missed the fact that computers were added each day from
monday to thursday.
Now it’s your turn. Here is another math problem and two choices.
Math Problem: A piece of square paper has a perimeter of 32 centimeters. Nicky’s dog, Rocky, tore off 1/4 of the
paper. What is the area of the remaining paper?
(A) (B)
Which of the above two choices can correctly answer the math problem?
(B) can correctly answer the math problem. Because (A) incorrectly calculated the area of the torn-off portion instead
of the remaining portion.
Table 11: Two examples of 5-shot model selection prompts used on 7 arithmetic datasets with GPT-4.