o1 Tutorial on Test-Time Scaling
Sasha Rush, Daniel Ritter
Cornell
[Brown et al., 2020]
[OpenAI, 2024]
[Hendrycks et al., 2021c]
AIME
For any finite set X, let |X| denote the number of elements in X. Define
Sn = Σ |A ∩ B|,
where the sum is taken over all ordered pairs (A, B) such that A and
B are subsets of {1, 2, 3, · · · , n} with |A| = |B|. For example, S2 = 4
because the sum is taken over the pairs of subsets
(A, B) ∈ {(∅, ∅), ({1}, {1}), ({1}, {2}), ({2}, {1}), ({2}, {2}), ({1, 2}, {1, 2})}
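As a quick sanity check on the worked example, a short brute-force computation (not part of the original slides) confirms S2 = 4:

from itertools import combinations

def S(n):
    # all subsets of {1, ..., n}
    subs = [frozenset(c) for r in range(n + 1)
            for c in combinations(range(1, n + 1), r)]
    # sum |A ∩ B| over ordered pairs (A, B) with |A| = |B|
    return sum(len(A & B) for A in subs for B in subs if len(A) == len(B))

print(S(2))  # 4, matching the six pairs listed above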
Introduction
The Clues
Technical Background
The Suspects
What do we do now?
[OpenAI, 2024]
o1 Description
Implementation Outline:
1. Capture input string as argument.
2. Remove any spaces (if any).
3. Parse input string to extract numbers as arrays.
- Since the input is in the format '[1,2],[3,4]', we can:
- Remove outer brackets if necessary.
- Split the string by '],' to get each row.
- For each row, remove '[' and ']', then split by ',' to get elements.
4. Build a 2D array in bash (arrays containing arrays).
[OpenAI, 2024]
o1 CoT: Planning
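The outline above is o1's plan for a bash script; as a rough illustration only (not OpenAI's code), the same parsing steps look like this in Python:

def parse_matrix(s: str) -> list[list[int]]:
    s = s.replace(" ", "")              # 2. remove any spaces
    rows = s.split("],")                # 3. split by '],' to get each row
    return [[int(v) for v in row.strip("[]").split(",")]   # strip '[' / ']', split by ','
            for row in rows]            # 4. build the 2D array

print(parse_matrix("[1,2],[3,4]"))      # [[1, 2], [3, 4]]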
Introduction
The Clues
Technical Background
The Suspects
What do we do now?
[Cobbe et al., 2021]
Technical Background
• x: problem specification
• y ∈ Y: final answer
• z: chain of thought (intermediate reasoning steps)
p(y|x) = E_z p(y|x, z)
[Wei et al., 2022]
Warm-up: Ancestral Sampling
z_{1:T} ∼ p(·|x)
y ∼ p(·|x, z_{1:T})
For N samples,
z^n_{1:T} ∼ p(·|x)
y^n ∼ p(·|x, z^n_{1:T})
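A minimal sketch of this ancestral sampling loop; sample_cot and sample_answer are hypothetical stand-ins for the two calls to the model p (assumptions of this sketch, not names from the slides):

from typing import Callable

def ancestral_samples(x: str,
                      sample_cot: Callable[[str], str],
                      sample_answer: Callable[[str, str], str],
                      N: int) -> list[tuple[str, str]]:
    # Draw N independent (z, y) pairs by ancestral sampling.
    samples = []
    for _ in range(N):
        z = sample_cot(x)           # z^n_{1:T} ~ p(.|x)
        y = sample_answer(x, z)     # y^n ~ p(.|x, z^n_{1:T})
        samples.append((z, y))
    return samples

In practice both calls would hit the same LLM, differing only in prompt and stop condition.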
Ver_x : Y → {0, 1}
Common datasets:
• Regular expression for math [Cobbe et al., 2021]
For n = 1 to N:
z^n ∼ p(z|x)
y^n ∼ p(y|x, z^n)
E_{y∼p(·|x)} Ver(y)
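A toy verifier and success-rate estimate in this spirit; the regex assumes a GSM8K-style final answer marked with '####' (the convention from [Cobbe et al., 2021]), so treat the exact pattern as an assumption of the sketch:

import re

def make_verifier(gold: str):
    # Ver_x : Y -> {0, 1}, here via a regular expression over the final answer.
    def ver(y: str) -> int:
        m = re.search(r"####\s*(-?[\d,.]+)", y)
        return int(m is not None and m.group(1).replace(",", "") == gold)
    return ver

def estimate_success(samples: list[str], ver) -> float:
    # Monte Carlo estimate of E_{y ~ p(.|x)} Ver(y) from N sampled answers.
    return sum(ver(y) for y in samples) / len(samples)

ver = make_verifier("42")
print(estimate_success(["... #### 42", "... #### 41"], ver))   # 0.5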
Maximum likelihood:
log E_z p(Ver(y)|x, z; θ)
• Classic combinatorial expectation
Reinforcement Learning
• KL Constraints on learning.
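The bullets above are compressed; for reference, the usual way to write the KL-constrained RL objective (the standard RLHF-style formulation, e.g. [Ouyang et al., 2022], rather than a formula taken verbatim from these slides) is

max_θ E_{(z, y) ∼ p_θ(·|x)} [Ver(y)] − β · KL(p_θ(·|x) ‖ p_ref(·|x))

where p_ref is the frozen reference model and β sets the strength of the constraint.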
Introduction
The Clues
Technical Background
The Suspects
What do we do now?
The Suspects
• Guess + Check
• Guided Search
• AlphaZero
• Learning to Correct
Suspect 1: Guess + Check
• 1) Sample N CoTs
• 2) Check if successful
• E-Step: For n = 1 to N:
z^n ∼ p(·|x)
y^n ∼ p(·|x, z^n)
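A minimal sketch of this E-step, reusing the hypothetical sample_cot / sample_answer / ver helpers from the earlier sketches: keep only the traces whose answer passes the verifier, and use them later as fine-tuning data.

def guess_and_check(x, sample_cot, sample_answer, ver, N):
    kept = []
    for _ in range(N):
        z = sample_cot(x)           # z^n ~ p(.|x)
        y = sample_answer(x, z)     # y^n ~ p(.|x, z^n)
        if ver(y):                  # check if successful
            kept.append((x, z, y))  # surviving traces become training data
    return kept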
Pro
✓ Extremely simple and scalable
✓ Positive results in past work
Con
× No evidence this learns to correct, plan, search
× Computationally inefficient
The Suspects
• Guess + Check
• Guided Search
• AlphaZero
• Learning to Correct
Suspect 2: Guided Search
r : S_t → R; guide function
[Snell et al., 2024]
Framework: Beam Search with Guide
r : S_t → R; guide function
For each step t:
1. Sample many next steps, z_t^i ∼ p(·|x, z_{1:t−1})
2. Score each candidate prefix with the guide r
3. Keep the top-scoring prefixes for the next step (sketched below)
• Self-Evaluation Verifier
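A rough sketch of this framework; sample_step (one next reasoning step from the model) and guide (any of the guide functions discussed here) are hypothetical stand-ins, and the beam sizes are illustrative.

def guided_beam_search(x, sample_step, guide, num_steps, beam_width=4, expansions=4):
    beams = [[]]                                    # each beam is a partial z_{1:t-1}
    for t in range(num_steps):
        candidates = []
        for z in beams:
            for _ in range(expansions):
                candidates.append(z + [sample_step(x, z)])        # z_t^i ~ p(.|x, z_{1:t-1})
        candidates.sort(key=lambda c: guide(x, c), reverse=True)  # 2. score with r
        beams = candidates[:beam_width]             # 3. keep the top-scoring prefixes
    return beams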
[Kazemnejad et al., 2024]
Beam Search with Roll-Outs
y^n ∼ p(·|x, z_{1:t})
r_MC(z_t) = (1/N) Σ_{n=1}^{N} Ver(y^n)
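A sketch of the roll-out guide itself: estimate the value of a partial trace by sampling N completions and averaging the verifier. rollout_answer is a hypothetical helper that draws y ∼ p(·|x, z_{1:t}).

def r_mc(x, z_prefix, rollout_answer, ver, N=8):
    wins = 0
    for _ in range(N):
        y = rollout_answer(x, z_prefix)   # y^n ~ p(.|x, z_{1:t})
        wins += ver(y)                    # Ver(y^n) in {0, 1}
    return wins / N                       # r_MC(z_t) = (1/N) Σ_n Ver(y^n)

Plugged into the beam-search sketch above as guide=lambda x, z: r_mc(x, z, rollout_answer, ver), this gives beam search with roll-outs.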
[Lightman et al., 2023, Wang et al., 2023]
Learned Value Function
• Learn r_ψ(z_t) to approximate the roll-out value r_MC(z_t)
• Search Heuristic
• Value Function
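One hedged sketch of where the training signal for r_ψ could come from (in the spirit of Monte-Carlo-labelled process rewards [Wang et al., 2023]): label every prefix of a sampled trace with its roll-out value and regress r_ψ onto those targets. The regression itself is omitted; value_estimate could be the r_mc sketch above.

def collect_value_targets(x, trace, value_estimate):
    # trace = [z_1, ..., z_T]; returns (prefix, target) pairs for fitting r_psi.
    return [(trace[:t], value_estimate(x, trace[:t]))
            for t in range(1, len(trace) + 1)]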
• Guess + Check
• Guided Search
• AlphaZero
• Learning to Correct
[Silver et al., 2017]
Reminder: AlphaZero
• Canonical example of self-learning
• Scaling model without data
Suspect 3: AlphaZero
✓ Major demonstrated RL result
• Guess + Check
• Guided Search
• AlphaZero
• Learning to Correct
What does exploration look like?
• Start with z^0
• Find z*_{1:T} as an optimal-length CoT
• Find z'_{1:T'} with T' > T
• Train model on z'_{1:T'}
From Tree to Stream
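One way to read "from tree to stream": flatten the whole search, failed branches included, into a single long chain of thought and train on that longer trace. A very rough sketch under that reading, reusing the hypothetical helpers from earlier:

def stream_from_attempts(x, sample_cot, sample_answer, ver, max_tries=8):
    stream = []
    for _ in range(max_tries):
        z = sample_cot(x)
        y = sample_answer(x, z)
        stream.append(z)
        if ver(y):
            stream.append(f"Answer: {y}")
            return "\n".join(stream)      # the longer trace z'_{1:T'} to train on
        stream.append("That attempt failed; try again.")
    return None                           # no verified answer within budget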
Introduction
The Clues
Technical Background
The Suspects
What do we do now?
[Ouyang et al., 2022]
Does it need to be the same?
[Tunstall et al., 2023]
Open-Source Models
Let’s Build
Thank You
https://github.com/srush/awesome-o1
Reference I
[Anthony et al., 2017] Anthony, T., Tian, Z., and Barber, D. (2017).
Thinking fast and slow with deep learning and tree search.
arXiv [cs.AI].
[Brown et al., 2020] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen,
M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,
A., Sutskever, I., and Amodei, D. (2020).
Language models are few-shot learners.
arXiv [cs.CL].
Reference II
[Cobbe et al., 2021] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L.,
Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021).
Training verifiers to solve math word problems.
arXiv [cs.LG].
[Feng et al., 2023] Feng, X., Wan, Z., Wen, M., McAleer, S. M., Wen, Y., Zhang, W., and
Wang, J. (2023).
AlphaZero-like tree-search can guide large language model decoding and training.
arXiv [cs.LG].
[Gandhi et al., 2024] Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and
Goodman, N. D. (2024).
Stream of search (SoS): Learning to search in language.
arXiv [cs.LG].
Reference III
[Gulcehre et al., 2023] Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L.,
Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., Macherey, W., Doucet, A., Firat, O.,
and de Freitas, N. (2023).
Reinforced self-training (ReST) for language modeling.
arXiv [cs.CL].
[Hendrycks et al., 2021a] Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A.,
Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. (2021a).
Measuring coding challenge competence with APPS.
arXiv [cs.SE].
[Hendrycks et al., 2021b] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song,
D., and Steinhardt, J. (2021b).
Measuring massive multitask language understanding.
Reference IV
[Hendrycks et al., 2021c] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang,
E., Song, D., and Steinhardt, J. (2021c).
Measuring mathematical problem solving with the MATH dataset.
arXiv [cs.LG].
[Hosseini et al., 2024] Hosseini, A., Yuan, X., Malkin, N., Courville, A., Sordoni, A., and
Agarwal, R. (2024).
V-STaR: Training verifiers for self-taught reasoners.
In First Conference on Language Modeling.
[Hubert et al., 2021] Hubert, T., Schrittwieser, J., Antonoglou, I., Barekatain, M., Schmitt, S.,
and Silver, D. (2021).
Learning and planning in complex action spaces.
arXiv [cs.LG].
Reference V
[Kazemnejad et al., 2024] Kazemnejad, A., Aghajohari, M., Portelance, E., Sordoni, A.,
Reddy, S., Courville, A., and Roux, N. L. (2024).
VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment.
arXiv [cs.LG].
[Lightman et al., 2023] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T.,
Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023).
Let’s verify step by step.
arXiv [cs.LG].
Reference VI
[Nakano et al., 2021] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C.,
Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button,
K., Knight, M., Chess, B., and Schulman, J. (2021).
WebGPT: Browser-assisted question-answering with human feedback.
arXiv [cs.CL].
[Nye et al., 2021] Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber,
D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. (2021).
Show your work: Scratchpads for intermediate computation with language models.
arXiv [cs.LG].
Reference VII
[OpenAI, 2024] OpenAI (2024).
Learning to reason with LLMs.
https://openai.com/index/learning-to-reason-with-llms/.
Accessed: 2024-10-29.
[Ouyang et al., 2022] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin,
P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022).
Training language models to follow instructions with human feedback.
arXiv [cs.CL], pages 27730–27744.
[Putta et al., 2024] Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., and Rafailov,
R. (2024).
Agent Q: Advanced reasoning and learning for autonomous AI agents.
arXiv [cs.AI].
Reference VIII
[Setlur et al., 2024] Setlur, A., Nagpal, C., Fisch, A., Geng, X., Eisenstein, J., Agarwal, R.,
Agarwal, A., Berant, J., and Kumar, A. (2024).
Rewarding progress: Scaling automated process verifiers for LLM reasoning.
arXiv [cs.LG].
[Silver et al., 2017] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A.,
Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D.
(2017).
Mastering chess and shogi by self-play with a general reinforcement learning algorithm.
arXiv [cs.AI].
Reference IX
[Singh et al., 2023] Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu,
P. J., Harrison, J., Lee, J., Xu, K., Parisi, A., Kumar, A., Alemi, A., Rizkowsky, A., Nova, A.,
Adlam, B., Bohnet, B., Elsayed, G., Sedghi, H., Mordatch, I., Simpson, I., Gur, I., Snoek, J.,
Pennington, J., Hron, J., Kenealy, K., Swersky, K., Mahajan, K., Culp, L., Xiao, L., Bileschi,
M. L., Constant, N., Novak, R., Liu, R., Warkentin, T., Qian, Y., Bansal, Y., Dyer, E.,
Neyshabur, B., Sohl-Dickstein, J., and Fiedel, N. (2023).
Beyond human data: Scaling self-training for problem-solving with language models.
arXiv [cs.LG].
[Snell et al., 2024] Snell, C., Lee, J., Xu, K., and Kumar, A. (2024).
Scaling LLM test-time compute optimally can be more effective than scaling model
parameters.
arXiv [cs.LG].
Reference X
[Uesato et al., 2022] Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L.,
Creswell, A., Irving, G., and Higgins, I. (2022).
Solving math word problems with process- and outcome-based feedback.
arXiv [cs.LG].
[Wang et al., 2023] Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., and
Sui, Z. (2023).
Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations.
arXiv [cs.AI].
Reference XI
[Wang et al., 2022] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S.,
Chowdhery, A., and Zhou, D. (2022).
Self-consistency improves chain of thought reasoning in language models.
arXiv [cs.CL].
[Wei et al., 2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.,
Le, Q., and Zhou, D. (2022).
Chain-of-thought prompting elicits reasoning in large language models.
arXiv [cs.CL], pages 24824–24837.
[Welleck et al., 2024] Welleck, S., Bertsch, A., Finlayson, M., Schoelkopf, H., Xie, A.,
Neubig, G., Kulikov, I., and Harchaoui, Z. (2024).
From decoding to meta-generation: Inference-time algorithms for large language
models.
arXiv [cs.CL].
Reference XII
[Welleck et al., 2022] Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and
Choi, Y. (2022).
Generating sequences by learning to self-correct.
arXiv [cs.CL].
[Xie et al., 2024] Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and
Shieh, M. (2024).
Monte Carlo tree search boosts reasoning via iterative preference learning.
arXiv [cs.AI].
[Xie et al., 2023] Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q.
(2023).
Self-evaluation guided beam search for reasoning.
arXiv [cs.CL].
Reference XIII
[Zelikman et al., 2022] Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. (2022).
STaR: Bootstrapping reasoning with reasoning.
arXiv [cs.LG].
[Zhang et al., 2024] Zhang, L., Hosseini, A., Bansal, H., Kazemi, M., Kumar, A., and Agarwal,
R. (2024).
Generative verifiers: Reward modeling as next-token prediction.
arXiv [cs.LG].