
Speculations on Test-Time Scaling

Sasha Rush, Daniel Ritter
Cornell
[Brown et al., 2020]
[OpenAI, 2024]
[Hendrycks et al., 2021c]
AIME

For any finite set X, let |X| denote the number of elements in X. Define

S_n = Σ |A ∩ B|,

where the sum is taken over all ordered pairs (A, B) such that A and B are subsets of {1, 2, 3, ..., n} with |A| = |B|. For example, S_2 = 4 because the sum is taken over the pairs of subsets

(A, B) ∈ {(∅, ∅), ({1}, {1}), ({1}, {2}), ({2}, {1}), ({2}, {2}), ({1, 2}, {1, 2})},

giving S_2 = 0 + 1 + 0 + 0 + 1 + 2 = 4. Let S_2022 / S_2021 = p/q, where p and q are relatively prime positive integers. Find the remainder when p + q is divided by 1000.
[Sutton, 2019]
The Bitter Lesson

The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.
[Brown and Sandholm, 2017], https://x.com/polynoamial/status/1840822629625688469
Importance of Search

The most important [lesson] is that I and other researchers simply didn’t know how much of a difference scaling up search would make. If I had seen those scaling results at the start of my PhD, I would have shifted to researching search algorithms for poker much sooner and we probably would have gotten superhuman poker bots much sooner.
Sources

• Survey of the public literature

• Synthesis of discussions with experts

• Rumors from social media

Thanks to Lewis Tunstall, Edward Beeching, Aviral Kumar, Charlie Snell,


Michael Hassid, Yoav Artzi, Rishabh Agarwal, Kanishk Gandhi, Wenting Zhao,
Yuntian Deng, Nathan Lambert, Noah Goodman
Outline

Introduction

The Clues

Technical Background

The Suspects

What do we do now?
[OpenAI, 2024]
o1 Description

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process.
Implication

• RL; Signal from verifiable problems

• CoT; Test-time occurs in token stream

• Data-Efficient; Bounded set of problems


Current Assumptions

• Single final language model

• Not following from expert examples

• Behaviors are learned


[Wei et al., 2022, Nye et al., 2021]
Review: Chain of Thought

• The model generates intermediate steps.

• These steps provide a scratchpad for technical problems.

Example:
Question: 4 baskets. 3 have 9 apples, 15 oranges, 14 bananas each. 4th has 2 less of each. Total fruits?
Let’s solve step-by-step:
Fruits in one of first 3 baskets: 9 + 15 + 14 = 38
Total in first 3 baskets: 38 * 3 = 114
4th basket: (9-2) + (15-2) + (14-2) = 32
Total fruits: 114 + 32 = 146
Answer: 146 fruits
[OpenAI, 2024]
o1 Chain of Thought

o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.
[OpenAI, 2024]
o1 CoT: Outlining

Implementation Outline:
1. Capture input string as argument.
2. Remove any spaces (if any).
3. Parse input string to extract numbers as arrays.
- Since the input is in the format ’[1,2],[3,4]’, we can:
- Remove outer brackets if necessary.
- Split the string by ’],’ to get each row.
- For each row, remove ’[’ and ’]’, then split by ’,’ to get
elements.
4. Build a 2D array in bash (arrays containing arrays).
[OpenAI, 2024]
o1 CoT: Planning

First, the cipher seems connected to the plaintext.
Given the time constraints, perhaps the easiest way is to try to see patterns.
Option 1: Try to find mapping from letters to letters. Do any letters match?
First, let’s write down the ciphertext and plaintext letters on top of each other.
[OpenAI, 2024]
o1 CoT: Backtracking

Similarly, m(x) · (−x^2) = −x^{2n+2} + m_{2n−2} x^{2n} + lower terms.
Wait, actually, this may not help us directly without specific terms. An alternative is to consider the known polynomials.
So m(x) = k ...
[OpenAI, 2024]
o1 CoT: Self-Evaluation

Let’s analyze each option.
Option A: “because appetite regulation is a field of staggering complexity.”
Is that a good explanation? Hmm.
Option B: “because researchers seldom ask the right questions.”
Does this make sense with the main clause?
Summary

• CoT provides test-time scaling

• CoT looks like search / planning in a classical sense

• RL needed to induce this behavior


Outline

Introduction

The Clues

Technical Background

The Suspects

What do we do now?
[Cobbe et al., 2021]
Technical Background

• Formalize sampling of latent reasoning

• Techniques from combinatorial sampling

• No learning yet.

Example:
Question: 4 baskets. 3 have 9 apples, 15 oranges, 14 bananas each. 4th has 2 less of each. Total fruits?
Let’s solve step-by-step:
Fruits in one of first 3 baskets: 9 + 15 + 14 = 38
Total in first 3 baskets: 38 * 3 = 114
4th basket: (9-2) + (15-2) + (14-2) = 32
Total fruits: 114 + 32 = 146
Answer: 146 fruits
[Welleck et al., 2024]
Stepwise CoT Sampling

• x; problem specification

• z_{1:T} ∈ S^T; chain of thought (CoT) steps

• y ∈ Y; final answer

p(y|x) = E_z p(y|x, z)
[Wei et al., 2022]
Warm-up: Ancestral Sampling

z_{1:T} ∼ p(·|x)
y ∼ p(·|x, z_{1:T})

T is the amount of test-time compute.
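A minimal Python sketch of this warm-up, where `sample_step` and `sample_answer` are hypothetical stand-ins for an LLM's conditional distributions (not part of the tutorial); any real sampler with these signatures would slot in.

```python
import random

# Hypothetical stand-ins for an LLM's conditionals.
def sample_step(x, prefix):
    """z_t ~ p(. | x, z_{1:t-1})"""
    return f"step {len(prefix) + 1}: partial result {random.randint(0, 9)}"

def sample_answer(x, steps):
    """y ~ p(. | x, z_{1:T})"""
    return str(random.choice([142, 146]))

def ancestral_sample(x, T):
    """Sample a length-T chain of thought, then the answer conditioned on it."""
    z = []
    for _ in range(T):        # T controls the amount of test-time compute
        z.append(sample_step(x, z))
    y = sample_answer(x, z)
    return z, y

z, y = ancestral_sample("4 baskets ... total fruits?", T=4)
print(y)
```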


[Wang et al., 2022]
Monte-Carlo Self-Consistency

For n = 1 to N samples,

z^n_{1:T} ∼ p(·|x)
y^n ∼ p(·|x, z^n_{1:T})

Pick the majority choice among {y^n}.
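A sketch of majority voting over N sampled chains; `sample_cot_and_answer` is a hypothetical stand-in for the ancestral-sampling step above.

```python
import random
from collections import Counter

# Hypothetical LLM stand-in: one ancestral sample (z_{1:T}, y) for problem x.
def sample_cot_and_answer(x):
    z = [f"step {t}" for t in range(1, 4)]
    y = random.choice(["146", "146", "142"])    # answers vary across samples
    return z, y

def self_consistency(x, N=16):
    """Draw N independent chains and return the majority-vote final answer."""
    answers = [sample_cot_and_answer(x)[1] for _ in range(N)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("4 baskets ... total fruits?"))
```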


[Uesato et al., 2022]
Assumption: Verifier at Training

Ver_x : Y → {0, 1}

Common datasets:
• Regular expression for math [Cobbe et al., 2021]

• Unit test for code [Hendrycks et al., 2021a]

• Test questions for science [Hendrycks et al., 2021b]
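For concreteness, a toy verifier in the regular-expression style used for math answers; the `gold` reference string and the extraction pattern are illustrative assumptions, not any dataset's exact check.

```python
import re

def make_math_verifier(gold: str):
    """Ver_x: compare the last number in a generated answer against a reference."""
    def ver(y: str) -> bool:
        nums = re.findall(r"-?\d+(?:\.\d+)?", y.replace(",", ""))
        return bool(nums) and nums[-1] == gold
    return ver

ver = make_math_verifier("146")
print(ver("Total fruits: 114 + 32 = 146. Answer: 146 fruits"))  # True
print(ver("Answer: 148"))                                        # False
```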


[Nakano et al., 2021, Lightman et al., 2023]
Rejection Sampling
Best-of-N

For n = 1 to N :

z^n ∼ p(z|x)
y^n ∼ p(y|x, z^n)

Verified set {y^n : Ver_x(y^n)}
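A sketch of Best-of-N with a verifier; the sampler and verifier below are hypothetical stand-ins along the lines of the earlier sketches.

```python
import random

# Hypothetical stand-ins for the sampler and verifier.
def sample_cot_and_answer(x):
    z = [f"step {t}" for t in range(1, 4)]
    y = random.choice(["146", "142", "150"])
    return z, y

def verifier(y):
    return y == "146"

def best_of_n(x, N=8):
    """Sample N (z^n, y^n) pairs and keep only the verified answers."""
    samples = [sample_cot_and_answer(x) for _ in range(N)]
    return [(z, y) for z, y in samples if verifier(y)]

verified = best_of_n("4 baskets ... total fruits?")
print(len(verified), "verified samples out of 8")
```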


[Wang et al., 2023]
Monte-Carlo Roll-Outs

Given a partial CoT z_{1:t}, the expected value is

E_{y ∼ p(·|x, z_{1:t})} Ver(y)

Use Monte Carlo to estimate this expectation.
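A minimal sketch of this Monte-Carlo estimate, assuming a hypothetical `rollout_answer` that completes the CoT from the partial prefix.

```python
import random

# Hypothetical roll-out: finish the CoT from prefix z_{1:t} and return y.
def rollout_answer(x, partial_cot):
    return random.choice(["146", "142"])

def verifier(y):
    return y == "146"

def mc_value(x, partial_cot, N=32):
    """Monte-Carlo estimate of E_{y ~ p(.|x, z_{1:t})} Ver(y)."""
    wins = sum(verifier(rollout_answer(x, partial_cot)) for _ in range(N))
    return wins / N

print(mc_value("4 baskets ...", ["Fruits in one of first 3 baskets: 38"]))
```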
Goal: Learning with Latent CoTs

Maximum likelihood:

max_θ Σ_x log p(Ver(y)|x; θ) = Σ_x log E_z p(Ver(y)|x, z; θ)

Classic combinatorial expectation.
Reinforcement Learning

I will mostly elide the RL training question. Important practical choices:

• Batched? → Compute trajectories first, then train

• On-policy? → Sample from current model

• KL constraints on learning

• Specific algorithm choice (REINFORCE, PPO, etc.)


Outline

Introduction

The Clues

Technical Background

The Suspects

What do we do now?
The Suspects

• Guess + Check

• Guided Search

• AlphaZero

• Learning to Correct
Suspect 1: Guess + Check

• 1) Sample N CoTs

• 2) Check if successful

• 3) Train on good ones


[Neal and Hinton, 1998]
Framework: Rejection Sampling EM

max_θ Σ_x log E_{z∼p(z|x;θ)} p(Ver(y)|x, z)

• E-Step: For n = 1 to N:

  z^n ∼ p(·|x)
  y^n ∼ p(·|x, z^n)

  Keep verified set Z = {z^n : Ver(y^n)}

• M-Step: Fit θ′ ← arg max_θ Σ_{z∈Z} log p(z|x; θ)
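A compact sketch of this E-step / M-step loop (the STaR / ReST family); `sample_cot_and_answer`, `verifier`, and `fit` are hypothetical stand-ins for the model's sampler, the problem verifier, and the supervised fine-tuning step.

```python
import random

# Hypothetical stand-ins: theta parameterizes the sampler; `fit` is the M-step
# (e.g., supervised fine-tuning on the kept CoTs).
def sample_cot_and_answer(x, theta):
    z = [f"step {t}" for t in range(1, 3)]
    y = random.choice(["146", "142"])
    return z, y

def verifier(x, y):
    return y == "146"

def fit(kept, theta):
    """M-step stand-in: maximize the sum over kept z of log p(z | x; theta)."""
    return theta + 1

def rejection_sampling_em(problems, theta=0, rounds=3, N=8):
    for _ in range(rounds):
        kept = []                                   # E-step: sample and filter
        for x in problems:
            for _ in range(N):
                z, y = sample_cot_and_answer(x, theta)
                if verifier(x, y):
                    kept.append((x, z))
        theta = fit(kept, theta)                    # M-step: train on verified CoTs
    return theta

print(rejection_sampling_em(["4 baskets ... total fruits?"]))
```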
Variants
• Self-Training [Yarowsky, 1995]

• Best-of-N Training [Cobbe et al., 2021]

• STaR [Zelikman et al., 2022]

• ReST [Gulcehre et al., 2023]

• ReST-EM [Singh et al., 2023]

• Filtered Rejection Sampling [Nakano et al., 2021]


[Singh et al., 2023]
Empirical Results
Is this o1?

Pro
✓ Extremely simple and scalable
✓ Positive results in past work

Con
✗ No evidence this learns to correct, plan
✗ Computationally inefficient search
The Suspects

• Guess + Check

• Guided Search

• AlphaZero

• Learning to Correct
Suspect 2: Guided Search

• 1) During CoT sampling, use a heuristic to improve trajectories

• 2) Check if final versions are successful

• 3) Train on good ones


[Snell et al., 2024]
Framework: Beam Search with Guide

r : S^t → R; guide function

For each step t:
1. Sample many next steps, z^i_t ∼ p(·|x, z_{1:t−1})
2. Keep the top samples, ordered by r(z_t)
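A sketch of step-level beam search under a guide r; `sample_step` and `guide` are hypothetical stand-ins for the LLM step sampler and the guide function (a PRM, roll-out estimate, or self-evaluation score).

```python
import random

# Hypothetical stand-ins: an LLM step sampler and a guide r over CoT prefixes.
def sample_step(x, prefix):
    return f"step {len(prefix) + 1}, option {random.randint(0, 99)}"

def guide(x, prefix):
    return random.random()        # e.g., PRM score or Monte-Carlo roll-out value

def beam_search(x, T=4, beam=4, expand=4):
    """Keep `beam` prefixes; each step, sample `expand` continuations per prefix
    and retain the top `beam` by guide score."""
    beams = [[]]
    for _ in range(T):
        candidates = [b + [sample_step(x, b)] for b in beams for _ in range(expand)]
        candidates.sort(key=lambda z: guide(x, z), reverse=True)
        beams = candidates[:beam]
    return beams[0]               # best full chain of thought under the guide

print(beam_search("4 baskets ... total fruits?"))
```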
Guide Variants

• Monte Carlo Roll-outs

• Learned Value Function (PRM)

• Self-Evaluation Verifier
[Kazemnejad et al., 2024]
Beam Search with Roll-Outs

For a partial CoT z_{1:t−1} and candidate step z_t, roll out

y^n ∼ p(·|x, z_{1:t})

r_MC(z_t) = (1/N) Σ_{n=1}^{N} Ver(y^n)
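A sketch of the roll-out guide r_MC; `rollout_answer` and `verifier` are hypothetical stand-ins, and this score could serve as the `guide` in the beam-search sketch above.

```python
import random

# Hypothetical stand-ins for the roll-out sampler and verifier.
def rollout_answer(x, cot):
    return random.choice(["146", "142"])

def verifier(y):
    return y == "146"

def r_mc(x, prefix, candidate_step, N=16):
    """Score candidate step z_t by completing z_{1:t} N times and averaging Ver(y)."""
    cot = prefix + [candidate_step]
    return sum(verifier(rollout_answer(x, cot)) for _ in range(N)) / N

print(r_mc("4 baskets ...", ["step 1"], "Total in first 3 baskets: 38 * 3 = 114"))
```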
[Lightman et al., 2023, Wang et al., 2023]
Learned Value Function

• Rollouts are costly / require Ver

• Learn r_ψ(z_t) to approximate r_MC

• Use r_MC for labels
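A toy sketch of fitting r_ψ to Monte-Carlo labels, assuming a hypothetical `features` function in place of real LLM representations; an actual PRM would be a fine-tuned language-model head trained on the same kind of labels.

```python
import numpy as np

# Hypothetical featurizer for a partial CoT; in practice this would be an LLM
# representation with a trained head, not a hash-based projection.
def features(prefix_text, dim=16):
    rng = np.random.default_rng(abs(hash(prefix_text)) % (2**32))
    return rng.normal(size=dim)

def train_value_function(prefixes, mc_labels, dim=16, steps=500, lr=0.1):
    """Fit r_psi(z_{1:t}) to r_MC labels (success rates in [0, 1]) by logistic regression."""
    X = np.stack([features(p, dim) for p in prefixes])
    y = np.asarray(mc_labels, dtype=float)
    w = np.zeros(dim)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)     # gradient of the cross-entropy loss
    return lambda prefix: float(1.0 / (1.0 + np.exp(-features(prefix, dim) @ w)))

r_psi = train_value_function(
    ["step 1: 9 + 15 + 14 = 38", "step 1: 9 + 15 = 24"],
    [0.9, 0.2],      # Monte-Carlo estimates of success from each prefix
)
print(r_psi("step 1: 9 + 15 + 14 = 38"))
```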


[Xie et al., 2023]
Self-Evaluation

• Evaluation with an LLM

• Prompt the LLM to evaluate step-level correctness


Variants

• Search Heuristic

• Value Function

• PRM; Process Reward Model


[Uesato et al., 2022, Lightman et al., 2023]

• PAV; Process Advantage Verifier [Setlur et al., 2024]


[Wang et al., 2023]
Test-time Guides Outperform Self-consistency
Is this o1?

✓ RS needs to be more efficient.
✓ Learned rewards are effective.

✗ o1 is a single test-time model.
✗ Not clear if this is enough for planning.
Training Versus Test

• Learned value could be used at test-time

• Alternative can be trained into LLM

• Generative Verifier [Zhang et al., 2024]

Example:
Let’s analyze each option.
Option A: “because appetite regulation is a field of staggering complexity.”
Is that a good explanation? Hmm.
The Suspects

• Guess + Check

• Guided Search

• AlphaZero

• Learning to Correct
[Silver et al., 2017]
Reminder: AlphaZero

• Canonical example of self-learning

• Scaling model without data
Suspect 3: AlphaZero

• 1) Self-play using guided-search with exploration

• 2) Label final outcomes of self-play games

• 3) Train guide and generator


[Anthony et al., 2017]
Framework: Expert Iteration

• Iterative algorithm combining a learned model + expert search with a verifier.

• Generate samples using p(y, z|x), reward model r(z_t), and a search algorithm (e.g. beam search)

• Label samples using Ver_x(y)

• Train p(y, z|x), r(z_t) on the labeled samples, and repeat
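A sketch of this expert-iteration loop, assuming hypothetical `guided_search`, `verifier`, and `train` stand-ins for the search-based generator, the training-time verifier, and the update to p and r.

```python
import random

# Hypothetical components: a search-based generator, a verifier, and a training
# step that updates both the generator p(y, z | x) and the guide r(z_t).
def guided_search(x, params):
    z = [f"step {t}" for t in range(1, 4)]
    y = random.choice(["146", "142"])
    return z, y

def verifier(x, y):
    return y == "146"

def train(params, labeled):
    """Stand-in for fitting p and r on search outputs labeled by Ver_x."""
    return params + 1

def expert_iteration(problems, params=0, rounds=3, samples_per_problem=4):
    for _ in range(rounds):
        labeled = []
        for x in problems:
            for _ in range(samples_per_problem):
                z, y = guided_search(x, params)              # generate with search
                labeled.append((x, z, y, verifier(x, y)))    # label with the verifier
        params = train(params, labeled)                      # improve p and r, repeat
    return params

print(expert_iteration(["4 baskets ... total fruits?"]))
```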


[Hosseini et al., 2024]
Empirical Results: Expert Iteration
[Hubert et al., 2021, Feng et al., 2023]
UCB for Language

• Selection: Walk down tree to leaf z_{1:t−1}

• Expand: Sample ∼5 next steps z_t, pick one at random

• Rollouts: Sample steps z_{t+1} ... z_T

• Backprop: Update node counts N(z_{1:i}) and wins w(z_{1:i}) for parents
Generalization: MCTS
[Kocsis and Szepesvári, 2006]
Exploration

• MCTS-UCB explores states based on wins and the amount of exploration:

  w(z_{1:t}) / N(z_{1:t}) + sqrt( ln N(z_{1:t−1}) / N(z_{1:t}) )

• Less strict search process
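The selection rule as a short sketch; the exploration constant `c` is an assumption here, since the slide shows the two terms without an explicit weight.

```python
import math

def ucb_score(wins, visits, parent_visits, c=1.0):
    """UCB score for node z_{1:t}: exploitation term w/N plus an exploration
    bonus that is large for rarely visited children."""
    if visits == 0:
        return float("inf")          # expand unvisited children first
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Child A: strong win rate; child B: barely explored, so it gets a larger bonus.
print(ucb_score(wins=7, visits=10, parent_visits=30))
print(ucb_score(wins=0, visits=1, parent_visits=30))
```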


[Xie et al., 2024, Putta et al., 2024]
Learning from Search

• MCTS tree provides path preferences

• Can be used for preference learning (e.g. DPO)

• Alternative to learning on chains


Is this o1?

✓ Major demonstrated RL result
✓ Scales to more train-time search

✗ Costly to maintain open states
✗ More complex algorithmically
✗ OpenAI comments / rumors
The Suspects

• Guess + Check

• Guided Search

• AlphaZero

• Learning to Correct
What does exploration look like?

• Game Playing - Explore alternative moves.

• Language - Nearly infinite “moves”

• Exploration to learn strategies


Suspect 4: Learning to Correct

• 1) Start with failed CoT

• 2) Search to find successful corrections

• 3) Train on full CoT


[Welleck et al., 2022]
Framework: Self-Correction

• Aim: Find similar CoT pairs z′, z′′ where z′′ is better.

• Train the model to improve upon z′.
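A sketch of mining (z′, z′′) pairs from sampled CoTs, with a hypothetical sampler, verifier, and toy similarity measure; real pipelines would pair a model's own failed drafts with verified fixes.

```python
import random

# Hypothetical sampler and verifier; the goal is (z', z'') pairs where z'' is a
# verified CoT close to a failed draft z'.
def sample_cot_and_answer(x):
    z = [f"step {t} v{random.randint(0, 3)}" for t in range(1, 4)]
    y = random.choice(["146", "142"])
    return z, y

def verifier(y):
    return y == "146"

def similarity(z1, z2):
    """Toy proxy for 'similar CoTs': fraction of shared steps."""
    return len(set(z1) & set(z2)) / max(len(z1), len(z2))

def build_correction_pairs(x, N=32, min_sim=0.3):
    samples = [sample_cot_and_answer(x) for _ in range(N)]
    bad = [z for z, y in samples if not verifier(y)]
    good = [z for z, y in samples if verifier(y)]
    # Train the model to map z' -> z'' (improve upon its own failed draft).
    return [(zb, zg) for zb in bad for zg in good if similarity(zb, zg) >= min_sim]

print(len(build_correction_pairs("4 baskets ... total fruits?")), "pairs")
```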
[Gandhi et al., 2024]
Challenges: Learning Correction

• Collapse: Model may learn to just ignore the negative example

• Distribution Shift: Actual mistakes may deviate from examples
[Gandhi et al., 2024]
RL from Mistakes

• Start with z′

• Learn to correct from the verifier
Empirical Results
[Gandhi et al., 2024]
Generalization: Stream of Search


• Find z_{1:T} as an optimal-length CoT

• Find z′_{1:T′} with T′ > T through backtracking tree search

• Train the model on z′_{1:T′}
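A sketch of linearizing a backtracking tree search into a single training stream that keeps dead ends and explicit backtracking markers; the node format and markers are illustrative assumptions, not the paper's exact serialization.

```python
# Hypothetical node format: (state_text, children, is_solution). The stream keeps
# failed branches and backtracking markers so the model can learn to recover.
def search_to_stream(node, stream=None):
    if stream is None:
        stream = []
    text, children, is_solution = node
    stream.append(text)
    if is_solution:
        stream.append("=> solution found")
        return stream, True
    for child in children:
        _, solved = search_to_stream(child, stream)
        if solved:
            return stream, True
        stream.append(f"backtrack from: {child[0]}")   # record the dead end
    return stream, False

tree = ("start", [
    ("try: 9 + 15 = 24, forgot bananas", [], False),               # dead end
    ("try: 9 + 15 + 14 = 38", [("38 * 3 + 32 = 146", [], True)], False),
], False)

stream, _ = search_to_stream(tree)
print("\n".join(stream))
```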
From Tree to Stream

• Tree search explores multiple paths

• Stream presents a linear sequence

• Allows the model to make mistakes in the stream
Is this o1?

✓ Learns to correct and plan
✓ Single test-time model

✗ Complex training process
✗ Limited empirical evidence


Outline

Introduction

The Clues

Technical Background

The Suspects

What do we do now?
[Ouyang et al., 2022]
Does it need to be the same?
[Tunstall et al., 2023]
Open-Source Models
Let’s Build

• Once the result is established there should be an easier path

• Open-source tools need to be improved to scale these pipelines
Let’s Build

Thank You
https://github.com/srush/awesome-o1
Reference I
[Anthony et al., 2017] Anthony, T., Tian, Z., and Barber, D. (2017).
Thinking fast and slow with deep learning and tree search.
arXiv [cs.AI].

[Brown and Sandholm, 2017] Brown, N. and Sandholm, T. (2017).


Libratus: The superhuman AI for no-limit poker.
In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence,
California. International Joint Conferences on Artificial Intelligence Organization.

[Brown et al., 2020] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P.,
Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen,
M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,
A., Sutskever, I., and Amodei, D. (2020).
Language models are few-shot learners.
arXiv [cs.CL].
Reference II
[Cobbe et al., 2021] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L.,
Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. (2021).
Training verifiers to solve math word problems.
arXiv [cs.LG].

[Feng et al., 2023] Feng, X., Wan, Z., Wen, M., McAleer, S. M., Wen, Y., Zhang, W., and
Wang, J. (2023).
Alphazero-like tree-search can guide large language model decoding and training.
arXiv [cs.LG].

[Gandhi et al., 2024] Gandhi, K., Lee, D., Grand, G., Liu, M., Cheng, W., Sharma, A., and
Goodman, N. D. (2024).
Stream of search (SoS): Learning to search in language.
arXiv [cs.LG].
Reference III
[Gulcehre et al., 2023] Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L.,
Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., Macherey, W., Doucet, A., Firat, O.,
and de Freitas, N. (2023).
Reinforced self-training (ReST) for language modeling.
arXiv [cs.CL].

[Hendrycks et al., 2021a] Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A.,
Guo, E., Burns, C., Puranik, S., He, H., Song, D., and Steinhardt, J. (2021a).
Measuring coding challenge competence with APPS.
arXiv [cs.SE].

[Hendrycks et al., 2021b] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song,
D., and Steinhardt, J. (2021b).
Measuring massive multitask language understanding.
Reference IV
[Hendrycks et al., 2021c] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang,
E., Song, D., and Steinhardt, J. (2021c).
Measuring mathematical problem solving with the MATH dataset.
arXiv [cs.LG].

[Hosseini et al., 2024] Hosseini, A., Yuan, X., Malkin, N., Courville, A., Sordoni, A., and
Agarwal, R. (2024).
V-STar: Training verifiers for self-taught reasoners.
In First Conference on Language Modeling.

[Hubert et al., 2021] Hubert, T., Schrittwieser, J., Antonoglou, I., Barekatain, M., Schmitt, S.,
and Silver, D. (2021).
Learning and planning in complex action spaces.
arXiv [cs.LG].
Reference V
[Kazemnejad et al., 2024] Kazemnejad, A., Aghajohari, M., Portelance, E., Sordoni, A.,
Reddy, S., Courville, A., and Roux, N. L. (2024).
VinePPO: Unlocking RL potential for LLM reasoning through refined credit assignment.
arXiv [cs.LG].

[Kocsis and Szepesvári, 2006] Kocsis, L. and Szepesvári, C. (2006).


Bandit based monte-carlo planning.
In Lecture Notes in Computer Science, Lecture notes in computer science, pages 282–293.
Springer Berlin Heidelberg, Berlin, Heidelberg.

[Lightman et al., 2023] Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T.,
Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023).
Let’s verify step by step.
arXiv [cs.LG].
Reference VI
[Nakano et al., 2021] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C.,
Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button,
K., Knight, M., Chess, B., and Schulman, J. (2021).
WebGPT: Browser-assisted question-answering with human feedback.
arXiv [cs.CL].

[Neal and Hinton, 1998] Neal, R. M. and Hinton, G. E. (1998).


A view of the em algorithm that justifies incremental, sparse, and other variants.
In Learning in Graphical Models, pages 355–368. Springer Netherlands, Dordrecht.

[Nye et al., 2021] Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber,
D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., Sutton, C., and Odena, A. (2021).
Show your work: Scratchpads for intermediate computation with language models.
arXiv [cs.LG].
Reference VII
[OpenAI, 2024] OpenAI (2024).
Learning to reason with LLMs.
https://openai.com/index/learning-to-reason-with-llms/.
Accessed: 2024-10-29.
[Ouyang et al., 2022] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin,
P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L.,
Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. (2022).
Training language models to follow instructions with human feedback.
arXiv [cs.CL], pages 27730–27744.

[Putta et al., 2024] Putta, P., Mills, E., Garg, N., Motwani, S., Finn, C., Garg, D., and Rafailov,
R. (2024).
Agent Q: Advanced reasoning and learning for autonomous AI agents.
arXiv [cs.AI].
Reference VIII

[Setlur et al., 2024] Setlur, A., Nagpal, C., Fisch, A., Geng, X., Eisenstein, J., Agarwal, R.,
Agarwal, A., Berant, J., and Kumar, A. (2024).
Rewarding progress: Scaling automated process verifiers for LLM reasoning.
arXiv [cs.LG].

[Silver et al., 2017] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A.,
Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K., and Hassabis, D.
(2017).
Mastering chess and shogi by self-play with a general reinforcement learning algorithm.
arXiv [cs.AI].
Reference IX
[Singh et al., 2023] Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Garcia, X., Liu,
P. J., Harrison, J., Lee, J., Xu, K., Parisi, A., Kumar, A., Alemi, A., Rizkowsky, A., Nova, A.,
Adlam, B., Bohnet, B., Elsayed, G., Sedghi, H., Mordatch, I., Simpson, I., Gur, I., Snoek, J.,
Pennington, J., Hron, J., Kenealy, K., Swersky, K., Mahajan, K., Culp, L., Xiao, L., Bileschi,
M. L., Constant, N., Novak, R., Liu, R., Warkentin, T., Qian, Y., Bansal, Y., Dyer, E.,
Neyshabur, B., Sohl-Dickstein, J., and Fiedel, N. (2023).
Beyond human data: Scaling self-training for problem-solving with language models.
arXiv [cs.LG].

[Snell et al., 2024] Snell, C., Lee, J., Xu, K., and Kumar, A. (2024).
Scaling LLM test-time compute optimally can be more effective than scaling model
parameters.
arXiv [cs.LG].

[Sutton, 2019] Sutton, R. (2019).


The bitter lesson.
Reference X
[Tunstall et al., 2023] Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada,
Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush,
A. M., and Wolf, T. (2023).
Zephyr: Direct distillation of LM alignment.
arXiv [cs.LG].

[Uesato et al., 2022] Uesato, J., Kushman, N., Kumar, R., Song, F., Siegel, N., Wang, L.,
Creswell, A., Irving, G., and Higgins, I. (2022).
Solving math word problems with process- and outcome-based feedback.
arXiv [cs.LG].

[Wang et al., 2023] Wang, P., Li, L., Shao, Z., Xu, R. X., Dai, D., Li, Y., Chen, D., Wu, Y., and
Sui, Z. (2023).
Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations.
arXiv [cs.AI].
Reference XI
[Wang et al., 2022] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S.,
Chowdhery, A., and Zhou, D. (2022).
Self-consistency improves chain of thought reasoning in language models.
arXiv [cs.CL].

[Wei et al., 2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E.,
Le, Q., and Zhou, D. (2022).
Chain-of-thought prompting elicits reasoning in large language models.
arXiv [cs.CL], pages 24824–24837.

[Welleck et al., 2024] Welleck, S., Bertsch, A., Finlayson, M., Schoelkopf, H., Xie, A.,
Neubig, G., Kulikov, I., and Harchaoui, Z. (2024).
From decoding to meta-generation: Inference-time algorithms for large language
models.
arXiv [cs.CL].
Reference XII
[Welleck et al., 2022] Welleck, S., Lu, X., West, P., Brahman, F., Shen, T., Khashabi, D., and
Choi, Y. (2022).
Generating sequences by learning to self-correct.
arXiv [cs.CL].

[Xie et al., 2024] Xie, Y., Goyal, A., Zheng, W., Kan, M.-Y., Lillicrap, T. P., Kawaguchi, K., and
Shieh, M. (2024).
Monte carlo tree search boosts reasoning via iterative preference learning.
arXiv [cs.AI].

[Xie et al., 2023] Xie, Y., Kawaguchi, K., Zhao, Y., Zhao, X., Kan, M.-Y., He, J., and Xie, Q.
(2023).
Self-evaluation guided beam search for reasoning.
arXiv [cs.CL].
Reference XIII

[Yarowsky, 1995] Yarowsky, D. (1995).


Unsupervised word sense disambiguation rivaling supervised methods.
In Proceedings of the 33rd annual meeting on Association for Computational Linguistics -,
Morristown, NJ, USA. Association for Computational Linguistics.

[Zelikman et al., 2022] Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. (2022).
STaR: Bootstrapping reasoning with reasoning.
arXiv [cs.LG].

[Zhang et al., 2024] Zhang, L., Hosseini, A., Bansal, H., Kazemi, M., Kumar, A., and Agarwal,
R. (2024).
Generative verifiers: Reward modeling as next-token prediction.
arXiv [cs.LG].
