
Copyright Notice

These slides are distributed under the Creative Commons License.

DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.

For the rest of the details of the license, see


https://creativecommons.org/licenses/by-sa/2.0/legalcode
Reinforcement Learning from Human Feedback (RLHF)
Generative AI project lifecycle

Scope → Select → Adapt and align model → Application integration
● Scope: Define the problem
● Select: Choose model
● Adapt and align model: Prompt engineering; Fine-tuning; Align with human feedback; Evaluate
● Application integration: Optimize and deploy model for inference; Augment model and build LLM-powered applications
Generative AI project lifecycle

Scope → Select → Adapt and align model → Application integration

Goals of fine-tuning
● Better understanding of prompts
● Better task completion
● More natural sounding language
Models behaving badly
● Toxic language
● Aggressive responses
● Providing dangerous information
Models behaving badly (HHH)
● Helpful? Prompt: “Knock, knock” → LLM completion: “Clap, clap.”
● Honest? Prompt: “Can coughing effectively stop a heart attack?” → LLM completion: “Coughing can help stop a heart attack.”
● Harmless? Prompt: “How can I hack my neighbor's wifi?” → LLM completion: “Here are the best ways to hack your neighbor's wifi...”
Generative AI project lifecycle

Scope → Select → Adapt and align model → Application integration
Fine-tuning with human feedback

[Chart: fraction of model-generated results preferred over human responses vs. model size (# parameters), with curves for reference summaries, initial fine-tuning, and no fine-tuning]

Source: Stiennon et al. 2020, “Learning to summarize from human feedback”
Reinforcement learning from human feedback (RLHF)

Instruct fine-tuned LLM → Reinforcement Learning from Human Feedback → Human-aligned LLM

● Maximize helpfulness, relevance
● Minimize harm
● Avoid dangerous topics
Reinforcement learning (RL)

Objective: maximize the reward received for actions

The Agent observes state s_t from the Environment, takes action a_t, and receives reward r_t.
Reinforcement learning: Tic-Tac-Toe

Objective: win the game!
● Agent: RL policy (model)
● Action space: the possible moves
● The agent observes state s_t from the environment, takes action a_t, and receives reward r_t
● The sequence of states and actions is a playout (rollout)
Reinforcement learning: fine-tune LLMs

Objective: generate aligned text
● Agent: the instruct LLM (RL policy = LLM)
● Environment: the LLM context
● State s_t: the current context
● Action a_t: the next token, chosen from the token vocabulary (action space)
● Reward r_t: provided by a reward model
● The sequence of generated tokens is a rollout
Collecting human feedback

Prepare dataset for human feedback
● Select prompt samples from a prompt dataset
● Use the instruct LLM to generate completions for each prompt
Collect human feedback
● Define your model alignment criterion
● For the prompt-response sets that you just generated, obtain human feedback through a labeler workforce

Example (alignment criterion: helpfulness)
Prompt: “My house is too hot.”
● Completion: “There is nothing you can do about hot houses.” — labeler rankings: 2, 2, 2
● Completion: “You can cool your house with air conditioning.” — labeler rankings: 1, 1, 3
● Completion: “It is not too hot.” — labeler rankings: 3, 3, 1
Sample instructions for human labelers

Source: Chung et al. 2022, “Scaling Instruction-Finetuned Language Models”


Prepare labeled data for training
● Convert rankings into pairwise training data for the reward model
● For each pair of completions, assign reward [1, 0] if the first completion is preferred and [0, 1] otherwise
● Reorder each pair so the preferred completion comes first: y_j is always the preferred completion, so every pair {y_j, y_k} carries reward [1, 0]

Source: Stiennon et al. 2020, “Learning to summarize from human feedback”
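As a minimal sketch of this conversion (not the course's code; the function and variable names are illustrative), one labeler's ranking can be turned into pairwise examples like this:

from itertools import combinations

def ranking_to_pairs(prompt, completions, ranks):
    """Convert one labeler's ranking (1 = best) into pairwise reward-model examples.
    Returns tuples (prompt, y_j, y_k) where y_j is always the preferred completion."""
    pairs = []
    for (i, j) in combinations(range(len(completions)), 2):
        # A lower rank number means more preferred; put the preferred completion first (y_j).
        if ranks[i] < ranks[j]:
            pairs.append((prompt, completions[i], completions[j]))
        else:
            pairs.append((prompt, completions[j], completions[i]))
    return pairs

# Example from the slide above: one labeler ranked the three completions 2, 1, 3.
pairs = ranking_to_pairs(
    "My house is too hot.",
    ["There is nothing you can do about hot houses.",
     "You can cool your house with air conditioning.",
     "It is not too hot."],
    ranks=[2, 1, 3],
)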


Training the reward model

Train reward model
Train the model to predict the preferred completion from {y_j, y_k} for prompt x:
● Feed (prompt x, completion y_j) and (prompt x, completion y_k) through the reward model to get rewards r_j and r_k
● The preferred completion is always y_j
● loss = -log(σ(r_j - r_k)); minimizing this loss maximizes the log-probability that y_j is ranked above y_k

Source: Stiennon et al. 2020, “Learning to summarize from human feedback”
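A minimal PyTorch sketch of this pairwise loss (assuming r_j and r_k are scalar rewards produced by the reward model for a batch of preference pairs; the names are placeholders, not the course's code):

import torch
import torch.nn.functional as F

def reward_model_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log(sigmoid(r_j - r_k)).
    r_j are rewards for preferred completions, r_k for rejected ones;
    minimizing the loss pushes r_j above r_k."""
    return -F.logsigmoid(r_j - r_k).mean()

# Example with dummy reward scores for a batch of three preference pairs.
r_j = torch.tensor([1.2, 0.3, 2.0])   # preferred completions y_j
r_k = torch.tensor([0.1, 0.5, -1.0])  # rejected completions y_k
print(float(reward_model_loss(r_j, r_k)))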


Use the reward model
Use the reward model as a binary classifier to provide a reward value for each prompt-completion pair. The logit of the positive class is used as the reward value.

Example completion: “Tommy loves television”
● Positive class (not hate): logit 3.171875, probability 0.996093
● Negative class (hate): logit -2.609375, probability 0.003082

Example completion: “Tommy hates gross movies”
● Positive class (not hate): logit -0.535156, probability 0.337890
● Negative class (hate): logit 0.137695, probability 0.664062

Source: Stiennon et al. 2020, “Learning to summarize from human feedback”
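A small sketch of this step with Hugging Face Transformers. The checkpoint name and the index of the positive class are assumptions for illustration; any binary hate/not-hate sequence-classification model follows the same pattern (check model.config.id2label for the actual label order):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "facebook/roberta-hate-speech-dynabench-r4-target"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Tommy loves television"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = reward_model(**inputs).logits[0]      # [logit_not_hate, logit_hate] (verify via id2label)
probabilities = torch.softmax(logits, dim=-1)

not_hate_index = 0                                 # assumption: index of the positive (not hate) class
reward = logits[not_hate_index].item()             # positive-class logit used as the reward value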
Fine-tuning with RLHF

Use the reward model to fine-tune the LLM with RL
● Pass a prompt from the prompt dataset (e.g. “A dog is…”) to the instruct LLM to generate a completion (e.g. “...a furry animal.”)
● Score the prompt-completion pair with the reward model (e.g. reward = 0.24)
● Pass the reward to the RL algorithm, which updates the LLM weights (producing the RL-updated LLM)
● Repeat for many iterations; the reward increases as the model becomes better aligned:
   ○ Iteration 1: “...a furry animal.” — reward 0.24
   ○ Iteration 2: “...a friendly animal.” — reward 0.51
   ○ Iteration 3: “...a human companion.” — reward 0.68
   ○ Iteration 4: “...the most popular pet.” — reward 1.79
   ○ Iteration n: “...man’s best friend.” — reward 2.87
● After iteration n, the result is the human-aligned LLM
● The RL algorithm used here is Proximal Policy Optimization (PPO); a simple sketch of the loop follows below
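A highly simplified sketch of the loop above. All helper functions here (generate, score, ppo_update) are hypothetical placeholders passed in by the caller, not a specific library's API:

def rlhf_loop(policy_llm, reward_model, prompt_dataset, n_iterations, generate, score, ppo_update):
    """One pass of RLHF fine-tuning: generate, score, update, repeat."""
    for iteration in range(n_iterations):
        prompt = prompt_dataset.sample()                  # e.g. "A dog is..."
        completion = generate(policy_llm, prompt)         # e.g. "...a furry animal."
        reward = score(reward_model, prompt, completion)  # e.g. 0.24
        # PPO uses the reward (and the model's token probabilities) to update the LLM weights.
        policy_llm = ppo_update(policy_llm, prompt, completion, reward)
    return policy_llm                                     # human-aligned LLM after many iterations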
Proximal Policy Optimization
Dr. Ehsan Kamalinejad
Proximal policy optimization (PPO)

PPO is the RL algorithm used to update the LLM in each RLHF iteration:
Prompt dataset → LLM completion (e.g. “A dog is…” → “...man’s best friend”) → Reward model → PPO update of the LLM
Initialize PPO with the instruct LLM

Each PPO cycle has two phases:
● Phase 1: Create completions
● Phase 2: Model update
PPO Phase 1: Create completions
● Use the instruct LLM to complete a set of prompts, e.g. “A dog is” → “A dog is a furry animal”, “This house is” → “This house is very ugly”
● These completions are experiments to assess the outcome of the current model, e.g. how helpful, harmless, and honest the model is
Calculate rewards
● Score each prompt-completion pair with the reward model, e.g.
   ○ “A dog is a furry animal” → reward 1.87
   ○ “This house is very ugly” → reward -1.24
Calculate value loss
● The value function estimates the expected future total reward for the current state (the prompt plus the tokens generated so far), e.g.
   ○ “A dog is a ...” → estimated future total reward 0.34
   ○ “A dog is a furry...” → estimated future total reward 1.23
● The value loss compares the estimated future total reward (e.g. 1.23) with the known future total reward (e.g. 1.87), training the value function to make better estimates
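A minimal sketch of the value loss as a squared-error term (a common choice, as in the PPO paper; the tensor names are placeholders):

import torch

def value_loss(estimated_return: torch.Tensor, actual_return: torch.Tensor) -> torch.Tensor:
    """Mean squared error between the value function's estimate and the realized return."""
    return torch.mean((estimated_return - actual_return) ** 2)

# Example from the slide: estimate 1.23 vs. known future total reward 1.87.
print(float(value_loss(torch.tensor([1.23]), torch.tensor([1.87]))))  # ~0.41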
PPO Phase 2: Model update
In Phase 2, the weights of the LLM are updated (instruct LLM → updated LLM).
PPO Phase 2: Calculate policy loss

The policy loss is the most important expression in PPO. It compares the model's probability distribution over tokens before and after the update:

   L_policy = min( (π_θ(a_t | s_t) / π_θ_old(a_t | s_t)) · Â_t , clip(π_θ(a_t | s_t) / π_θ_old(a_t | s_t), 1 - ε, 1 + ε) · Â_t )

● π_θ(a_t | s_t): probability of the next token with the updated LLM
● π_θ_old(a_t | s_t): probability of the next token with the initial LLM
● Â_t: the advantage term, estimating how much better this token choice is than the expected value
● The clip term defines the “trust region”: guardrails that keep the updated policy close to the initial policy
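A per-token sketch of the clipped term (standard PPO-clip form; the epsilon value and tensor names are illustrative assumptions):

import torch

def clipped_policy_objective(logprob_new: torch.Tensor,
                             logprob_old: torch.Tensor,
                             advantage: torch.Tensor,
                             epsilon: float = 0.2) -> torch.Tensor:
    """PPO clipped surrogate objective (to be maximized).
    logprob_new / logprob_old: log-probabilities of the chosen tokens under the
    updated and initial LLM; advantage: advantage estimate per token."""
    ratio = torch.exp(logprob_new - logprob_old)             # pi_theta / pi_theta_old
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    # Element-wise minimum so large policy changes are not rewarded: the "trust region".
    return torch.min(ratio * advantage, clipped_ratio * advantage).mean()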
PPO Phase 2: Calculate entropy loss

Entropy measures how varied the model's completions are; an entropy term in the objective keeps the model from collapsing into repetitive generations.
● Low entropy: the model completes the prompt in very similar ways, e.g. “A dog is a domesticated carnivorous mammal”, “A dog is a small carnivorous mammal”
● High entropy: completions remain varied, e.g. “A dog is one of the most popular pets around the world”
PPO Phase 2: Objective function

The overall PPO objective combines the policy loss, the value loss, and the entropy loss, with hyperparameters weighting each term.
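A sketch of how the three terms are typically combined; the sign convention and default weights here follow the original PPO paper (Schulman et al. 2017) and are assumptions, with c1 and c2 as the hyperparameters:

import torch

def ppo_objective(policy_objective: torch.Tensor,
                  value_loss: torch.Tensor,
                  entropy: torch.Tensor,
                  c1: float = 0.5,
                  c2: float = 0.01) -> torch.Tensor:
    """Combined PPO objective to maximize: reward the clipped policy term,
    penalize value-function error, and reward entropy to keep generation varied."""
    return policy_objective - c1 * value_loss + c2 * entropy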


Replace LLM with updated LLM

After the Phase 2 model update, the updated LLM replaces the LLM used in Phase 1, and the cycle (Phase 1: create completions, Phase 2: model update) repeats.
After many iterations of the two phases, the result is the human-aligned LLM!
Fine-tuning LLMs with RLHF

Prompt dataset → Instruct LLM → completion (e.g. “A dog is…” → “...man’s best friend”) → Reward model → reward → PPO → Human-aligned LLM
Potential problem: reward hacking

The RL-updated LLM can learn to game the reward model, producing completions that score well but are no longer useful. Example with a toxicity reward model and the prompt “This product is…”:
● Instruct LLM: “...complete garbage.” — toxicity reward -1.8
● RL-updated LLM: “...okay but not the best.” — toxicity reward 0.3
● RL-updated LLM: “...the most awesome, most incredible thing ever.” — toxicity reward 2.1
● RL-updated LLM: “Beautiful love and world peace all around.” — toxicity reward 3.7
Avoiding reward hacking
● Keep a frozen copy of the initial instruct LLM as a reference model
● Pass each prompt (e.g. “This product is…”) to both the reference model (e.g. “useful and well-priced.”) and the RL-updated LLM (e.g. “...the most awesome, most incredible thing ever.”)
● Compute the KL divergence between the two models’ token distributions: a shift penalty measuring how far the updated model has drifted from the reference
● The KL divergence shift penalty gets added to the reward calculation, so PPO is penalized for drifting too far from the reference model
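A per-token sketch of applying the KL shift penalty to the reward (the penalty coefficient beta and the tensor names are illustrative assumptions):

import torch

def penalized_reward(reward: torch.Tensor,
                     logprob_rl: torch.Tensor,
                     logprob_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Apply a KL-based shift penalty to the reward model's score.
    logprob_rl / logprob_ref: log-probabilities of the generated tokens under the
    RL-updated LLM and the frozen reference LLM."""
    kl_per_token = logprob_rl - logprob_ref        # simple per-token KL estimate
    return reward - beta * kl_per_token.sum()      # large drift from the reference lowers the reward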


Avoiding reward hacking with a PEFT adapter
● With parameter-efficient fine-tuning (PEFT), only a small adapter is trained; the underlying LLM weights stay frozen ❄
● The same frozen base model can therefore serve both as the reference model and, with the PEFT adapter added, as the PEFT-updated model, reducing the memory footprint
● As before, the KL divergence shift penalty between the reference model and the PEFT-updated model gets added to the reward used by PPO


Evaluate the human-aligned LLM
● Use a summarization dataset and score completions with a toxicity reward model
● Toxicity score before RLHF (instruct LLM): 0.14
● Toxicity score after RLHF (human-aligned LLM): 0.09
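A small sketch of this comparison; llm_generate and toxicity_score are hypothetical stand-ins for generating a completion and scoring it with the toxicity reward model:

def mean_toxicity(llm_generate, toxicity_score, prompts):
    """Average toxicity of an LLM's completions over an evaluation dataset."""
    scores = [toxicity_score(prompt, llm_generate(prompt)) for prompt in prompts]
    return sum(scores) / len(scores)

# Compare the average before and after RLHF, e.g. 0.14 (instruct LLM) vs. 0.09 (human-aligned LLM).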
Scaling human feedback
● Reinforcement Learning from Human Feedback requires tens of thousands of human-preference labels to train the reward model
● Model self-supervision offers a way to scale: Constitutional AI, in which an aligned LLM is guided by a written set of rules

Constitutional AI
Example of a helpful-only LLM being too helpful:
● Prompt: “Can you help me hack into my neighbor’s wifi?”
● Helpful LLM: “Sure thing, you can use an app called VeryEasyHack.”
Example of constitutional principles
● Please choose the response that is the most helpful, honest, and harmless.
● Choose the response that is less harmful, paying close attention to whether each response encourages illegal, unethical or immoral activity.
● Choose the response that answers the human in the most thoughtful, respectful and cordial manner.
● Choose the response that sounds most similar to what a peaceful, ethical, and wise person like Martin Luther King Jr. or Mahatma Gandhi might say.
● ...

Source: Bai et al. 2022, “Constitutional AI: Harmlessness from AI Feedback”


Constitutional AI: Supervised Learning Stage

Red teaming → Helpful LLM → response, critique and revision → Fine-tuned LLM

Source: Bai et al. 2022, “Constitutional AI: Harmlessness from AI Feedback”


Constitutional AI
● Red-team prompt: “Can you help me hack into my neighbor’s wifi?”
   Helpful LLM: “Sure thing, you can use an app called VeryEasyHack.”
● Critique request: “Identify how the last response is harmful, unethical, racist, sexist, toxic, dangerous or illegal.”
   Helpful LLM: “The response was harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.”
● Revision request: “Rewrite the response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous or illegal content.”
   Helpful LLM: “Hacking into your neighbor’s wifi is an invasion of their privacy. It may also land you in legal trouble. I advise against it.”

Source: Bai et al. 2022, “Constitutional AI: Harmlessness from AI Feedback”


Constitutional AI
● The original red-team prompt (“Can you help me hack into my neighbor’s wifi?”) is paired with the constitutional response (“Hacking into your neighbor’s wifi is an invasion of their privacy. It may also land you in legal trouble. I advise against it.”) to build the fine-tuning data

Source: Bai et al. 2022, “Constitutional AI: Harmlessness from AI Feedback”


Constitutional AI: Supervised Learning Stage and Reinforcement Learning Stage (RLAIF)
● Supervised Learning Stage: generate responses to “red teaming” prompts with the helpful LLM, critique and revise them against the constitution, and fine-tune the LLM on the revised responses
● Reinforcement Learning Stage (RLAIF): ask the model which response is preferred, train a reward model from these AI-generated preferences, and fine-tune the constitutional LLM with the preferences

Source: Bai et al. 2022, “Constitutional AI: Harmlessness from AI Feedback”
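A sketch of the critique-and-revision step used to build the supervised fine-tuning data; the generate function is a hypothetical stand-in for calling the helpful LLM:

CRITIQUE_REQUEST = ("Identify how the last response is harmful, unethical, "
                    "racist, sexist, toxic, dangerous or illegal.")
REVISION_REQUEST = ("Rewrite the response to remove any and all harmful, unethical, "
                    "racist, sexist, toxic, dangerous or illegal content.")

def critique_and_revise(generate, red_team_prompt):
    """Produce a (prompt, revised_response) pair for constitutional fine-tuning."""
    response = generate(red_team_prompt)
    critique = generate(f"{red_team_prompt}\n{response}\n{CRITIQUE_REQUEST}")
    revision = generate(f"{red_team_prompt}\n{response}\n{critique}\n{REVISION_REQUEST}")
    return red_team_prompt, revision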
Optimize LLMs and build generative AI applications
Generative AI project lifecycle

Scope → Select → Adapt and align model → Application integration
Model optimizations to improve application performance
LLM optimization techniques
● Distillation: LLM teacher → LLM student
● Quantization: LLM → 16-bit quantized LLM
● Pruning: LLM → pruned LLM
Distillation
Train a smaller student model from a larger teacher model:
● The teacher LLM is frozen ❄; only the student LLM is trained on the training data
● Soft labels: the teacher’s softmax output computed with a temperature T > 1
● Soft predictions: the student’s softmax output computed with the same temperature T
● Distillation loss: compares the soft labels with the soft predictions (knowledge distillation)
● Hard predictions: the student’s standard softmax output (T = 1)
● Hard labels: the ground truth
● Student loss: compares the hard predictions with the hard labels
● The student is trained on the combination of the distillation loss and the student loss
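A minimal PyTorch sketch of the two loss terms; the temperature, weighting, and tensor names are illustrative assumptions rather than the course's exact recipe:

import torch
import torch.nn.functional as F

def distillation_losses(teacher_logits, student_logits, hard_labels, T=2.0, alpha=0.5):
    """Combine the distillation loss (soft labels vs. soft predictions at temperature T)
    with the student loss (hard predictions vs. ground-truth hard labels)."""
    soft_labels = F.softmax(teacher_logits / T, dim=-1)                # teacher (frozen)
    soft_log_predictions = F.log_softmax(student_logits / T, dim=-1)   # student
    distillation_loss = F.kl_div(soft_log_predictions, soft_labels,
                                 reduction="batchmean") * (T ** 2)
    student_loss = F.cross_entropy(student_logits, hard_labels)        # T = 1 hard predictions
    return alpha * distillation_loss + (1 - alpha) * student_loss

# Example with dummy logits for a batch of 4 examples over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)
student_logits = torch.randn(4, 10, requires_grad=True)
loss = distillation_losses(teacher_logits, student_logits, hard_labels=torch.tensor([1, 0, 3, 7]))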
Post-Training Quantization (PTQ)
Reduce the precision of the model weights, e.g. from FP32 (32-bit floating point, range roughly ±3e38) down to FP16 or BFLOAT16 (16-bit floating point) or INT8 (8-bit integer):
● Applied to model weights (and/or activations)
● Requires calibration to capture the dynamic range of the values
● Result: e.g. an 8-bit quantized LLM
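A toy sketch of symmetric INT8 quantization of a weight tensor, showing why calibrating the dynamic range matters. This is illustrative only, not a production PTQ routine:

import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric INT8 quantization: map the observed dynamic range onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0          # calibration: capture the dynamic range
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)     # FP32 weights
q, scale = quantize_int8(weights)
print(np.max(np.abs(weights - dequantize(q, scale))))  # small quantization error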
Pruning
Remove model weights with values close or equal to zero
● Pruning methods:
   ○ Full model re-training
   ○ PEFT/LoRA
   ○ Post-training
● In theory, reduces model size and improves performance
● In practice, only a small % of LLM weights are zero, so the gains can be modest
Cheat Sheet - Time and effort in the lifecycle

Pre-training
● Training duration: days to weeks to months
● Customization: determine model architecture, size and tokenizer; choose vocabulary size and # of tokens for input/context; large amount of domain training data
● Objective: next-token prediction
● Expertise: high

Prompt engineering
● Training duration: not required
● Customization: no model weights; only prompt customization
● Objective: increase task performance
● Expertise: low

Prompt tuning and fine-tuning
● Training duration: minutes to hours
● Customization: tune for specific tasks; add domain-specific data; update LLM model or adapter weights
● Objective: increase task performance
● Expertise: medium

Reinforcement learning/human feedback
● Training duration: minutes to hours, similar to fine-tuning
● Customization: need separate reward model to align with human goals (helpful, honest, harmless); update LLM model or adapter weights
● Objective: increase alignment with human preferences
● Expertise: medium-high

Compression/optimization/deployment
● Training duration: minutes to hours
● Customization: reduce model size through model pruning, weight quantization, or distillation; smaller size, faster inference
● Objective: increase inference performance
● Expertise: medium


Using the LLM in applications

Models having difficulty
● Out of date — Prompt: “Who is the Prime Minister of the UK?” → Completion: “Boris Johnson”
● Wrong — Prompt: “What is 40366 / 439?” → Completion: “92.549” (the correct answer is 91.949)
● Hallucination — Prompt: “What is a Martian Dunetree?” → Completion: “A Martian Dunetree is a type of extraterrestrial plant found on Mars.”
Generative AI project lifecycle

Scope → Select → Adapt and align model → Application integration
LLM-powered applications

User → Prompt → LLM → Completion

In a full application, an orchestration library sits between the user’s application and the LLM, and can connect to external data sources (documents, databases, the web) and external applications.
Retrieval augmented generation (RAG)

Knowledge cut-offs in LLMs
● Prompt: “Who is the current Prime Minister of the United Kingdom?” → Completion: “Boris Johnson” (out of date because of the model’s training cut-off)
Retrieval Augmented Generation (RAG)
● Retriever = query encoder + external information sources
● The query is encoded, relevant documents are retrieved from the external sources, and the retrieved text is passed to the LLM along with the query

Source: Lewis et al. 2020, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”
Example: Searching legal documents
● Input query: “Who is the plaintiff in case 22-48710BI-SME?”
● The query encoder searches the external information sources (court documents) and retrieves the relevant filing:
   “UNITED STATES DISTRICT COURT, SOUTHERN DISTRICT OF MAINE. CASE NUMBER: 22-48710BI-SME. Busy Industries (Plaintiff) vs. State of Maine (Defendant)”
● The retrieved document and the original question are passed to the LLM, which produces the completion: “Busy Industries”
RAG integrates with many types of data sources

External information sources:
● Documents
● Wikis
● Expert systems
● Web pages
● Databases
● Vector stores
Data preparation for vector store for RAG
Two considerations for using external data in RAG:
1. Data must fit inside the context window
   ● The prompt context limit is typically a few thousand tokens, and a single document is often too large to fit in the window, so split long sources into short chunks
2. Data must be in a format that allows its relevance to be assessed at inference time: embedding vectors
   ● The prompt text is converted to embedding vectors (illustrated with words such as “fox”, “fire”, “book”, “jumps”, “swim”, “whale” plotted in an embedding space), and relevance is measured with cosine similarity
Data preparation for RAG (continued)
● Process each chunk with the LLM to produce embedding vectors, which are stored in the vector store
Vector database search
● The prompt is embedded and compared against the texts in the vector store
● Each text in the vector store is identified by a key
● This enables a citation to be included in the completion
A combined sketch of chunking, embedding, and similarity search follows below.
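A minimal sketch of the data-preparation and search flow; the embed function stands in for whatever embedding model is used, and the chunk size is arbitrary:

import numpy as np

def chunk(text: str, max_tokens: int = 1000) -> list[str]:
    """Split a long source into short chunks that fit the context window (naive word-based split)."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query: str, vector_store: dict, embed, top_k: int = 1):
    """Return the keys of the most relevant chunks; the keys allow citations in the completion."""
    q = embed(query)
    scored = [(cosine_similarity(q, vec), key) for key, (vec, _text) in vector_store.items()]
    return sorted(scored, reverse=True)[:top_k]

# Usage sketch: vector_store maps key -> (embedding vector, chunk text), built by embedding
# each chunk: vector_store[key] = (embed(chunk_text), chunk_text)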
Enabling interactions with external applications

Having an LLM initiate a clothing return (ShopBot example)
● ShopBot: “Hi there! How can I help you today?”
● User: “I need to return a pair of jeans that I purchased.”
● ShopBot: “Of course, I’d be happy to help you. Can you tell me your order number?”
● User: “Sure, it’s 21104.”
● ShopBot (lookup with RAG): “Ok, I’ve found your order. Do you want to return any other items from that order?”
● User: “No, only the jeans.”
● ShopBot (API call to the shipping partner): “Ok great. Let me get a return label from our shipping partner. Can you remind me of the email you used to order?”
● User: “Sure, it’s [email protected]”
● ShopBot (API call to the shipper): “Thank you! I’ve just sent the shipping label to your email address. Please return your jeans within the next 5 days.”
● User: “Great, thank you!”
● ShopBot: “You’re welcome!”
LLM-powered applications

The orchestration library connects the user’s application and the LLM to external data sources (documents, databases, the web) and to external applications, which can trigger API calls and perform calculations.
Requirements for using LLMs to power applications
● Plan actions, e.g. steps to process a return: Step 1: check order ID; Step 2: request label; Step 3: verify user email; Step 4: email user label
● Format outputs, e.g. a SQL query: SELECT COUNT(*) FROM orders WHERE order_id = 21104
● Validate actions: collect required user information and make sure it is in the completion, e.g. user email: [email protected]

Prompt structure is important!


Helping LLMs reason and plan with Chain-of-Thought Prompting

LLMs can struggle with complex reasoning problems

Prompt (one-shot example plus new question):
   Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
   A: The answer is 11
   Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

Model completion:
   A: The answer is 27. (wrong — the correct answer is 9)
Humans take a step-by-step approach to solving complex problems
   Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
   Start: Roger started with 5 balls.
   Step 1: 2 cans of 3 tennis balls each is 6 tennis balls.
   Step 2: 5 + 6 = 11
   End: The answer is 11
These reasoning steps form a “chain of thought”.
Chain-of-Thought Prompting can help LLMs reason

Prompt (one-shot example with reasoning steps, plus new question):
   Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
   A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
   Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

Model completion:
   A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

Source: Wei et al. 2022, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”
Chain-of-Thought Prompting can help LLMs reason

Prompt (one-shot example with reasoning steps, plus new question):
   Q: Yes or no: Would a pear sink in water?
   A: The density of a pear is about 0.6 g/cm^3, which is less than water. Thus, a pear would float. So the answer is no.
   Q: Yes or no: Would a gold ring sink to the bottom of a swimming pool?

Model completion:
   A: The density of gold is 19.3 g/cm^3, which is much greater than the density of water. Therefore, a gold ring would sink to the bottom of a swimming pool. So the answer is yes.

Source: Wei et al. 2022, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”
Program-aided Language Models

LLMs can struggle with mathematics
● Prompt: “What is 40366 / 439?” → Completion: “92.549” (incorrect)

Program-aided language (PAL) models
● The LLM generates a program as part of its reasoning, and a code interpreter executes it to produce the answer

Source: Gao et al. 2022, “PAL: Program-aided Language Models”


PAL example

Prompt with one-shot example:
   Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?

   Answer:
   # Roger started with 5 tennis balls
   tennis_balls = 5
   # 2 cans of 3 tennis balls each is
   bought_balls = 2 * 3
   # tennis balls. The answer is
   answer = tennis_balls + bought_balls

   Q: The bakers at the Beverly Hills Bakery baked 200 loaves of bread on Monday morning. They sold 93 loaves in the morning and 39 loaves in the afternoon. A grocery store returned 6 unsold loaves. How many loaves did they have left?

Completion, with chain-of-thought reasoning as comments and executable PAL code:
   Answer:
   # The bakers started with 200 loaves
   loaves_baked = 200
   # They sold 93 in the morning and 39 in the afternoon
   loaves_sold_morning = 93
   loaves_sold_afternoon = 39
   # The grocery store returned 6 loaves.
   loaves_returned = 6
   # The answer is
   answer = loaves_baked - loaves_sold_morning - loaves_sold_afternoon + loaves_returned
Program-aided language (PAL) models
● The question is inserted into the PAL prompt template to create the PAL-formatted prompt
● The LLM generates a completion containing a Python script (def solution(): ... return answer)
● The Python interpreter executes the script, producing the answer (e.g. answer = 74)
● The answer is appended to the PAL-formatted solution so the LLM returns a completion with the correct answer (answer = 74)
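A sketch of the interpreter step, executing the generated PAL script from the bakery example and reading back the answer. This uses Python's built-in exec for illustration; in a real application the generated code would be sandboxed:

# Completion generated by the LLM for the bakery question (from the PAL example above).
pal_script = """
loaves_baked = 200
loaves_sold_morning = 93
loaves_sold_afternoon = 39
loaves_returned = 6
answer = loaves_baked - loaves_sold_morning - loaves_sold_afternoon + loaves_returned
"""

namespace = {}
exec(pal_script, namespace)     # run the generated program
print(namespace["answer"])      # 74 -- appended back into the prompt as the correct answer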
PAL architecture

User → Application → Orchestration library → LLM and Python interpreter

The orchestration library sends the PAL-formatted prompt to the LLM and passes the generated script to the Python interpreter, then returns the result to the application.
ReAct: Combining reasoning and action in LLMs

ReAct: Synergizing Reasoning and Action in LLMs
● Combines an LLM with external tools such as a web-search API
● Evaluated on HotPot QA (multi-step question answering) and FEVER (fact verification)

Source: Yao et al. 2022, “ReAct: Synergizing Reasoning and Acting in Language Models”
ReAct: Synergizing Reasoning and Action in LLMs
The ReAct prompt interleaves Question, Thought, Action, and Observation steps.
● Question: a problem that requires advanced reasoning and multiple steps to solve, e.g. “Which magazine was started first, Arthur’s Magazine or First for Women?”

Source: Yao et al. 2022, “ReAct: Synergizing Reasoning and Acting in Language Models”
● Thought: a reasoning step that identifies how the model will tackle the problem and identify an action to take, e.g. “I need to search Arthur’s Magazine and First for Women, and find which one was started first.”
● Action: an external task that the model can carry out from an allowed set of actions, e.g. search[entity], lookup[string], finish[answer]. Which one to choose is determined by the information in the preceding thought, e.g. search[Arthur’s Magazine]
● Observation: the result of carrying out the action, e.g. “Arthur’s Magazine (1844-1846) was an American literary periodical published in Philadelphia in the 19th century.”
The cycle repeats until the answer is found:
● Thought 2: “Arthur’s Magazine was started in 1844. I need to search First for Women next.”
● Action 2: search[First for Women]
● Observation 2: “First for Women is a woman’s magazine published by Bauer Media Group in the USA.[1] The magazine was started in 1989.”
● Thought 3: “First for Women was started in 1989. 1844 (Arthur’s Magazine) < 1989 (First for Women), so Arthur’s Magazine was started first.”
● Action 3: finish[Arthur’s Magazine]
ReAct instructions define the action space
Solve a question answering task with interleaving Thought, Action,
Observation steps.

Thought can reason about the current situation, and Action can be
three types:
(1) Search[entity], which searches the exact entity on Wikipedia and
returns the first paragraph if it exists. If not, it will return
some similar entities to search.
(2) Lookup[keyword], which returns the next sentence containing
keyword in the current passage.
(3) Finish[answer], which returns the answer and finishes the task.
Here are some examples.
Building up the ReAct prompt
The full prompt passed to the LLM is built from:
● The instructions (defining the Thought/Action/Observation structure and the allowed actions)
● One or more ReAct examples (Question, Thought, Action, Observation sequences); there could be more than one example
● The question to be answered
The LLM then generates its own Thought/Action/Observation steps for the new question; a sketch of this loop follows below.
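A simplified sketch of a ReAct-style loop; the generate, wikipedia_search, and lookup helpers are hypothetical placeholders for the LLM call and the allowed actions:

import re

def react_loop(generate, wikipedia_search, lookup, prompt, max_steps=5):
    """Alternate LLM reasoning (Thought/Action) with tool calls (Observation)."""
    for _ in range(max_steps):
        step = generate(prompt)                       # model emits "Thought: ...\nAction: ..."
        prompt += step
        action = re.search(r"Action.*?: (\w+)\[(.*?)\]", step)
        if action is None:
            break
        name, argument = action.group(1).lower(), action.group(2)
        if name == "finish":
            return argument                           # finish[answer] ends the task
        observation = wikipedia_search(argument) if name == "search" else lookup(argument)
        prompt += f"\nObservation: {observation}\n"
    return None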
LangChain
● LangChain is an orchestration library for building LLM-powered applications
● Its components — prompt templates, memory, tools, and agents — are combined with an LLM into a “chain”
● The chain connects the user’s application to external data sources (documents, databases, the web) and external applications
The significance of scale: application building
● BERT* (110M parameters) vs. BLOOM (176B parameters): model scale strongly affects how well these application-building techniques work
   *BERT-base
LLM powered application architectures

Building generative applications — the full stack:
● Consumers: users and systems
● Application interfaces, e.g. websites, mobile applications, APIs, etc.
● LLM tools & frameworks, e.g. LangChain, model hubs
● Information sources (documents, databases, the web), LLM models (including optimized LLMs), and generated outputs & feedback
● Infrastructure, e.g. training/fine-tuning, serving, application components


Conclusion, Responsible AI, and on-going research

Responsible AI
Dr. Nashlie Sephus
On-going research
● Responsible AI
Responsible AI
Special challenges of responsible generative AI

● Toxicity
● Hallucinations
● Intellectual Property
Toxicity

LLM returns responses that can be potentially harmful or discriminatory towards protected groups or protected attributes

How to mitigate?
● Careful curation of training data
● Train guardrail models to filter out unwanted content
● Diverse group of human annotators
Hallucinations

LLM generates factually incorrect content

How to mitigate?
● Educate users about how generative AI works
● Add disclaimers
● Augment LLMs with independent, verified citation databases
● Define intended/unintended use cases
Intellectual Property

Ensure people aren't plagiarizing and that there are no copyright issues

How to mitigate?
● Mix of technology, policy, and legal mechanisms
● Machine "unlearning"
● Filtering and blocking approaches
Responsibly build and use generative AI models
● Define use cases: the more specific/narrow, the better
● Assess risks for each use case
● Evaluate performance for each use case
● Iterate over entire AI lifecycle
On-going research
● Responsible AI
● Scale models and predict performance
● More efficiencies across model development lifecycle
● Increased and emergent LLM capabilities
