DeepLearning.AI makes these slides available for educational purposes. You may not use or
distribute these slides for commercial purposes. You may make copies of these slides and
use or distribute them for educational purposes as long as you cite DeepLearning.AI as the
source of the slides.
Generative AI project lifecycle
[Lifecycle diagram: Define the problem → Choose model → Adapt and align model (prompt engineering, fine-tuning, align with human feedback; evaluate) → Optimize model and deploy for inference → Augment model and build LLM-powered applications]
Goals of fine-tuning
● Better understanding of prompts
● Better at task completion
● More natural sounding language
Models behaving badly
● Toxic language
● Aggressive responses
● Providing dangerous information
HHH: is the model helpful, honest, and harmless?
Example: Prompt "Knock, knock." → LLM completion "Clap, clap." Helpful? No.
Fine-tuning with human feedback
[Chart: fraction of model-generated summaries preferred over human responses vs. model size (# parameters), comparing no fine-tuning, initial fine-tuning, and fine-tuning with human feedback against reference summaries]
Source: Stiennon et al. 2020, "Learning to summarize from human feedback"
Reinforcement learning from human feedback (RLHF)
● Maximize helpfulness,
relevance
● Minimize harm
● Avoid dangerous topics
Reinforcement learning (RL)
[Diagram: the Agent takes action a_t in the Environment; the Environment returns the next state s_t and reward r_t, and the loop repeats]
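Below is a minimal, hypothetical sketch of this agent-environment loop in Python. ToyEnv and RandomAgent are made-up stand-ins used only to show how state, action, and reward flow; they are not course code.

import random

class ToyEnv:
    """Toy environment: +1 reward for guessing a hidden number, else 0."""
    def reset(self):
        self._target = random.randint(0, 2)
        return {"legal_actions": [0, 1, 2]}          # initial state s_0

    def step(self, action):
        reward = 1.0 if action == self._target else 0.0   # reward r_t
        next_state = {"legal_actions": [0, 1, 2]}         # next state s_{t+1}
        return next_state, reward, True                   # one-step episode

class RandomAgent:
    def act(self, state):
        return random.choice(state["legal_actions"])      # action a_t from the policy

def rollout(env, agent, max_steps=10):
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

print(rollout(ToyEnv(), RandomAgent()))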
Reinforcement learning: Tic-Tac-Toe
[Example: objective is to win the game; the Agent's RL policy (a model) chooses moves in the Environment (the game board); a complete sequence of moves is a playout or rollout]
Reinforcement learning: fine-tune LLMs
[Mapping: the Agent is the instruct LLM, whose RL policy is the LLM itself; the Environment is the LLM context; the state s_t is the current context; the action a_t is the next token chosen from the vocabulary; the reward r_t comes from a reward model; a rollout is the generated token sequence. Objective: generate aligned text]
Collecting human feedback

Prepare dataset for human feedback
Sample prompts from a prompt dataset and use the instruct LLM to generate a set of completions for each prompt.
Collect human feedback
● Define your model alignment criterion
● For the prompt-completion sets that you just generated, obtain human feedback through a labeler workforce: labelers rank the completions for each prompt, and the rankings are converted into pairwise preference labels (e.g. [1, 0] for preferred vs. rejected)

Train the reward model
● For prompt x, the reward model assigns reward r_j to the preferred completion y_j and reward r_k to the rejected completion y_k (the preferred completion is always y_j)
● Train by minimizing the pairwise ranking loss: loss = -log(σ(r_j - r_k))
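A minimal sketch of this pairwise ranking loss, assuming PyTorch; the reward values below are toy numbers, and in practice r_j and r_k would come from the reward model's scores for the preferred and rejected completions.

import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_j: torch.Tensor, r_k: torch.Tensor) -> torch.Tensor:
    # Minimizing -log(sigmoid(r_j - r_k)) pushes the preferred reward above the rejected one.
    return -F.logsigmoid(r_j - r_k).mean()

r_j = torch.tensor([0.8, 1.2])   # rewards for preferred completions y_j
r_k = torch.tensor([0.1, 0.5])   # rewards for rejected completions y_k
print(pairwise_reward_loss(r_j, r_k))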
[Reward model output: logits; applying softmax to the logits gives probabilities, and the positive-class logit is used as the reward value]
Use the reward model to fine-tune the LLM with RL
A prompt from the dataset (e.g. "A dog is…") is passed to the instruct LLM, which generates a completion (e.g. "…a furry animal."). The reward model scores the prompt-completion pair (e.g. reward = 0.24), and the RL algorithm uses that reward to update the LLM's weights. The cycle repeats, and the reward should rise as the RL-updated LLM becomes better aligned:

RLHF iteration 1: "…a furry animal." → reward 0.24
RLHF iteration 2: "…a friendly animal." → reward 0.51
RLHF iteration 3: "…a human companion." → reward 0.68
RLHF iteration 4: "…the most popular pet." → reward 1.79
…
RLHF iteration n: "…man's best friend." → reward 2.87 → human-aligned LLM

The RL algorithm used here is Proximal Policy Optimization (PPO).
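A conceptual sketch of this loop in Python. Every component here (generate, reward_model, ppo_update) is a dummy placeholder standing in for the real models and RL update, shown only to make the data flow explicit; it is not a specific library's API.

def rlhf_loop(generate, reward_model, ppo_update, prompts, iterations=4):
    for _ in range(iterations):
        for prompt in prompts:
            completion = generate(prompt)              # LLM completes the prompt
            reward = reward_model(prompt, completion)  # reward model scores the pair
            ppo_update(prompt, completion, reward)     # RL algorithm updates the LLM weights

# Dummy usage showing the flow only:
rlhf_loop(
    generate=lambda p: p + " a furry animal.",
    reward_model=lambda p, c: 0.24,
    ppo_update=lambda p, c, r: None,
    prompts=["A dog is"],
)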
Proximal Policy
Optimization
Dr. Ehsan Kamalinejad
Proximal policy optimization (PPO)
[RLHF iteration n: prompt dataset → LLM completion ("A dog is…" → "…man's best friend"); the reward model scores the pair and PPO updates the LLM]
Initialize PPO with the instruct LLM
PPO alternates between two phases: Phase 1, create completions; Phase 2, model update.
PPO Phase 1: Create completions
The current instruct LLM completes a set of prompts, for example:
● "A dog is" → "A dog is a furry animal"
● "This house is" → "This house is very ugly"
These completions are experiments used to assess the outcome of the current model, e.g. how helpful, harmless, and honest the model is.
Calculate rewards
The reward model scores each prompt-completion pair:
● "A dog is a furry animal" → reward 1.87
● "This house is very ugly" → reward -1.24
Calculate value loss
The value function estimates the expected future total reward at each step of the completion:
● Prompt "A dog is", partial completion "A dog is a ..." → estimated future total reward 0.34
● Partial completion "A dog is a furry..." → estimated future total reward 1.23
The value loss compares each estimate with the known future total reward for the completion (e.g. estimate 1.23 vs. known future total reward 1.87), training the value function to make better estimates; the estimates are later used in Phase 2.
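A minimal sketch of the value loss, assuming PyTorch. The numbers are the toy values from the slides: the value function's estimates at two token positions versus the known future total reward.

import torch
import torch.nn.functional as F

estimated = torch.tensor([0.34, 1.23])   # value-function estimates of future total reward
observed  = torch.tensor([1.87, 1.87])   # known future total reward for the completion
value_loss = F.mse_loss(estimated, observed)
print(value_loss)   # squared error pushes the estimates toward the true return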
PPO Phase 2: Model update
The instruct LLM is updated to produce the updated LLM (Phase 1: create completions → Phase 2: model update).
PPO Phase 2: Calculate policy loss
Guardrails: the policy update is constrained so the new policy stays within the "trust region" around the current policy.

PPO Phase 2: Calculate entropy loss
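A sketch of the standard PPO clipped policy objective plus an entropy bonus, assuming PyTorch. The tensors below are toy values, not course data; the clipping of the probability ratio is what keeps the policy inside the trust region, and the entropy term keeps generation from collapsing into repetitive output.

import torch

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = torch.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    # Take the more pessimistic objective so the update stays in the trust region.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def entropy_bonus(probs):
    # Higher entropy means more diverse token choices.
    return -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()

logp_new = torch.tensor([-1.1, -0.7])
logp_old = torch.tensor([-1.0, -0.9])
advantages = torch.tensor([0.5, -0.2])
probs = torch.softmax(torch.randn(2, 5), dim=-1)
total_loss = ppo_policy_loss(logp_new, logp_old, advantages) - 0.01 * entropy_bonus(probs)
print(total_loss)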
Hyperparameters
The policy, value, and entropy losses are combined into the overall PPO objective with hyperparameter weights. After the model update, the updated LLM replaces the current model and the cycle repeats: Phase 1, create completions; Phase 2, model update.
After many iterations: a human-aligned LLM!
Fine-tuning LLMs with RLHF
[Summary: prompts from the dataset → instruct LLM completion ("A dog is…" → "…man's best friend") → reward model score → PPO update; after many iterations, a human-aligned LLM]
Potential problem: reward hacking
The RL-updated LLM can learn to game the reward model, producing completions that score well (e.g. low toxicity) but become exaggerated or unhelpful.
Prompt: "This product is…"
● Instruct LLM: "…complete garbage." → toxicity reward -1.8
● RL-updated LLM: "…okay but not the best." → toxicity reward 0.3
● RL-updated LLM: "…the most awesome, most incredible thing ever." → toxicity reward 2.1
● RL-updated LLM: "Beautiful love and world peace all around." → toxicity reward 3.7
Avoiding reward hacking
Keep a frozen copy (❄) of the initial instruct LLM as a reference model. For each prompt (e.g. "This product is…"), generate a completion from both the frozen reference model ("…useful and well-priced.") and the RL-updated LLM ("…the most awesome, most incredible thing ever."). Calculate the KL divergence between the two models' token distributions and add it to the reward model's score as a shift penalty, so the RL-updated model is penalized for drifting too far from the reference model.
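A sketch of the KL divergence shift penalty, assuming PyTorch. The logits are toy tensors and beta is an assumed penalty weight; the penalty lowers the reward when the RL-updated model's token distribution drifts away from the frozen reference model's.

import torch
import torch.nn.functional as F

def penalized_reward(reward, logits_rl, logits_ref, beta=0.1):
    logp_rl = F.log_softmax(logits_rl, dim=-1)
    logp_ref = F.log_softmax(logits_ref, dim=-1)
    # KL(pi_rl || pi_ref), averaged over the generated token positions.
    kl = (logp_rl.exp() * (logp_rl - logp_ref)).sum(dim=-1).mean()
    return reward - beta * kl    # larger drift from the reference model lowers the reward

logits_rl = torch.randn(4, 10)   # toy logits: 4 token positions, vocabulary of 10
logits_ref = torch.randn(4, 10)
print(penalized_reward(reward=2.1, logits_rl=logits_rl, logits_ref=logits_ref))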
[Diagram: a human-aligned LLM paired with a written set of rules]
Constitutional AI
Choose the response that answers the human in the most thoughtful,
respectful and cordial manner.
Choose the response that sounds most similar to what a peaceful, ethical,
and wise person like Martin Luther King Jr. or Mahatma Gandhi might say.
...
Supervised learning stage
Red-teaming prompts are sent to the helpful LLM; the model generates a response, then critiques and revises it according to the constitutional principles; the revised responses are used to fine-tune the LLM.

Reinforcement learning stage (RLAIF)
Generate responses to the red-teaming prompts with the fine-tuned LLM and ask the model which response is preferred under the constitution; use these AI-generated preferences to train a reward model; fine-tune the LLM with the reward model to produce the constitutional LLM.

Source: Bai et al. 2022, "Constitutional AI: Harmlessness from AI Feedback"
Optimize LLMs and
build generative AI
applications
Generative AI project lifecycle
[Lifecycle diagram: Define the problem → Choose model → Adapt and align model (prompt engineering, fine-tuning, align with human feedback; evaluate) → Optimize model and deploy for inference → Augment model and build LLM-powered applications]
Model optimizations to
improve application performance
LLM optimization techniques
● Distillation: train a smaller student LLM from a larger teacher LLM
● Quantization: reduce the precision of model weights (e.g. a 16-bit quantized LLM)
● Pruning: remove weights that contribute little, producing a pruned LLM
Distillation
Train a smaller student model from a larger teacher model.
● Freeze the teacher LLM (❄) and pass the training data through both the teacher and the student.
● Apply softmax with a temperature T > 1 to the teacher's logits to produce soft labels, and to the student's logits to produce soft predictions.
● Distillation loss (knowledge distillation): compares the teacher's soft labels with the student's soft predictions.
● Student loss: compares the student's hard predictions (softmax with T = 1) with the ground-truth hard labels.
● Train the student on the combination of the distillation loss and the student loss.
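A minimal sketch of the combined distillation objective, assuming PyTorch. The teacher and student logits are toy tensors; T is the softmax temperature and alpha is an assumed weighting between the two losses.

import torch
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, hard_labels, T=2.0, alpha=0.5):
    # Distillation loss: student's soft predictions vs. teacher's soft labels at temperature T.
    soft_labels = F.softmax(teacher_logits / T, dim=-1)
    soft_preds = F.log_softmax(student_logits / T, dim=-1)
    distill_loss = F.kl_div(soft_preds, soft_labels, reduction="batchmean") * (T * T)
    # Student loss: student's hard predictions (T = 1) vs. ground-truth hard labels.
    student_loss = F.cross_entropy(student_logits, hard_labels)
    return alpha * distill_loss + (1 - alpha) * student_loss

student_logits = torch.randn(3, 8)       # 3 examples, 8 classes/tokens
teacher_logits = torch.randn(3, 8)
hard_labels = torch.tensor([1, 4, 7])
print(distillation_objective(student_logits, teacher_logits, hard_labels))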
Post-Training Quantization (PTQ)
Reduce the precision of model weights, e.g. from FP16 / BFLOAT16 (16-bit floating point) to INT8 (8-bit integer) in an 8-bit quantized LLM.
● Requires calibration to capture the dynamic range of the weights
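A minimal sketch of symmetric post-training quantization of one weight tensor to INT8, using NumPy. Calibration here is simply the observed maximum absolute value; real PTQ schemes are more involved.

import numpy as np

def quantize_int8(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0                      # calibrate to the dynamic range
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                        # approximate original weights

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print(np.abs(w - dequantize(q, scale)).max())                  # small quantization error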
Pruning
Remove model weights with values close or equal to zero.
● Pruning methods
○ Full model re-training
○ PEFT/LoRA
○ Post-training
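A minimal sketch of magnitude pruning with NumPy: weights whose absolute value falls below a threshold are set to exactly zero. The threshold is an assumed illustration value.

import numpy as np

def magnitude_prune(weights: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    mask = np.abs(weights) >= threshold        # keep only weights far enough from zero
    return weights * mask

w = np.random.randn(4, 4) * 0.05
pruned = magnitude_prune(w)
print(f"sparsity: {float(np.mean(pruned == 0)):.0%}")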
Cheat sheet: time and effort across lifecycle techniques

Pre-training
● Training duration: days to weeks to months
● Customization: determine model architecture, size and tokenizer; choose vocabulary size and # of tokens for input/context; requires a large amount of domain training data
● Objective: next-token prediction

Prompt engineering
● Training duration: not required
● Customization: no model weights; only prompt customization
● Objective: increase task performance

Prompt tuning and fine-tuning
● Training duration: minutes to hours
● Customization: tune for specific tasks; add domain-specific data; update LLM model or adapter weights
● Objective: increase task performance

RLHF
● Training duration: minutes to hours (similar to fine-tuning)
● Customization: needs a separate reward model to align with human goals (helpful, honest, harmless); update LLM model or adapter weights
● Objective: increase alignment with human preferences

Compression / optimization
● Training duration: minutes to hours
● Customization: reduce model size through model pruning, weight quantization, or distillation; smaller size, faster inference
● Objective: increase inference performance
LLM-powered applications
[Architecture: User → Application → Orchestration library → LLM; the orchestration library also connects to external data sources (documents, databases, the web) and to external applications]

Retrieval augmented generation (RAG)
Knowledge cut-offs in LLMs
The model only knows what was in its training data. Example: the model's completion names "Boris Johnson", an answer that is out of date because of the knowledge cut-off.
Retrieval Augmented Generation (RAG)
[Architecture: the user query is passed through a query encoder; a retriever searches external information sources for relevant text; the retrieved text is combined with the query and passed to the LLM]
Source: Lewis et al. 2020, "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"
Example: Searching legal documents
Input query: "Who is the plaintiff in case 22-48710BI-SME?"
Retrieved document: UNITED STATES DISTRICT COURT, SOUTHERN DISTRICT OF MAINE, CASE NUMBER: 22-48710BI-SME, Busy Industries (Plaintiff) vs. State of Maine (Defendant)
The prompt context limit is a few thousand tokens, and a single document may be too large to fit in the window, so split long sources into short chunks.
Data preparation for RAG
Two considerations for using external data in RAG:
1. Data must fit inside the context window
2. Data must be in a format that allows its relevance to be assessed at inference time: embedding vectors (each text chunk, e.g. "fire" or "whale", is encoded as a vector of components X1…Xn), compared using cosine similarity
Vector database search
● Each text in the vector store is identified by a key
● This enables a citation to be included in the completion
[Diagram: the prompt embedding is matched against the stored text embeddings (Text 1 … Text 6)]
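A minimal sketch of vector-store retrieval with cosine similarity, using NumPy. The embedding function is a hypothetical toy stand-in for a real embedding model, and the chunk texts and keys are made up for illustration.

import numpy as np

def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    # Deterministic toy embedding; a real system would call an embedding model here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Each chunk in the vector store is identified by a key, which enables citations.
chunks = ["case 22-48710BI-SME ...", "lease agreement ...", "patent filing ..."]
store = {f"doc-{i}": toy_embed(chunk) for i, chunk in enumerate(chunks)}

def retrieve(query: str, k: int = 2):
    q = toy_embed(query)
    scores = {key: float(q @ vec) for key, vec in store.items()}   # cosine similarity (unit vectors)
    return sorted(scores, key=scores.get, reverse=True)[:k]        # top-k chunk keys

print(retrieve("Who is the plaintiff in case 22-48710BI-SME?"))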
Enabling interactions with external applications

Having an LLM initiate a clothing return
[Example conversation: ShopBot, an LLM-powered shopping assistant, walks a customer through a clothing return; the exchange ends with "You're welcome!"]
Without intermediate reasoning steps, the model can get a multi-step problem wrong, e.g. completing "A: The answer is 27." ❌
Humans take a step-by-step approach to solving complex problems
Problem: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
Start: Roger started with 5 balls.
Step 1: 2 cans of 3 tennis balls each is 6 tennis balls.
Step 2: 5 + 6 = 11.
End: The answer is 11.
The reasoning steps form a "chain of thought".
Chain-of-Thought Prompting can help LLMs reason
Source: Wei et al. 2022, “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”
Program-aided Language Models (PAL)

LLMs can struggle with mathematics
An LLM that reasons purely in text can make arithmetic mistakes. With PAL, the LLM writes its reasoning as executable Python code (reasoning in comments, calculations in code), and a code interpreter runs the code to compute the answer.

Example PAL-formatted solution generated by the LLM:

def solution():
    # Roger started with 5 tennis balls
    tennis_balls = 5
    # 2 cans of 3 tennis balls each is
    bought_balls = 2 * 3
    # tennis balls. The answer is
    answer = tennis_balls + bought_balls
    return answer

The Python interpreter executes the PAL-formatted solution, and the computed result is appended to produce a completion with the correct answer (e.g. Answer = 74 in the slides' worked example).
PAL architecture
[User → Application → Orchestration library → LLM; the orchestration library sends the PAL-formatted solution to a Python interpreter and returns the computed answer]
ReAct: Combining reasoning and action in LLMs

ReAct: Synergizing Reasoning and Action in LLMs
[Diagram: an LLM paired with a web-search API]
Source: Yao et al. 2022, "ReAct: Synergizing Reasoning and Acting in Language Models"
The prompt interleaves Thought, Action, and Observation steps.
Example question: "Which magazine was started first, Arthur's Magazine or First for Women?"
Action: one of three types
● search[entity]
● lookup[string]
● finish[answer]
Which one to choose is determined by the information in the preceding thought.
Action 1: search[Arthur's Magazine]
Observation 1: "Arthur's Magazine (1844-1846) was an American literary periodical published in Philadelphia in the 19th century."
Thought 2: "Arthur's Magazine was started in 1844. I need to search First for Women next."
Action 2: search[First for Women]
Observation 2: "First for Women is a woman's magazine published by Bauer Media Group in the USA.[1] The magazine was started in 1989."
Thought 3: "First for Women was started in 1989. 1844 (Arthur's Magazine) < 1989 (First for Women), so Arthur's Magazine was started first."
Action 3: finish[Arthur's Magazine]
ReAct instructions define the action space
Solve a question answering task with interleaving Thought, Action,
Observation steps.
Thought can reason about the current situation, and Action can be
three types:
(1) Search[entity], which searches the exact entity on Wikipedia and
returns the first paragraph if it exists. If not, it will return
some similar entities to search.
(2) Lookup[keyword], which returns the next sentence containing
keyword in the current passage.
(3) Finish[answer], which returns the answer and finishes the task.
Here are some examples.
Building up the ReAct prompt
The full prompt consists of the ReAct instructions, followed by one or more complete Thought/Action/Observation examples, followed by the question to be answered.
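A conceptual sketch of a ReAct-style loop in Python. The fake_llm and the search tool are dummy stand-ins invented for illustration; a real orchestrator would call an actual LLM and real tools, but the parse-act-observe cycle is the same.

def react_loop(llm, tools, question, max_steps=5):
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(prompt)                        # e.g. "Thought: ...\nAction: search[...]"
        prompt += step + "\n"
        action = step.splitlines()[-1].removeprefix("Action: ")
        name, arg = action.rstrip("]").split("[", 1)
        if name == "finish":
            return arg                            # final answer
        prompt += f"Observation: {tools[name](arg)}\n"
    return None

def fake_llm(prompt):
    if "Observation:" not in prompt:
        return "Thought: I need to look up Arthur's Magazine.\nAction: search[Arthur's Magazine]"
    return "Thought: Arthur's Magazine was started in 1844.\nAction: finish[Arthur's Magazine]"

tools = {"search": lambda entity: f"{entity} (1844-1846) was an American literary periodical."}
print(react_loop(fake_llm, tools, "Which magazine was started first?"))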
LangChain
LangChain is an orchestration library that sits between the user's application and the LLM. It provides prompt templates, memory, tools, and agents, and connects to external data sources (documents, databases, the web) and external applications.
The significance of scale: application building
[Model sizes: BERT-base, 110M parameters; BLOOM, 176B parameters]
LLM-powered application architectures

Building generative applications
[Architecture diagram: one or more optimized LLMs served to the application, connected to information sources such as documents, databases, and the web]
Risks and challenges of responsible generative AI
● Toxicity
● Hallucinations
● Intellectual property
Toxicity
How to mitigate?
● Careful curation of training data
● Train guardrail models to filter out unwanted content
● Diverse group of human annotators
Hallucinations
How to mitigate?
● Educate users about how generative AI works
● Add disclaimers
● Augment LLMs with independent, verified citation databases
● Define intended/unintended use cases
Intellectual Property
Ensure people aren't plagiarizing and that there aren't any copyright issues
How to mitigate?
● Mix of technology, policy, and legal mechanisms
● Machine "unlearning"
● Filtering and blocking approaches
Responsibly build and use generative AI models
● Define use cases: the more specific/narrow, the better
● Assess risks for each use case
● Evaluate performance for each use case
● Iterate over entire AI lifecycle
On-going research
● Responsible AI
● Scale models and predict performance
● More efficiencies across model development lifecycle
● Increased and emergent LLM capabilities