ChatGPT
from a technical perspective
Nguyen Phi Le
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT
What is ChatGPT?
◦ A conversational support system based on the Generative Pre-trained
Transformer (GPT) language model developed by OpenAI
◦ A sibling model to InstructGPT, which is trained to follow an instruction in
a prompt and provide a detailed response
◦ Trained on a large amount of data: Books, reports, articles, and websites, …
◦ Trained using Reinforcement Learning from Human Feedback (RLHF)
Some use cases
What is the significant difference between ChatGPT and its counterparts?
[Side-by-side comparison: the answer from ChatGPT vs. the answer from OpenAI's paper]
What is the significant difference between ChatGPT and its counterparts?
◦ Failure cases of ChatGPT caused by feedback from the user
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT
An intuitive example of RL
Interactive learning
[Figure: a sensor node interacting with a base station]
Definition of Reinforcement learning
◦ Reinforcement learning
◦ Learning what to do—how to map situations to actions—to achieve the goal
◦ Reward hypothesis
◦ That all of what we mean by goals and purposes can be well thought of as the
maximization of the expected value of the cumulative sum of a received scalar
signal (called reward) (Richard S. Sutton, RL: An introduction, 2018)
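To make the reward hypothesis concrete, here is a minimal sketch of computing a cumulative return from a sequence of scalar rewards (the discount factor gamma is an assumption added for illustration; the slide only mentions the cumulative sum):

```python
# Cumulative (optionally discounted) return: G = sum_k gamma^k * r_k
def cumulative_return(rewards, gamma=1.0):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

print(cumulative_return([1.0, 0.0, 2.0]))        # undiscounted: 3.0
print(cumulative_return([1.0, 0.0, 2.0], 0.9))   # discounted: 1.0 + 0.0 + 0.81*2.0 = 2.62
```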
Goal and reward
◦ Reward: a scalar signal received from the environment at each step
◦ Goal: maximizing cumulative reward in the long run
◦ Examples (goal vs. reward):
◦ Goal = maximizing total gain; Reward = gain at every turn
◦ Goal = decreasing traffic jams; Reward = waiting time, queue length, lane speed, …
◦ Sensor node: Goal = increasing network lifetime; Reward = load balancing, route length, …
RL vs other learning techniques
◦ RL vs supervised learning
◦ Supervised learning: learning from a training set of labeled examples
◦ Know the true action to take
◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
◦ Reinforcement learning: maximizing the reward signal
RL framework
◦ Policy: a mapping from states to actions
◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
◦ objective is to maximize the total reward the agent receives over the long run
◦ Value (of a state): the total amount of reward accumulated over the future, starting from that state
◦ Reward: immediate; Value: long-run
◦ Action: can be any decisions we want to learn how to make
◦ State: can be anything we can know that might be useful in making them
[Figure: the agent-environment loop: the policy maps the current state to an action; the environment returns a reward signal and a new state]
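A minimal, runnable sketch of this interaction loop, using a toy environment and policy invented purely for illustration (ToyEnv and the +1/-1 action space are assumptions, not part of the slides):

```python
import random

# Toy environment: the state is an integer; actions move it by +1 or -1;
# the agent receives reward 1.0 whenever the state reaches 0.
class ToyEnv:
    def __init__(self, start=3):
        self.state = start
    def step(self, action):
        self.state += action
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward          # new state and reward signal

# Policy: a (possibly stochastic) mapping from states to actions.
def policy(state):
    return -1 if state > 0 else random.choice([-1, 1])

env = ToyEnv()
state, total_reward = env.state, 0.0
for t in range(10):                        # the agent-environment interaction loop
    action = policy(state)                 # policy: state -> action
    state, reward = env.step(action)       # environment: action -> new state, reward
    total_reward += reward                 # goal: maximize cumulative reward
print(total_reward)
```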
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT
Reinforcement learning from Human Feedback (RLHF)
[Figure: the RL loop (policy, action, environment, state, new state)]
◦ Train a reward model that aims to provide reward values resembling human feedback
Training mechanism of ChatGPT
◦ The reward model is trained with a pairwise ranking loss:
$loss(\theta) = -E_{(x,\,y_w,\,y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$
◦ $\theta$: the parameters of the reward model
◦ $(y_w, y_l)$: a pair of responses to the prompt $x$, where $y_w$ is ranked higher than $y_l$ (according to the labeler)
◦ $r_\theta(x, y_w)$: the reward of $y_w$ determined by the reward model
◦ $r_\theta(x, y_l)$: the reward of $y_l$ determined by the reward model
◦ $\sigma$: the sigmoid function
→ By minimizing $loss(\theta)$, we encourage the reward model to give $y_w$ a higher reward value than $y_l$
→ making the rewards provided by the reward model similar to those given by the labeler
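A minimal sketch of this ranking loss (assuming PyTorch; the reward values below are made up and the function name is illustrative, not from the slides):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: pushes r_theta(x, y_w) above r_theta(x, y_l).

    reward_chosen:   r_theta(x, y_w) for the higher-ranked responses, shape (batch,)
    reward_rejected: r_theta(x, y_l) for the lower-ranked responses,  shape (batch,)
    """
    # -log(sigmoid(r_w - r_l)) equals softplus(r_l - r_w); average over the batch
    return F.softplus(reward_rejected - reward_chosen).mean()

# Toy usage with made-up reward values
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.7, 0.5, 1.1])
print(reward_model_loss(r_w, r_l))   # scalar loss to minimize over theta
```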
Training mechanism of ChatGPT
◦ Illustration of the reward model training process
◦ In the policy optimization stage, the RL policy $\pi_\phi^{RL}$ is trained to maximize:
$objective(\phi) = E_{(x,y)\sim D_{\pi_\phi^{RL}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right] + \gamma\, E_{x\sim D_{pretrain}}\left[\log \pi_\phi^{RL}(x)\right]$
◦ $r_\theta(x, y)$: the reward of response $y$ against prompt $x$ → we want to maximize this term
◦ The KL term between $\pi_\phi^{RL}$ and $\pi^{SFT}$ → mitigates over-optimization of the reward → guarantees that the produced model is not too different from the original (SFT) model
◦ The pretraining term (weighted by $\gamma$) → maximizes the log-likelihood of the RL policy on the pretraining data → fixes the performance regressions on public NLP datasets
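A minimal sketch of how the three terms combine over a batch (assuming per-sequence log-probabilities and rewards are already available as tensors; the function name and the beta, gamma values are illustrative):

```python
import torch

def rlhf_objective(reward, logp_rl, logp_sft, logp_rl_pretrain, beta=0.02, gamma=0.5):
    """Combine the three terms of the RLHF objective (to be maximized).

    reward:           r_theta(x, y) for responses sampled from the RL policy, shape (batch,)
    logp_rl:          log pi_RL(y|x) for those responses,                     shape (batch,)
    logp_sft:         log pi_SFT(y|x) for those responses,                    shape (batch,)
    logp_rl_pretrain: log pi_RL(x) on samples from the pretraining data,      shape (batch,)
    """
    log_ratio = logp_rl - logp_sft                     # per-sample KL estimate vs. the SFT model
    rl_term = (reward - beta * log_ratio).mean()       # KL-regularized reward
    pretrain_term = gamma * logp_rl_pretrain.mean()    # pretraining log-likelihood term
    return rl_term + pretrain_term
```

In practice this quantity is maximized with PPO rather than by direct gradient ascent on this expression.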
Training mechanism of ChatGPT
◦ Illustration of the policy optimization process
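As a rough picture of one policy-update step, here is a runnable toy (the tensors stand in for model outputs; the simple policy-gradient surrogate below is an illustration, whereas the actual training uses PPO with clipping):

```python
import torch

beta = 0.02                                             # illustrative KL coefficient

# Pretend we sampled a batch of 4 responses from the current policy:
logp_rl  = torch.tensor([-5.0, -6.0, -4.5, -7.0], requires_grad=True)  # log pi_RL(y|x)
logp_sft = torch.tensor([-5.2, -5.8, -4.9, -6.5])                      # log pi_SFT(y|x)
reward   = torch.tensor([ 0.8,  0.1,  1.3, -0.2])                      # r_theta(x, y)

# KL-regularized reward, as in the objective above
adjusted_reward = reward - beta * (logp_rl - logp_sft)

# Policy-gradient surrogate: increase the log-probability of high-reward responses
loss = -(adjusted_reward.detach() * logp_rl).mean()
loss.backward()                                         # gradients flow into the policy
print(logp_rl.grad)
```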
Pros
◦ Natural and flexible conversation
Pros
◦ Providing common knowledge quickly with high accuracy
[Comparison: ChatGPT vs. Google search, where users have to find the answer by themselves]
Limitations (stated by OpenAI)
1. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers
Limitations (stated by OpenAI)
2. ChatGPT is sensitive to tweaks to the input phrasing or attempting the
same prompt multiple times
Limitations (stated by OpenAI)
4. While OpenAI made efforts to make the model refuse inappropriate
requests, it will sometimes respond to harmful instructions or exhibit
biased behavior
References
◦ https://openai.com/blog/chatgpt
◦ https://huggingface.co/blog/rlhf
◦ Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).
◦ Ouyang, Long, et al. "Training language models to follow instructions
with human feedback." Advances in Neural Information Processing
Systems 35 (2022): 27730-27744.
◦ Stiennon, Nisan, et al. "Learning to summarize with human
feedback." Advances in Neural Information Processing Systems 33
(2020): 3008-3021.