
March, 2023

ChatGPT
from a technical perspective
Nguyen Phi Le
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT

2
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT

3
What is ChatGPT?
◦ A conversational support system based on the Generative Pre-trained
Transformer (GPT) language model developed by OpenAI
◦ A sibling model to InstructGPT, which is trained to follow an instruction in
a prompt and provide a detailed response
◦ Trained on a large amount of data: Books, reports, articles, and websites, …
◦ Trained using Reinforcement Learning from Human Feedback (RLHF)

(Diagram: GPT-3.5 + Reinforcement Learning from Human Feedback (RLHF) → ChatGPT)

4
Some use cases
◦ Information retrieval (screenshot annotations):
  1. this link is not correct
  2. this link is also not correct
  3. when we claim the answer is wrong, ChatGPT provides another answer from the pool
◦ Bug fixing
5
What is the significant difference between
ChatGPT and counterparts?
(Screenshots: the answer from ChatGPT vs. the answer from OpenAI's paper)

ChatGPT tries to follow the user's intention
→ pros: makes the conversation more natural
BUT
→ cons: may give a wrong answer when steered by the user's feedback

6
What is the significant difference between
ChatGPT and counterparts?
◦ Examples of wrong answers from ChatGPT caused by the user's feedback

7
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT

8
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT

9
An intuitive example of RL

- The chicken is performing a simple reinforcement learning strategy
- It pecks at the pink piece of paper because every time it does, it is fed

10
Interactive learning
(Examples: Multi-Armed Bandit, traffic light control, wireless sensor networks with sensor nodes and a base station)

✓ The learner is not told which actions to take → trial and error
✓ Actions may also affect future situations → delayed reward

11
Definition of Reinforcement learning
◦ Reinforcement learning
◦ Learning what to do—how to map situations to actions—to achieve the goal
◦ Reward hypothesis
◦ That all of what we mean by goals and purposes can be well thought of as the
maximization of the expected value of the cumulative sum of a received scalar
signal (called reward) (Richard S. Sutton, RL: An introduction, 2018)

12
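Written as a formula (a standard formulation from Sutton & Barto; the discount factor γ is an assumption, not stated on the slide), the quantity being maximized is the expected return:

G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \qquad 0 \le \gamma \le 1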
Goal and reward
◦ Reward: a scalar signal received from the environment at each step
◦ Goal: maximizing cumulative reward in the long run
◦ Multi-Armed Bandit: Goal = maximizing total gain; Reward = gain at every turn
◦ Traffic light control: Goal = decreasing traffic jams; Reward = waiting time, queue length, lane speed, …
◦ Wireless sensor networks (sensor nodes and a base station): Goal = increasing network lifetime; Reward = load balancing, route length, …

13
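To make the bandit example concrete, here is a minimal sketch in Python of an epsilon-greedy bandit agent; the arm probabilities and the exploration rate are illustrative choices, not from the slides.

import random

# A minimal epsilon-greedy agent for a toy 3-armed bandit.
true_means = [0.2, 0.5, 0.8]      # success probability of each arm (unknown to the agent)
estimates = [0.0, 0.0, 0.0]       # estimated value of each arm
counts = [0, 0, 0]                # how many times each arm has been pulled
epsilon = 0.1                     # exploration rate
total_gain = 0.0

for t in range(1000):
    if random.random() < epsilon:                          # explore: trial and error
        arm = random.randrange(3)
    else:                                                   # exploit: best-looking arm so far
        arm = max(range(3), key=lambda a: estimates[a])
    reward = 1.0 if random.random() < true_means[arm] else 0.0   # gain at this turn
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]    # incremental mean update
    total_gain += reward                                    # goal: maximize the total gain

print(total_gain, estimates)      # estimates should approach the true means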
RL vs other learning techniques
◦ RL vs supervised learning
◦ Supervised learning: learning from a training set of labeled examples
◦ Know the true action to take
◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
◦ Reinforcement learning: maximizing the reward signal

14
RL framework
◦ Policy: a mapping from states to actions
◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
◦ objective is to maximize the total reward the agent receives over the long run
◦ Value: the total amount of reward accumulated over the future, starting from that state
◦ Reward: immediate; Value: long-run
◦ Action: can be any decision we want to learn how to make
◦ State: can be anything we can know that might be useful in making those decisions

(Diagram: the Agent's Policy maps a State to an Action; the Environment returns a Reward signal and a New state)
15
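A minimal sketch in Python of the agent-environment loop in the framework above; the Environment and Policy classes, their methods, and the toy dynamics are hypothetical placeholders, not from the slides.

import random

class Environment:
    """Toy environment: returns a new state and a reward signal for each action."""
    def reset(self):
        return 0                                   # initial state
    def step(self, state, action):
        new_state = state + action                 # toy transition
        reward = 1.0 if action == 1 else 0.0       # toy reward signal
        return new_state, reward

class Policy:
    """Toy policy: maps a state to an action (may be stochastic)."""
    def act(self, state):
        return random.choice([0, 1])
    def update(self, state, action, reward):
        pass                                       # a real learner would update here

env, policy = Environment(), Policy()
state = env.reset()
total_reward = 0.0
for t in range(100):                               # interaction loop
    action = policy.act(state)                     # policy: state -> action
    new_state, reward = env.step(state, action)    # environment returns reward + new state
    policy.update(state, action, reward)           # learn from the reward signal
    total_reward += reward                         # goal: maximize cumulative reward
    state = new_state
print(total_reward)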
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT

16
Reinforcement learning from Human Feedback (RLHF)
(Diagram: the Policy takes an Action given a State; the Environment returns a New state and a Reward signal, which reflects the goodness of the action)

How to calculate the reward?

◦ Conventional RL usually determines the reward using a well-defined reward function
◦ Problem: for some tasks (e.g., language models), it is difficult to define a reward function
◦ Reinforcement learning from Human Feedback:
  ◦ determine the reward based on feedback obtained from humans
  ◦ train a reward model that aims to provide reward values resembling human feedback
17
Reinforcement learning from Human Feedback (RLHF)

Ø Humans rank the goodness of actions
Ø The feedback from humans is used to train a reward model
  Ø training objective: the reward signals resemble the feedback provided by humans

(Schematic illustration of Reinforcement Learning from Human Feedback; image credit: OpenAI)


18
Training mechanism of ChatGPT
• Training objective: use reinforcement learning from human feedback to fine-tune GPT-3 to follow a broad class of written instructions

1. Data collection and baseline training
• use a team of contractors to label the data
• collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts
• use this data to fine-tune GPT-3 and produce supervised learning baselines

2. Train a reward model
• collect a dataset of human-labeled comparisons between outputs from baseline models on a larger set of API prompts
• train a reward model (RM) on this dataset to predict which model output our labelers would prefer

3. Optimize the policy using the trained reward model
• use the RM as a reward function and fine-tune the supervised learning baseline to maximize this reward using the PPO algorithm

19
Training mechanism of ChatGPT

(Figure: baseline training, then fine-tuning the baseline using RLHF; image credit: OpenAI)
20
Training mechanism of ChatGPT
◦ Reward model training
◦ Reward model training
◦ For a prompt x, labelers are given K responses to rank (4 ≤ K ≤ 9)
◦ producing \binom{K}{2} comparisons of responses (e.g., K = 9 gives 36 pairs) → these comparisons are collected to train the reward model
◦ loss function for training the reward model:

loss(\theta) = -\frac{1}{\binom{K}{2}} \, E_{(x, y_w, y_l)\sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]

◦ θ: the parameters of the reward model; (y_w, y_l): a pair of responses to the prompt x, where y_w is ranked higher than y_l (according to the labeler)
◦ r_θ(x, y_w): reward of y_w determined by the reward model
◦ r_θ(x, y_l): reward of y_l determined by the reward model

→ by minimizing loss(θ), we encourage the reward model to give y_w a higher reward value than y_l
→ making the rewards provided by the reward model similar to those evaluated by the labeler
21
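A minimal sketch in PyTorch of the pairwise ranking loss above, for one batch of comparisons; the reward values are toy numbers, and averaging over the batch plays the role of the 1/\binom{K}{2} factor (illustrative, not the OpenAI implementation).

import torch
import torch.nn.functional as F

# Toy rewards r_theta(x, y) produced by a reward model for ranked response pairs:
# index i holds the rewards of the preferred response y_w and the rejected response y_l.
reward_preferred = torch.tensor([1.3, 0.2, -0.5])   # r_theta(x, y_w)
reward_rejected = torch.tensor([0.4, 0.1, -1.2])    # r_theta(x, y_l)

# loss(theta) = -E[ log sigmoid( r_theta(x, y_w) - r_theta(x, y_l) ) ]
loss = -F.logsigmoid(reward_preferred - reward_rejected).mean()
print(loss)   # minimizing this pushes r_theta(x, y_w) above r_theta(x, y_l)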
Training mechanism of ChatGPT
◦ Illustration of the reward model training process

Image credit: https://huggingface.co/blog/rlhf


22
Training mechanism of ChatGPT
◦ Policy optimization
◦ Using the reinforcement learning paradigm to fine-tune the supervised learning baseline
◦ Agent: The language model (which will produce the response)
◦ Environment: the conversation
◦ Policy: The strategy to generate the response
◦ Action: Response generation
◦ Reward: A scalar determined by the reward model which reflects the goodness of the response
◦ The objective function used to update the policy
◦ (x, y): a pair of prompt and response
◦ π_φ^RL: the policy being learned; π^SFT: the supervised fine-tuned model; D_pretrain: the pretraining distribution

objective(\phi) = E_{(x,y)\sim D_{\pi_\phi^{RL}}}\left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)} \right] + \gamma \, E_{x\sim D_{pretrain}}\left[ \log \pi_\phi^{RL}(x) \right]

→ r_θ(x, y): the reward of response y against prompt x → we want to maximize this term
→ the KL divergence between π_φ^RL and π^SFT mitigates over-optimization of the reward → it guarantees that the learned model is not too different from the original (supervised) model
→ the pretraining term maximizes the likelihood of samples from the pretraining distribution → it fixes the performance regressions on public NLP datasets
23
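A per-sample sketch in PyTorch of the objective above; all log-probabilities, the reward, and the coefficients beta and gamma are toy values chosen for illustration (in practice they come from the RL policy, the SFT model, the reward model, and tuning).

import torch

beta, gamma = 0.02, 1.0      # KL and pretraining coefficients (illustrative values)

# Toy per-token log-probabilities of one sampled response y under the two models.
logprob_rl = torch.tensor([-1.2, -0.7, -2.1])    # log pi_RL(y_t | x, y_<t)
logprob_sft = torch.tensor([-1.5, -0.9, -1.8])   # log pi_SFT(y_t | x, y_<t)
reward = torch.tensor(0.9)                       # r_theta(x, y) from the reward model

# KL penalty: beta * log( pi_RL(y|x) / pi_SFT(y|x) ), summed over tokens
kl_penalty = beta * (logprob_rl.sum() - logprob_sft.sum())

# Pretraining term: gamma * log pi_RL(x) for a sample x from the pretraining distribution
logprob_pretrain = torch.tensor(-42.0)           # toy value
pretrain_term = gamma * logprob_pretrain

objective = reward - kl_penalty + pretrain_term  # maximized during PPO fine-tuning
print(objective)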
Training mechanism of ChatGPT
◦ Illustration of the policy optimization process

Image credit: https://huggingface.co/blog/rlhf


24
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT

25
Pros
◦ Natural and flexible conversation

26
Pros
◦ Providing common knowledge quickly with high accuracy

(Screenshot: ChatGPT vs Google. With ChatGPT, users can quickly grasp what they want to know; with Google, users have to find the answer by themselves)
27
Limitations (stated by OpenAI)
1. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers

(Screenshot: a plausible-sounding but incorrect answer)


28
Limitations (stated by OpenAI)
1. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers
◦ Fixing this issue is challenging, as:
◦ During RL training, there’s currently no source of truth
◦ Training the model to be more cautious causes it to decline questions that it can
answer correctly
◦ Supervised training misleads the model because the ideal answer depends on
what the model knows, rather than what the human demonstrator knows.

29
Limitations (stated by OpenAI)
2. ChatGPT is sensitive to tweaks to the input phrasing or attempting the
same prompt multiple times

(Screenshot annotations)
1. this link is not correct
2. this link is also not correct
3. when we claim the answer is wrong, ChatGPT provides another answer from the pool

30
Limitations (stated by OpenAI)
2. ChatGPT is sensitive to tweaks to the input phrasing or attempting the
same prompt multiple times

When the user tweaks the input, the responses of ChatGPT get confused


31
Limitations (stated by OpenAI)
3. Ideally, the model would ask clarifying questions when the user
provided an ambiguous query. Instead, our current models usually
guess what the user intended

32
Limitations (stated by OpenAI)
4. While OpenAI made efforts to make the model refuse inappropriate
requests, it will sometimes respond to harmful instructions or exhibit
biased behavior

33
References
◦ https://openai.com/blog/chatgpt
◦ https://huggingface.co/blog/rlhf
◦ Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).
◦ Ouyang, Long, et al. "Training language models to follow instructions with human feedback." Advances in Neural Information Processing Systems 35 (2022): 27730-27744.
◦ Stiennon, Nisan, et al. "Learning to summarize with human feedback." Advances in Neural Information Processing Systems 33 (2020): 3008-3021.

34
35
