ChatGPT
from a technical perspective
Nguyen Phi Le
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT
What is ChatGPT?
◦ A conversational support system based on the Generative Pre-trained
Transformer (GPT) language model developed by OpenAI
◦ A sibling model to InstructGPT, which is trained to follow an instruction in
a prompt and provide a detailed response
◦ Trained on a large amount of data: Books, reports, articles, and websites, …
◦ Trained using Reinforcement Learning from Human Feedback (RLHF)
Some use cases
What is the significant difference between ChatGPT and its counterparts?
[Side-by-side comparison: the answer from ChatGPT vs. the answer from OpenAI's paper]
What is the significant difference between ChatGPT and its counterparts?
◦ Failure cases of ChatGPT caused by feedback from the user
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT
An intuitive example of RL
Interactive learning
[Figure: a sensor node interacting with a base station]
Definition of Reinforcement learning
◦ Reinforcement learning
◦ Learning what to do—how to map situations to actions—to achieve the goal
◦ Reward hypothesis
◦ That all of what we mean by goals and purposes can be well thought of as the
maximization of the expected value of the cumulative sum of a received scalar
signal (called reward) (Richard S. Sutton, RL: An introduction, 2018)
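To make the reward hypothesis concrete, here is a minimal sketch of computing a cumulative return from a sequence of scalar rewards (the discount factor gamma is an assumption added for illustration; the slide only mentions the cumulative sum):

```python
# Cumulative (optionally discounted) return: G = sum_k gamma^k * r_k
def cumulative_return(rewards, gamma=1.0):
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

print(cumulative_return([1.0, 0.0, 2.0]))        # undiscounted: 3.0
print(cumulative_return([1.0, 0.0, 2.0], 0.9))   # discounted: 1.0 + 0.0 + 0.81*2.0 = 2.62
```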
Goal and reward
◦ Reward: a scalar signal received from the environment at each step
◦ Goal: maximizing cumulative reward in the long run
◦ Examples (goal vs. reward):
◦ Goal = maximizing total gain; Reward = gain at every turn
◦ Goal = decreasing traffic jams; Reward = waiting time, queue length, lane speed, …
◦ Sensor node: Goal = increasing network lifetime; Reward = load balancing, route length, …
RL vs other learning techniques
◦ RL vs supervised learning
◦ Supervised learning: learning from a training set of labeled examples
◦ Know the true action to take
◦ Reinforcement learning: do not know the optimal action
◦ RL vs unsupervised learning
◦ Unsupervised learning: finding structure hidden in collections of unlabeled data
◦ Reinforcement learning: maximizing the reward signal
RL framework
◦ Policy: a mapping from states to actions
◦ may be stochastic, specifying probabilities for each action
◦ Reward signal: the goal of a reinforcement learning problem
◦ objective is to maximize the total reward the agent receives over the long run
◦ Value (of a state): the total amount of reward accumulated over the future, starting from that state
◦ Reward: immediate; Value: long-run
◦ Action: can be any decisions we want to learn how to make
◦ State: can be anything we can know that might be useful in making them
[Figure: the agent-environment loop: the policy maps the current state to an action; the environment returns a reward signal and a new state]
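A minimal, runnable sketch of this interaction loop, using a toy environment and policy invented purely for illustration (ToyEnv and the +1/-1 action space are assumptions, not part of the slides):

```python
import random

# Toy environment: the state is an integer; actions move it by +1 or -1;
# the agent receives reward 1.0 whenever the state reaches 0.
class ToyEnv:
    def __init__(self, start=3):
        self.state = start
    def step(self, action):
        self.state += action
        reward = 1.0 if self.state == 0 else 0.0
        return self.state, reward          # new state and reward signal

# Policy: a (possibly stochastic) mapping from states to actions.
def policy(state):
    return -1 if state > 0 else random.choice([-1, 1])

env = ToyEnv()
state, total_reward = env.state, 0.0
for t in range(10):                        # the agent-environment interaction loop
    action = policy(state)                 # policy: state -> action
    state, reward = env.step(action)       # environment: action -> new state, reward
    total_reward += reward                 # goal: maximize cumulative reward
print(total_reward)
```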
Agenda
◦ What is ChatGPT?
◦ Reinforcement Learning from Human Feedback
◦ Reinforcement learning (RL)
◦ Reinforcement learning from Human Feedback (RLHF)
◦ Training mechanism of ChatGPT
◦ Pros and Cons of ChatGPT
Reinforcement learning from Human Feedback (RLHF)
[Figure: the RL loop (policy, action, environment, state, new state)]
◦ Train a reward model that aims to provide reward values resembling human feedback
Training mechanism of ChatGPT
◦ The reward model is trained with a pairwise ranking loss:
$loss(\theta) = -E_{(x,\,y_w,\,y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$
◦ $\theta$: the parameters of the reward model
◦ $(y_w, y_l)$: a pair of responses to the prompt $x$, where $y_w$ is ranked higher than $y_l$ (according to the labeler)
◦ $r_\theta(x, y_w)$: the reward of $y_w$ determined by the reward model
◦ $r_\theta(x, y_l)$: the reward of $y_l$ determined by the reward model
◦ $\sigma$: the sigmoid function
→ By minimizing $loss(\theta)$, we encourage the reward model to give $y_w$ a higher reward value than $y_l$
→ making the rewards provided by the reward model similar to those given by the labeler
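A minimal sketch of this ranking loss (assuming PyTorch; the reward values below are made up and the function name is illustrative, not from the slides):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: pushes r_theta(x, y_w) above r_theta(x, y_l).

    reward_chosen:   r_theta(x, y_w) for the higher-ranked responses, shape (batch,)
    reward_rejected: r_theta(x, y_l) for the lower-ranked responses,  shape (batch,)
    """
    # -log(sigmoid(r_w - r_l)) equals softplus(r_l - r_w); average over the batch
    return F.softplus(reward_rejected - reward_chosen).mean()

# Toy usage with made-up reward values
r_w = torch.tensor([1.2, 0.3, 2.0])
r_l = torch.tensor([0.7, 0.5, 1.1])
print(reward_model_loss(r_w, r_l))   # scalar loss to minimize over theta
```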
Training mechanism of ChatGPT
◦ Illustration of the reward model training process
◦ In the policy optimization stage, the RL policy $\pi_\phi^{RL}$ is trained to maximize:
$objective(\phi) = E_{(x,y)\sim D_{\pi_\phi^{RL}}}\left[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{RL}(y|x)}{\pi^{SFT}(y|x)}\right] + \gamma\, E_{x\sim D_{pretrain}}\left[\log \pi_\phi^{RL}(x)\right]$
◦ $r_\theta(x, y)$: the reward of response $y$ against prompt $x$ → we want to maximize this term
◦ The KL term between $\pi_\phi^{RL}$ and $\pi^{SFT}$ → mitigates over-optimization of the reward → guarantees that the produced model is not too different from the original (SFT) model
◦ The pretraining term (weighted by $\gamma$) → maximizes the log-likelihood of the RL policy on the pretraining data → fixes the performance regressions on public NLP datasets
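A minimal sketch of how the three terms combine over a batch (assuming per-sequence log-probabilities and rewards are already available as tensors; the function name and the beta, gamma values are illustrative):

```python
import torch

def rlhf_objective(reward, logp_rl, logp_sft, logp_rl_pretrain, beta=0.02, gamma=0.5):
    """Combine the three terms of the RLHF objective (to be maximized).

    reward:           r_theta(x, y) for responses sampled from the RL policy, shape (batch,)
    logp_rl:          log pi_RL(y|x) for those responses,                     shape (batch,)
    logp_sft:         log pi_SFT(y|x) for those responses,                    shape (batch,)
    logp_rl_pretrain: log pi_RL(x) on samples from the pretraining data,      shape (batch,)
    """
    log_ratio = logp_rl - logp_sft                     # per-sample KL estimate vs. the SFT model
    rl_term = (reward - beta * log_ratio).mean()       # KL-regularized reward
    pretrain_term = gamma * logp_rl_pretrain.mean()    # pretraining log-likelihood term
    return rl_term + pretrain_term
```

In practice this quantity is maximized with PPO rather than by direct gradient ascent on this expression.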
Training mechanism of ChatGPT
◦ Illustration of the policy optimization process
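As a rough picture of one policy-update step, here is a runnable toy (the tensors stand in for model outputs; the simple policy-gradient surrogate below is an illustration, whereas the actual training uses PPO with clipping):

```python
import torch

beta = 0.02                                             # illustrative KL coefficient

# Pretend we sampled a batch of 4 responses from the current policy:
logp_rl  = torch.tensor([-5.0, -6.0, -4.5, -7.0], requires_grad=True)  # log pi_RL(y|x)
logp_sft = torch.tensor([-5.2, -5.8, -4.9, -6.5])                      # log pi_SFT(y|x)
reward   = torch.tensor([ 0.8,  0.1,  1.3, -0.2])                      # r_theta(x, y)

# KL-regularized reward, as in the objective above
adjusted_reward = reward - beta * (logp_rl - logp_sft)

# Policy-gradient surrogate: increase the log-probability of high-reward responses
loss = -(adjusted_reward.detach() * logp_rl).mean()
loss.backward()                                         # gradients flow into the policy
print(logp_rl.grad)
```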
Pros
◦ Natural and flexible conversation
Pros
◦ Providing common knowledge quickly with high accuracy
[Comparison: ChatGPT vs. Google search, where users have to find the answer by themselves]
Limitations (stated by OpenAI)
1. ChatGPT sometimes writes plausible-sounding but incorrect or nonsensical answers
Limitations (stated by OpenAI)
2. ChatGPT is sensitive to tweaks to the input phrasing or attempting the
same prompt multiple times
Limitations (stated by OpenAI)
4. While OpenAI made efforts to make the model refuse inappropriate
requests, it will sometimes respond to harmful instructions or exhibit
biased behavior
References
◦ https://openai.com/blog/chatgpt
◦ https://huggingface.co/blog/rlhf
◦ Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).
◦ Ouyang, Long, et al. "Training language models to follow instructions
with human feedback." Advances in Neural Information Processing
Systems 35 (2022): 27730-27744.
◦ Stiennon, Nisan, et al. "Learning to summarize with human
feedback." Advances in Neural Information Processing Systems 33
(2020): 3008-3021.