
Course 5: Reinforcement Learning



Summary

Last session:
1 Combinatorial game theory
2 Definition of a game
3 Proof of determined games

Today's session:
1 Reinforcement Learning
2 Value and Policy Functions
3 Q-Learning

Note: reinforcement learning and combinatorial game theory share a common mathematical framework, but to ease access to online resources, we will adopt a new vocabulary.



Outline of the course

1 Definitions of Reinforcement Learning (RL)


Fundamentals
Example: PyRat
Policy and values

2 Q-learning
Q-learning definitions
Example
Approximate Q-learning
Exploration/Exploitation



Agent and environment
Our objective is to train an agent to maximize its reward through
actions that affect an environment.

[Diagram: interaction loop in which the agent receives an observation and a reward from the environment and sends back an action.]



Fundamentals

Reward hypothesis
All goals can be described by the maximization of expected cumulative reward over time.

Specificities of reinforcement learning

No supervision, only a reward signal,
Delayed feedback, the reward can come (much) later,
Importance of the temporal dimension,
The agent's actions affect the subsequent data it receives.



Agent and environment
The agent maintains its own state representation $s_t^\alpha \in S^\alpha$, while the environment has a state representation $s_t^e \in S^e$. At each time step, the agent receives an observation $o_t \in O$ and a reward $r_t \in R$, and emits an action $a_t \in A$.



Definitions

The agent α...
1 analyzes previous actions, states, rewards and observations,
2 computes action $a_t$,
3 obtains reward $r_t$,
4 obtains an observation $o_{t+1}$,
5 deduces a new state $s_{t+1}^\alpha$.

The environment...
1 receives action $a_t$,
2 produces reward $r_t$,
3 deduces a new state $s_{t+1}^e$,
4 produces $o_{t+1}$.

Observability:
Perfect: $s_t^\alpha = s_t^e = o_t$.
Imperfect: no access to the full environment state:
the agent indirectly observes the environment through $o_t$,
$s_t^\alpha$ is estimated by the agent and may differ from $s_t^e$.

A minimal code sketch of this interaction loop is given below.

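To make the step sequence concrete, here is a minimal sketch of the interaction loop in Python. The `Environment` and `Agent` classes and their method names are illustrative placeholders, not part of any specific library.

```python
# Minimal agent-environment interaction loop (illustrative placeholders).

class Environment:
    def step(self, action):
        """Receive an action; return (reward, next observation)."""
        reward, observation = 0.0, None  # placeholder dynamics
        return reward, observation

class Agent:
    def act(self, observation):
        """Compute an action from the current observation/state."""
        return 0  # placeholder action

    def update(self, observation, action, reward, next_observation):
        """Deduce the new agent state from what was just experienced."""
        pass

env, agent = Environment(), Agent()
observation = None  # initial observation o_0
for t in range(100):
    action = agent.act(observation)               # agent computes a_t
    reward, next_observation = env.step(action)   # environment produces r_t and o_{t+1}
    agent.update(observation, action, reward, next_observation)
    observation = next_observation                # agent deduces its new state
```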


Example: PyRat

Definitions
The agent is either the Rat or the Python,
The opponent becomes part of the environment,
Note that the game can have perfect observability if the opponent's strategy is known,
Seen this way, the game becomes sequential.

RL-based PyRat versus supervised approach

Reward signal: number of pieces of cheese picked up,
Delayed feedback: several moves are required to reach a reward,
The character's moves affect the subsequent data it receives,
Importance of the temporal dimension.

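As an illustration, one simple way to turn the cheese count into a per-move reward signal (the binary form used in the observability examples on the next slide) is sketched below; `score` and `previous_score` are hypothetical variables, not actual PyRat API names.

```python
# Hypothetical per-move reward for PyRat-style training (not the official PyRat API).
def compute_reward(previous_score: int, score: int) -> int:
    """Return 1 if the agent just picked up a piece of cheese, 0 otherwise."""
    return 1 if score > previous_score else 0
```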


Example: PyRat

Observability examples
Perfect: $s_t^{rat} = s_t^e = o_t$
$a_t$: Last move of the rat,
$o_t$: The entire maze with all cheese locations and the python's position,
$r_t$: Binary variable which is 1 if the rat just got a piece of cheese.
Imperfect:
$a_t$: Last move of the rat,
$o_t$: Neighboring cells of the rat,
$r_t$: Binary variable which is 1 if the rat just got a piece of cheese.

To represent the strategy of the rat, we use a policy function.



Policy Function

Definition
The policy function of an agent α is:

$$\pi : \begin{cases} S^\alpha \to A \\ s_t^\alpha \mapsto a_t^\alpha \end{cases}$$

π can be deterministic or stochastic.

Playout
The playout $(s_t^{\alpha,\pi})_{t \in \mathbb{N}}$ associated with a policy π and an initial state $s_0$ is defined by considering that agent α takes its actions using π.

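As a sketch of the distinction, a deterministic policy is a plain mapping from states to actions, while a stochastic policy returns a distribution over actions; the states and actions below are illustrative, not tied to PyRat.

```python
import random

# Deterministic policy: a fixed mapping from states to actions (illustrative states/actions).
deterministic_policy = {"start": "up", "corridor": "left", "near_cheese": "up"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {"start": {"up": 0.7, "left": 0.3}}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("start"))   # always "up"
print(act_stochastic("start"))      # "up" about 70% of the time
```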


Value Function

Definition
Fix $\gamma \in [0, 1[$; the value function $v^\pi$ is defined as:

$$v^\pi : \begin{cases} S^\alpha \to \mathbb{R} \\ s_{t_0}^{\alpha,\pi} \mapsto \displaystyle\sum_{t=t_0}^{+\infty} \gamma^{t-t_0} r_t \end{cases}$$

The value of a policy function is thus an expectation of cumulative future rewards, discounted by the geometric coefficient γ to avoid divergence,
The best possible policy $\pi^*$ is defined by:

$$\forall s \in S^\alpha, \forall \pi, \quad v^{\pi^*}(s) \geq v^{\pi}(s).$$

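For instance, a quick worked example with an assumed discount $\gamma = 0.9$ and an assumed reward sequence $r_{t_0} = 0$, $r_{t_0+1} = 0$, $r_{t_0+2} = 1$ (the agent reaches a piece of cheese two steps from now):

$$v^\pi(s_{t_0}^{\alpha,\pi}) = \gamma^0 \cdot 0 + \gamma^1 \cdot 0 + \gamma^2 \cdot 1 = 0.9^2 = 0.81.$$

The later the reward arrives, the more it is discounted, which is also what keeps the infinite sum from diverging.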


Q-learning
Definition

In Q-learning, we aim to find $v^{\pi^*}$ through the function $Q$, defined as the solution to the recursive system of equations (Bellman equation):

$$\forall s \in S^\alpha, \forall a \in A, \quad Q(s, a) = r_{s,a} + \gamma \max_{a'} Q(s(a), a'),$$

where $r_{s,a}$ is the reward obtained when agent α performs action a in state s, and $s(a)$ is the state observed by agent α after performing action a. The optimal value is recovered as $v^{\pi^*}(s) = \max_a Q(s, a)$, and the optimal policy picks, in each state, the action that maximizes $Q(s, \cdot)$.

Pros and cons

Pros:
Can be learned even if the agent is not following any specific π,
Self-training is possible,
Cons:
Scalability issues when $S^\alpha$ is large.
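A minimal sketch of the usual iterative update that follows from this equation, which moves the estimate of $Q(s, a)$ toward the Bellman target; the table sizes and hyperparameters below are assumptions for illustration, not values from the course.

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed sizes of S^alpha and A
gamma, alpha = 0.9, 0.1              # discount factor and learning rate
Q = np.zeros((n_states, n_actions))  # tabular Q function

def q_learning_update(s, a, r, s_next):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: in state 3, action 1 gave reward 1.0 and led to state 4.
q_learning_update(3, 1, 1.0, 4)
```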
Q-learning example

[Diagram: a small example with states Research, Study, Work, Do nothing and a terminal Finish state. Edge rewards: R=10 into Finish, R=-3 between the intermediate states, and R=4 / R=-10 on the edges involving Do nothing. Successive slides back-propagate the values 10, 7, 4 and -6 from the goal, and the final slide labels the edges with Q=10, Q=7, Q=4 and Q=-6.]
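The numbers in the diagram are consistent with applying the Bellman equation backwards from the goal with γ = 1. The sketch below reproduces that backward computation on an assumed chain Research → Study → Work → Finish with a step reward of -3 and a final reward of +10 (my reading of the figure, not an authoritative reconstruction).

```python
# Backward induction on an assumed deterministic chain (gamma = 1).
# Research -> Study -> Work -> Finish, step reward -3, final transition reward +10.
rewards = {"Work": 10, "Study": -3, "Research": -3}
order = ["Work", "Study", "Research"]       # processed from the goal backwards

Q = {"Finish": 0}
next_state = "Finish"
for state in order:
    Q[state] = rewards[state] + Q[next_state]   # Q(s) = r + Q(next state)
    next_state = state

print(Q)  # {'Finish': 0, 'Work': 10, 'Study': 7, 'Research': 4}
```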


Approximate Q-learning
Definition
Train a model to approximate Q,
The input is a state s and the output is made of the values of $Q(s, \cdot)$,
Representation learning can be used to compress $S^\alpha$.

Problems
Almost always needs a simulator for the game,
Game duration can be a bottleneck for training,
Catastrophic forgetting and adversary specialization,
These effects can be alleviated by training using experience replay.

Experience replay
Instead of using only the last decision to train, sample at random from the m previous decisions,
Decisions taken earlier thus remain taken into account now.
A sketch of this training scheme is given below.
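A condensed sketch of approximate Q-learning with experience replay, using a linear model as in the lab; the feature extractor, buffer size, and hyperparameters are placeholders chosen here, not those of the actual TP4 code.

```python
import random
from collections import deque
import numpy as np

n_features, n_actions = 32, 4
buffer = deque(maxlen=10_000)             # experience replay memory of the m last decisions
W = np.zeros((n_actions, n_features))     # linear approximation: Q(s, .) = W @ features(s)
gamma, lr, batch_size = 0.9, 0.01, 64

def features(state):
    """Placeholder feature extractor mapping a state to a vector of length n_features."""
    return np.asarray(state, dtype=float)

def train_step():
    """Sample past decisions at random and regress Q toward their Bellman targets."""
    if len(buffer) < batch_size:
        return
    for s, a, r, s_next in random.sample(buffer, batch_size):
        phi, phi_next = features(s), features(s_next)
        target = r + gamma * np.max(W @ phi_next)
        error = target - (W @ phi)[a]
        W[a] += lr * error * phi          # gradient step on the squared error for action a

# During play: store each decision, then train on a random batch.
# buffer.append((state, action, reward, next_state)); train_step()
```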
Exploration/Exploitation

Dilemma
Repeat the existing strategy (exploitation)...
... or try a new strategy (exploration)?

Example
Always eating in restaurants that you already know is exploitation,
While that is a good heuristic, you have no way of knowing whether you are getting the maximum possible reward,
So exploring new restaurants from time to time may be needed to find the maximum reward.

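One standard way to balance the two with Q-learning (not stated on the slide, but a common choice) is an ε-greedy rule: exploit the current Q estimates most of the time, and explore a random action with probability ε.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # exploration: random action
    return int(np.argmax(q_values))              # exploitation: current best action
```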


Lab Session 5

TP4 - PyRat with reinforcement learning


Approximate Q-learning algorithm using experience replay and linear regression to beat the greedy algorithm,
The approximation method (linear regression) and the experience replay routine are given,
Assemble all the primitives to perform Reinforcement Learning.

Challenge
You can continue working on the challenge after finishing TP4. You can now integrate reinforcement learning into your solution.
