
Course 5: Reinforcement Learning



Summary

Last session:
1 Combinatorial game theory
2 Definition of a game
3 Proof of determined games

Today's session:
1 Reinforcement Learning
2 Value and Policy Functions
3 Q-Learning

Note: reinforcement learning and combinatorial game theory share a common mathematical framework, but to ease access to online resources, we will adopt a new vocabulary.



Outline of the course

1 Definitions of Reinforcement Learning (RL)


Fundamentals
Example: PyRat
Policy and values

2 Q-learning
Q-learning definitions
Example
Approximate Q-learning
Exploration/Exploitation



Agent and environment
Our objective is to train an agent to maximize its reward through
actions that affect an environment.

[Diagram: interaction loop in which the agent receives an observation and a reward from the environment and sends back an action.]



Fundamentals

Reward hypothesis
All goals can be described by the maximization of expected cumulative reward over time.

Specificities of reinforcement learning

No supervision, only a reward signal,
Delayed feedback, the reward can come (much) later,
Importance of the temporal dimension,
The agent's actions affect the subsequent data it receives.



Agent and environment
The agent maintains its own state representation $s_t^\alpha \in S^\alpha$, while the environment has a state representation $s_t^e \in S^e$. At each time step, the agent receives an observation $o_t \in O$ and a reward $r_t \in R$, and emits an action $a_t \in A$.



Definitions

The agent α...
1 analyzes previous actions, states, rewards and observations,
2 computes action $a_t$,
3 obtains reward $r_t$,
4 obtains an observation $o_{t+1}$,
5 deduces a new state $s_{t+1}^\alpha$.

The environment...
1 receives action $a_t$,
2 produces reward $r_t$,
3 deduces a new state $s_{t+1}^e$,
4 produces $o_{t+1}$.

Observability:
Perfect: $s_t^\alpha = s_t^e = o_t$.
Imperfect: no access to the full environment state:
the agent indirectly observes the environment through $o_t$,
$s_t^\alpha$ is estimated by the agent and may differ from $s_t^e$.

A minimal code sketch of this interaction loop is given below.

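To make the step sequence concrete, here is a minimal sketch of the interaction loop in Python. The `Environment` and `Agent` classes and their method names are illustrative placeholders, not part of any specific library.

```python
# Minimal agent-environment interaction loop (illustrative placeholders).

class Environment:
    def step(self, action):
        """Receive an action; return (reward, next observation)."""
        reward, observation = 0.0, None  # placeholder dynamics
        return reward, observation

class Agent:
    def act(self, observation):
        """Compute an action from the current observation/state."""
        return 0  # placeholder action

    def update(self, observation, action, reward, next_observation):
        """Deduce the new agent state from what was just experienced."""
        pass

env, agent = Environment(), Agent()
observation = None  # initial observation o_0
for t in range(100):
    action = agent.act(observation)               # agent computes a_t
    reward, next_observation = env.step(action)   # environment produces r_t and o_{t+1}
    agent.update(observation, action, reward, next_observation)
    observation = next_observation                # agent deduces its new state
```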


Example: PyRat

Definitions
The agent is either the Rat or the Python,
The opponent becomes part of the environment,
Note that the game can have perfect observability if the opponent's strategy is known,
Seen this way, the game becomes sequential.

RL-based PyRat versus supervised approach

Reward signal: number of pieces of cheese picked up,
Delayed feedback: several moves are required to reach a reward,
The character's moves affect the subsequent data it receives,
Importance of the temporal dimension.

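As an illustration, one simple way to turn the cheese count into a per-move reward signal (the binary form used in the observability examples on the next slide) is sketched below; `score` and `previous_score` are hypothetical variables, not actual PyRat API names.

```python
# Hypothetical per-move reward for PyRat-style training (not the official PyRat API).
def compute_reward(previous_score: int, score: int) -> int:
    """Return 1 if the agent just picked up a piece of cheese, 0 otherwise."""
    return 1 if score > previous_score else 0
```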


Example: PyRat

Observability examples
Perfect: $s_t^{rat} = s_t^e = o_t$
$a_t$: Last move of the rat,
$o_t$: The entire maze with all cheese locations and the python's position,
$r_t$: Binary variable which is 1 if the rat just got a piece of cheese.
Imperfect:
$a_t$: Last move of the rat,
$o_t$: Neighboring cells of the rat,
$r_t$: Binary variable which is 1 if the rat just got a piece of cheese.

To represent the strategy of the rat, we use a policy function.



Policy Function

Definition
The policy function of an agent α is:

$$\pi : \begin{cases} S^\alpha \to A \\ s_t^\alpha \mapsto a_t^\alpha \end{cases}$$

π can be deterministic or stochastic.

Playout
The playout $(s_t^{\alpha,\pi})_{t \in \mathbb{N}}$ associated with a policy π and an initial state $s_0$ is defined by considering that agent α takes its actions using π.

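As a sketch of the distinction, a deterministic policy is a plain mapping from states to actions, while a stochastic policy returns a distribution over actions; the states and actions below are illustrative, not tied to PyRat.

```python
import random

# Deterministic policy: a fixed mapping from states to actions (illustrative states/actions).
deterministic_policy = {"start": "up", "corridor": "left", "near_cheese": "up"}

def act_deterministic(state):
    return deterministic_policy[state]

# Stochastic policy: a probability distribution over actions for each state.
stochastic_policy = {"start": {"up": 0.7, "left": 0.3}}

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("start"))   # always "up"
print(act_stochastic("start"))      # "up" about 70% of the time
```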


Value Function

Definition
Fix $\gamma \in [0, 1[$; the value function $v^\pi$ is defined as:

$$v^\pi : \begin{cases} S^\alpha \to \mathbb{R} \\ s_{t_0}^{\alpha,\pi} \mapsto \displaystyle\sum_{t=t_0}^{+\infty} \gamma^{t-t_0} r_t \end{cases}$$

The value of a policy function is thus an expectation of cumulative future rewards, discounted by the geometric coefficient γ to avoid divergence,
The best possible policy $\pi^*$ is defined by:

$$\forall s \in S^\alpha, \forall \pi, \quad v^{\pi^*}(s) \geq v^{\pi}(s).$$

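For instance, a quick worked example with an assumed discount $\gamma = 0.9$ and an assumed reward sequence $r_{t_0} = 0$, $r_{t_0+1} = 0$, $r_{t_0+2} = 1$ (the agent reaches a piece of cheese two steps from now):

$$v^\pi(s_{t_0}^{\alpha,\pi}) = \gamma^0 \cdot 0 + \gamma^1 \cdot 0 + \gamma^2 \cdot 1 = 0.9^2 = 0.81.$$

The later the reward arrives, the more it is discounted, which is also what keeps the infinite sum from diverging.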


Q-learning
Definition

In Q-learning, we aim to find $v^{\pi^*}$ through the function $Q$, defined as the solution to the recursive system of equations (Bellman equation):

$$\forall s \in S^\alpha, \forall a \in A, \quad Q(s, a) = r_{s,a} + \gamma \max_{a'} Q(s(a), a'),$$

where $r_{s,a}$ is the reward obtained when agent α performs action a in state s, and $s(a)$ is the state observed by agent α after performing action a. The optimal value is recovered as $v^{\pi^*}(s) = \max_a Q(s, a)$, and the optimal policy picks, in each state, the action that maximizes $Q(s, \cdot)$.

Pros and cons

Pros:
Can be learned even if the agent is not following any specific π,
Self-training is possible,
Cons:
Scalability issues when $S^\alpha$ is large.
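A minimal sketch of the usual iterative update that follows from this equation, which moves the estimate of $Q(s, a)$ toward the Bellman target; the table sizes and hyperparameters below are assumptions for illustration, not values from the course.

```python
import numpy as np

n_states, n_actions = 10, 4          # assumed sizes of S^alpha and A
gamma, alpha = 0.9, 0.1              # discount factor and learning rate
Q = np.zeros((n_states, n_actions))  # tabular Q function

def q_learning_update(s, a, r, s_next):
    """Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Example transition: in state 3, action 1 gave reward 1.0 and led to state 4.
q_learning_update(3, 1, 1.0, 4)
```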
Q-learning example

[Diagram: a small example with states Research, Study, Work, Do nothing and a terminal Finish state. Edge rewards: R=10 into Finish, R=-3 between the intermediate states, and R=4 / R=-10 on the edges involving Do nothing. Successive slides back-propagate the values 10, 7, 4 and -6 from the goal, and the final slide labels the edges with Q=10, Q=7, Q=4 and Q=-6.]
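The numbers in the diagram are consistent with applying the Bellman equation backwards from the goal with γ = 1. The sketch below reproduces that backward computation on an assumed chain Research → Study → Work → Finish with a step reward of -3 and a final reward of +10 (my reading of the figure, not an authoritative reconstruction).

```python
# Backward induction on an assumed deterministic chain (gamma = 1).
# Research -> Study -> Work -> Finish, step reward -3, final transition reward +10.
rewards = {"Work": 10, "Study": -3, "Research": -3}
order = ["Work", "Study", "Research"]       # processed from the goal backwards

Q = {"Finish": 0}
next_state = "Finish"
for state in order:
    Q[state] = rewards[state] + Q[next_state]   # Q(s) = r + Q(next state)
    next_state = state

print(Q)  # {'Finish': 0, 'Work': 10, 'Study': 7, 'Research': 4}
```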


Approximate Q-learning
Definition
Train a model to approximate Q,
The input is a state s and the output is made of the values of $Q(s, \cdot)$,
Representation learning can be used to compress $S^\alpha$.

Problems
Almost always needs a simulator for the game,
Game duration can be a bottleneck for training,
Catastrophic forgetting and adversary specialization,
These effects can be alleviated by training using experience replay.

Experience replay
Instead of using only the last decision to train, sample at random from the m previous decisions,
Decisions taken earlier thus remain taken into account now.
A sketch of this training scheme is given below.
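A condensed sketch of approximate Q-learning with experience replay, using a linear model as in the lab; the feature extractor, buffer size, and hyperparameters are placeholders chosen here, not those of the actual TP4 code.

```python
import random
from collections import deque
import numpy as np

n_features, n_actions = 32, 4
buffer = deque(maxlen=10_000)             # experience replay memory of the m last decisions
W = np.zeros((n_actions, n_features))     # linear approximation: Q(s, .) = W @ features(s)
gamma, lr, batch_size = 0.9, 0.01, 64

def features(state):
    """Placeholder feature extractor mapping a state to a vector of length n_features."""
    return np.asarray(state, dtype=float)

def train_step():
    """Sample past decisions at random and regress Q toward their Bellman targets."""
    if len(buffer) < batch_size:
        return
    for s, a, r, s_next in random.sample(buffer, batch_size):
        phi, phi_next = features(s), features(s_next)
        target = r + gamma * np.max(W @ phi_next)
        error = target - (W @ phi)[a]
        W[a] += lr * error * phi          # gradient step on the squared error for action a

# During play: store each decision, then train on a random batch.
# buffer.append((state, action, reward, next_state)); train_step()
```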
Exploration/Exploitation

Dilemma
Repeat the existing strategy (exploitation)...
... or try a new strategy (exploration)?

Example
Always eating in restaurants that you already know is exploitation,
While that is a good heuristic, you have no way of knowing whether you are getting the maximum possible reward,
So exploring new restaurants from time to time may be needed to find the maximum reward.

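One standard way to balance the two with Q-learning (not stated on the slide, but a common choice) is an ε-greedy rule: exploit the current Q estimates most of the time, and explore a random action with probability ε.

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # exploration: random action
    return int(np.argmax(q_values))              # exploitation: current best action
```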


Lab Session 5

TP4 - PyRat with reinforcement learning


Approximate Q-learning algorithm using experience replay and linear regression to beat the greedy algorithm,
The approximation method (linear regression) and the experience replay routine are given,
Assemble all the primitives to perform Reinforcement Learning.

Challenge
You can continue working on the challenge after finishing TP4. You can now integrate reinforcement learning into your solution.
