Lesson 5 AI
2 Q-learning
Q-learning definitions
Example
Approximate Q-learning
Exploration/Exploitation
Reward hypothesis
All goals can be described by the maximization of expected
cumulative reward over time.
Definitions
The agent is either the Rat or the Python,
The opponent becomes part of the environment,
Note that the game can have perfect observability if the
opponent's strategy is known,
Seen this way, the game becomes sequential.
Observability examples
Perfect: s_t^{rat} = o_t
a_t : Last move of the rat,
o_t : The entire maze with all cheese locations and the python's position,
r_t : Binary variable which is 1 if the rat just got a piece of cheese.
Imperfect:
a_t : Last move of the rat,
o_t : Neighboring cells of the rat,
r_t : Binary variable which is 1 if the rat just got a piece of cheese.
Definition
The policy function of an agent α is:
π : S^α → A
    s_t^α ↦ a_t^α
Playout
The playout (s_t^{α,π})_{t∈ℕ} associated with a policy π and initial state s_0 is
defined by considering that agent α takes its actions using π.
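As a rough illustration, the sketch below rolls out a playout by repeatedly applying a policy. The `step` function, its return values, and `max_steps` are hypothetical placeholders for whatever game simulator is used, not part of the course API.

# Minimal playout sketch (hypothetical simulator interface).
# `policy` maps a state to an action; `step` is assumed to return
# (next_state, reward, done) for a state/action pair.

def playout(policy, step, s0, max_steps=1000):
    """Roll out (s_t, a_t, r_t) by following `policy` from the initial state s0."""
    trajectory = []
    state = s0
    for _ in range(max_steps):
        action = policy(state)                      # a_t = pi(s_t)
        next_state, reward, done = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory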
Definition
Fix γ ∈ [0, 1[. The value function v^π is defined as:
v^π : S^α → ℝ
      s_{t_0} ↦ Σ_{t=t_0}^{+∞} γ^{t−t_0} r_t
where the rewards r_t are those collected along the playout (s_t^{α,π}) starting from s_{t_0}.
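As a quick numerical illustration of this sum, the following sketch computes the discounted return of a finite reward sequence; the reward values and γ = 0.9 are made up for the example.

def discounted_return(rewards, gamma=0.9):
    """Compute sum over t of gamma^(t - t0) * r_t for a finite reward list starting at t0."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Rewards 0, 0, 1 received over three steps give 0.9^2 * 1 = 0.81.
print(discounted_return([0, 0, 1], gamma=0.9))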
2 Q-learning
Definition
∀s ∈ S^α, ∀a ∈ A, Q(s, a) = r_{s,a} + γ max_{a'} Q(s'(a), a'),
where s'(a) denotes the state reached from s by taking action a.
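In practice Q is learned iteratively from playouts rather than solved in closed form. Below is a minimal tabular Q-learning sketch; the simulator interface (step, actions, s0) and the hyperparameters are assumptions made for illustration, not the course's prescribed setup.

import random
from collections import defaultdict

def q_learning(step, actions, s0, episodes=1000,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        state, done = s0, False
        while not done:
            # Epsilon-greedy choice (exploration is discussed later in the lesson).
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = step(state, action)
            target = reward + gamma * max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q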
Example
Figure: a small decision graph with the options Research, Study, Work, and Do nothing leading towards a Finish state, with transition rewards R = 10, R = 4, R = -3, and R = -10. Propagating values backwards from Finish with the Q equation assigns the values 10, 7, 4, and -6, i.e. Q = 10, Q = 7, Q = 4, and Q = -6 for the corresponding choices.
Problems
Almost always needs a simulator for the game,
Game duration can be a bottleneck for training,
Catastrophic forgetting and adversary specialization,
These effects can be alleviated by training with experience replay.
Experience replay
Instead of using only the last decision to train, sample at random
from the m previous decisions,
Decisions taken earlier thus remain considered during training (a minimal buffer sketch follows).
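A minimal sketch of such a buffer, assuming transitions of the form (state, action, reward, next state, done); the capacity and batch size are arbitrary illustrative choices.

import random
from collections import deque

class ReplayBuffer:
    """Keep the last m transitions and sample training batches uniformly from them."""
    def __init__(self, m=10000):
        self.buffer = deque(maxlen=m)   # older transitions are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Sample at random from the m previous decisions instead of only the last one.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))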
Approximate Q-learning
Definition
Train a model to approximate Q,
The input is a state s and the output is made of the values of Q(s, ·),
Representation learning can be used to compress S^α (a minimal network sketch follows).
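A minimal sketch of such a model, assuming PyTorch is available; the state encoding (a flat vector of size 100) and the four possible moves are invented for illustration.

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximate Q: the input is a state vector, the output is one value of Q(s, a) per action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)

# Hypothetical usage: a flattened maze encoding of dimension 100 and 4 moves.
q_net = QNetwork(state_dim=100, n_actions=4)
q_values = q_net(torch.zeros(1, 100))   # shape (1, 4): one Q-value per action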
Exploration/Exploitation
Dilemma
Keep using an existing strategy (Exploitation)...
... or try a new strategy (Exploration)?
Example
Always eating in restaurants that you already know is exploitation,
While that is a good heuristic, you have no way of knowing whether you
are getting the maximum possible reward,
So exploring new restaurants from time to time may be needed to
find the maximum reward (a simple ε-greedy rule is sketched below).
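A common way to make this trade-off explicit is ε-greedy action selection; the table Q and the action list below are assumed to come from a tabular setup such as the earlier sketch.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore a random action with probability epsilon, otherwise exploit the best known one."""
    if random.random() < epsilon:
        return random.choice(actions)                    # exploration
    return max(actions, key=lambda a: Q[(state, a)])     # exploitation

# Decreasing epsilon over time shifts the balance from exploration towards exploitation.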
Challenge
You can continue working on the challenge after finishing TP4. You can
now integrate reinforcement learning into your solution.