
Q-learning

Watkins, C. J. C. H., and Dayan, P., "Q-learning",
Machine Learning, 8: 279–292 (1992)
Q value
When an agent takes action a_t in state s_t at time t,
the predicted future reward is defined as Q(s_t, a_t):

Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ... ]
Example) In state s_t the agent can choose among actions a1_t, a2_t, and a3_t, with
Q1(s_t, a_t) = 2, Q2(s_t, a_t) = 1, and Q3(s_t, a_t) = 0.
(Figure: trajectory s_t → s_{t+1} → s_{t+2} under actions a_t, a_{t+1}, a_{t+2},
with the rewards received along the transitions.)
Generally speaking, the agent should take action a1_t,
because the corresponding Q value, Q1(s_t, a_t), is the maximum.
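A minimal Python sketch of the discounted sum that defines the Q value above; the reward list and discount factor are made-up illustration values:

```python
# Discounted return behind Q(s_t, a_t): r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ...
def discounted_return(rewards, gamma=0.9):
    """rewards[k] stands for r_{t+k+1}; gamma is the discount factor."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.9*0 + 0.81*2 = 2.62
```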
Q learning
First, the Q value can be rewritten as follows:

Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ... ]
            = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ]                          … ①
            = E[ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} ]
            = E[ r_{t+1} ] + γ Q(s_{t+1}, a_{t+1})                   (using ①)

As a result, the Q value at time t can be computed simply
from r_{t+1} and the Q value of the next step.
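A quick numerical check of this decomposition (the rewards and γ below are arbitrary illustration values):

```python
# Check: Σ_k γ^k r_{t+k+1} = r_{t+1} + γ Σ_k γ^k r_{t+k+2}
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 0.5]          # hypothetical r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
full = sum(gamma ** k * r for k, r in enumerate(rewards))
tail = sum(gamma ** k * r for k, r in enumerate(rewards[1:]))
print(full, rewards[0] + gamma * tail)  # both print 2.9845
```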
Q learning
The Q value is updated at every step.

When the agent takes action a_t in state s_t
and receives reward r, the Q value is updated as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Here r + γ max_a Q(s_{t+1}, a) is the target value, Q(s_t, a_t) is the current value,
and the difference between them is the TD error.

α: step-size parameter (learning rate)
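A minimal sketch of a single update, assuming a tabular Q stored in a dict keyed by (state, action) and a small hypothetical action set:

```python
# One Q-learning update: Q(s,a) ← Q(s,a) + α[ r + γ max_a' Q(s',a') − Q(s,a) ]
ACTIONS = ("left", "right")   # hypothetical action set

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in ACTIONS)  # target value
    td_error = target - Q.get((s, a), 0.0)                              # TD error
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error                   # step toward target
    return td_error

Q = {}
print(q_update(Q, s=0, a="right", r=1.0, s_next=1))  # TD error = 1.0 on the first update
print(Q[(0, "right")])                                # 0.1 with α = 0.1
```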


Q learning algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q
            (e.g., greedy, ε-greedy)
        Take action a, observe r, s'
        Q(s,a) ← Q(s,a) + α[ r + γ max_a' Q(s',a') − Q(s,a) ]
        s ← s'
    until s is terminal
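A runnable sketch of this loop on a made-up 1-D chain environment (states 0–5, reward 1 on reaching the terminal state 5); the environment, action set, and ε-greedy policy are illustrative assumptions, not part of the original algorithm statement:

```python
import random
from collections import defaultdict

# Toy chain environment (hypothetical): states 0..5, actions -1/+1,
# reward 1 on reaching the terminal state 5, 0 otherwise.
TERMINAL = 5
ACTIONS = (-1, +1)

def step(s, a):
    s2 = min(max(s + a, 0), TERMINAL)
    return (1.0 if s2 == TERMINAL else 0.0), s2

def epsilon_greedy(Q, s, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda b: Q[(s, b)])

def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = defaultdict(float)                      # Initialize Q(s,a) arbitrarily (here: 0)
    for _ in range(episodes):                   # Repeat (for each episode)
        s = 0                                   # Initialize s
        while s != TERMINAL:                    # Repeat (for each step of episode)
            a = epsilon_greedy(Q, s, eps)       # choose a from s using ε-greedy policy
            r, s2 = step(s, a)                  # take action a, observe r, s'
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2                              # s ← s'
    return Q

Q = q_learning()
print(Q[(0, +1)], Q[(0, -1)])  # moving right (toward the goal) should score higher
```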
n-step return (reward)

(Figure: backup diagrams from the initial state at time t to the terminal state at
time T, comparing the 1-step (Q-learning), 2-step, ..., n-step, and Monte Carlo
returns. The 1-step to n-step returns use bootstrapping; the Monte Carlo return is a
complete-experience-based method.)

R_t^(1) = r_{t+1} + γ Q(s_{t+1}, a_{t+1})
R_t^(2) = r_{t+1} + γ r_{t+2} + γ² Q(s_{t+2}, a_{t+2})
...
R_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^(n−1) r_{t+n} + γ^n Q(s_{t+n}, a_{t+n})
R_t     = r_{t+1} + γ r_{t+2} + ... + γ^(n−1) r_{t+n} + ... + γ^(T−t−1) r_T   (Monte Carlo)
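A small sketch of the n-step return formula, where bootstrap_q stands for Q(s_{t+n}, a_{t+n}); the numbers below are made up for illustration:

```python
# R_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^(n−1) r_{t+n} + γ^n Q(s_{t+n}, a_{t+n})
def n_step_return(rewards, bootstrap_q, n, gamma=0.9):
    """rewards[k] = r_{t+k+1} for k = 0..n-1; bootstrap_q = Q(s_{t+n}, a_{t+n})."""
    assert n <= len(rewards)
    return sum(gamma ** k * rewards[k] for k in range(n)) + gamma ** n * bootstrap_q

# Illustrative rewards and Q values:
print(n_step_return([1.0, 0.5], bootstrap_q=2.0, n=1))  # R_t^(1) = 1 + 0.9·2.0        = 2.80
print(n_step_return([1.0, 0.5], bootstrap_q=2.0, n=2))  # R_t^(2) = 1 + 0.45 + 0.81·2.0 = 3.07
```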
n-step return (reward)

The expectation of every n-step return is the same Q value:

Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ... ]
            = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... + γ^n Q(s_{t+n}, a_{t+n}) ]
            = E[ r_{t+1} + γ r_{t+2} + γ² Q(s_{t+2}, a_{t+2}) ]
            = E[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) ]
λ-return (trace-decay parameter)
(Figure: backup diagrams of the λ-return, from the 1-step return through the n-step
return to the Monte Carlo return. Each return is weighted before being combined: the
1-step return has weight 1−λ, the 2-step return (1−λ)λ, the 3-step return (1−λ)λ²,
the n-step return (1−λ)λ^(n−1), and the Monte Carlo return λ^(T−t−1).)
λ-return:

R_t^(λ) = (1−λ) Σ_{n=1}^{∞} λ^(n−1) R_t^(n)
        = (1−λ) Σ_{n=1}^{T−t−1} λ^(n−1) R_t^(n) + λ^(T−t−1) R_t
λ-return (trace-decay parameter)

(Figure: example of the λ-return weighting over an episode, highlighting the weight given to the 3-step return.)
Eligibility trace and Replacing trace
Eligibility traces and replacing traces are useful for computing the n-step return.
These traces record how often and how recently each state has been visited.

Eligibility (accumulating) trace:
    e_t(s) = γλ e_{t−1}(s)          if s ≠ s_t
    e_t(s) = γλ e_{t−1}(s) + 1      if s = s_t

Replacing trace:
    e_t(s) = γλ e_{t−1}(s)          if s ≠ s_t
    e_t(s) = 1                      if s = s_t

(Figure: time course of the accumulating eligibility trace vs. the replacing trace under repeated visits to the same state.)
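A small sketch of both trace updates over states, assuming traces are stored in a dict; the γ and λ values and the visit sequence are illustrative:

```python
# Decay every trace by γλ, then bump the trace of the visited state s_t.
def update_traces(e, s_t, gamma=0.9, lam=0.8, replacing=True):
    """e maps state → trace value; updated in place at each step."""
    for s in e:
        e[s] *= gamma * lam                 # e_t(s) = γλ e_{t−1}(s) for s ≠ s_t
    if replacing:
        e[s_t] = 1.0                        # replacing trace: reset to 1 on a visit
    else:
        e[s_t] = e.get(s_t, 0.0) + 1.0      # accumulating trace: γλ e_{t−1}(s_t) + 1
    return e

e = {}
for visited in ["A", "B", "A"]:
    update_traces(e, visited, replacing=False)
print(e)  # {'A': (0.72)**2 + 1 ≈ 1.52, 'B': 0.72}
```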
Q(λ) algorithm
Q-learning:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Q(λ) with replacing trace:

    δ = r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)        (target value − current value)
    e_t(s_t, a_t) ← 1
    for all s, a:
        Q(s, a) ← Q(s, a) + α δ e(s, a)

(Figure: the transition s_t, a_t → s_{t+1}, with the traces e(s, a) spreading the update of Q(s_t, a_t) back over previously visited state–action pairs.)
Q(λ) algorithm
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., ε-greedy)
        a* ← arg max_b Q(s',b)   (if a' ties for the max, then a* ← a')
        δ ← r + γ Q(s',a*) − Q(s,a)
        e(s,a) ← 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            If a' = a*, then e(s,a) ← γλ e(s,a)
            else e(s,a) ← 0
        s ← s'; a ← a'
    until s is terminal
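A runnable sketch of this pseudocode (Watkins's Q(λ) with replacing traces) on the same made-up chain environment as the earlier Q-learning sketch; the environment, action set, and ε-greedy policy are illustrative assumptions:

```python
import random
from collections import defaultdict

# Toy chain environment (hypothetical): states 0..5, actions -1/+1, reward 1 at state 5.
TERMINAL = 5
ACTIONS = (-1, +1)

def step(s, a):
    s2 = min(max(s + a, 0), TERMINAL)
    return (1.0 if s2 == TERMINAL else 0.0), s2

def epsilon_greedy(Q, s, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda b: Q[(s, b)])

def q_lambda(episodes=500, alpha=0.1, gamma=0.9, lam=0.8, eps=0.1):
    Q = defaultdict(float)                                   # Initialize Q(s,a) arbitrarily
    for _ in range(episodes):                                # Repeat (for each episode)
        e = defaultdict(float)                               # e(s,a) = 0, for all s, a
        s = 0
        a = epsilon_greedy(Q, s, eps)                        # Initialize s, a
        while s != TERMINAL:                                 # Repeat (for each step)
            r, s2 = step(s, a)                               # take action a, observe r, s'
            a2 = epsilon_greedy(Q, s2, eps)                  # choose a' from s' (ε-greedy)
            a_star = max(ACTIONS, key=lambda b: Q[(s2, b)])  # a* ← arg max_b Q(s',b)
            if Q[(s2, a2)] == Q[(s2, a_star)]:               # if a' ties for the max,
                a_star = a2                                  #     then a* ← a'
            delta = r + gamma * Q[(s2, a_star)] - Q[(s, a)]  # δ ← r + γQ(s',a*) − Q(s,a)
            e[(s, a)] = 1.0                                  # replacing trace: e(s,a) ← 1
            for key in list(e):                              # for all s, a:
                Q[key] += alpha * delta * e[key]             #   Q(s,a) ← Q(s,a) + αδe(s,a)
                e[key] = gamma * lam * e[key] if a2 == a_star else 0.0
            s, a = s2, a2                                    # s ← s'; a ← a'
    return Q

Q = q_lambda()
print(Q[(4, +1)], Q[(4, -1)])  # moving toward the terminal state has the higher value
```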
