Q Learning
Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + … ]
Example)
Figure: in state s_t the agent can choose among three actions a1_t, a2_t, a3_t, with Q_1(s_t, a_t) = 2, Q_2(s_t, a_t) = 1 and Q_3(s_t, a_t) = 0; the trajectory then continues s_t → s_{t+1} → s_{t+2} via actions a_{t+1}, a_{t+2}, collecting rewards r_{t+1}, r_{t+2}, …
Generally speaking, the agent should take action a1_t, because the corresponding Q value Q_1(s_t, a_t) is the largest.
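As a minimal illustration of this greedy choice, the Python sketch below uses hypothetical Q values matching the figure and picks the action whose Q value is largest:

# Hypothetical Q values for the three actions available in state s_t (from the figure)
q_values = {"a1": 2.0, "a2": 1.0, "a3": 0.0}

# Greedy choice: take the action with the largest Q value
best_action = max(q_values, key=q_values.get)
print(best_action)  # -> "a1"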
Q learning
First, the Q value can be rewritten as follows.
Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + … ]
            = E[ Σ_{k=0}^{∞} γ^k r_{t+k+1} ]                          ①
            = E[ r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+k+2} ]
            = E[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) ]                    (by ①)
TD error
Q-learning updates Q toward the one-step target using the TD error δ = r + γ max_a Q(s’, a) − Q(s, a):
Q(s, a) ← Q(s, a) + α δ
At each step s←s’, and the loop repeats until s is terminal.
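For concreteness, here is a minimal tabular Q-learning sketch built around this update. It is not from the slides: the environment interface (env.reset(), env.step(a) returning (s’, r, done), env.actions) and the hyperparameter values are assumptions made for illustration.

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning; env is an assumed interface with reset(),
    # step(a) -> (s', r, done) and a discrete action list env.actions.
    Q = defaultdict(float)                       # Q[(s, a)], zero-initialized
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])
            s_next, r, done = env.step(a)
            # TD error: delta = r + gamma * max_a' Q(s', a') - Q(s, a)
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in env.actions)
            delta = r + gamma * best_next - Q[(s, a)]
            Q[(s, a)] += alpha * delta           # Q(s,a) <- Q(s,a) + alpha * delta
            s = s_next                           # s <- s'
    return Q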
n-step return (reward)
Figure: backup diagrams from the initial state (time t) down to the terminal state (time T), ranging from the 1-step backup (Q-learning, boot-strapping), through 2-step, …, n-step backups, to Monte Carlo (the complete-experience-based method); states and actions alternate along each backup.
R_t^(1) = r_{t+1} + γ Q(s_{t+1}, a_{t+1})
R_t^(2) = r_{t+1} + γ r_{t+2} + γ² Q(s_{t+2}, a_{t+2})
⋮
R_t^(n) = r_{t+1} + γ r_{t+2} + … + γ^(n−1) r_{t+n} + γ^n Q(s_{t+n}, a_{t+n})
R_t = r_{t+1} + γ r_{t+2} + … + γ^(n−1) r_{t+n} + … + γ^(T−t−1) r_T   (Monte Carlo return)
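A short sketch of how R_t^(n) can be computed from a stored trajectory; the function name and the convention that the caller supplies the bootstrap value Q(s_{t+n}, a_{t+n}) are illustrative assumptions.

def n_step_return(rewards, q_bootstrap, n, gamma=0.9):
    # rewards[i] holds r_{t+1+i}; q_bootstrap is Q(s_{t+n}, a_{t+n})
    # (0 if s_{t+n} is terminal).  Computes
    # R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * Q(s_{t+n}, a_{t+n})
    discounted = sum(gamma**i * rewards[i] for i in range(n))
    return discounted + gamma**n * q_bootstrap

# e.g. the 2-step return with r_{t+1}=1.0, r_{t+2}=0.5 and Q(s_{t+2}, a_{t+2})=2.0
print(n_step_return([1.0, 0.5], q_bootstrap=2.0, n=2))   # 1.0 + 0.45 + 0.81*2.0 = 3.07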
n-step return (reward)
Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + … ]
            = E[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) ]                                        (1-step)
            = E[ r_{t+1} + γ r_{t+2} + γ² Q(s_{t+2}, a_{t+2}) ]                           (2-step)
            ⋮
            = E[ r_{t+1} + γ r_{t+2} + … + γ^(n−1) r_{t+n} + γ^n Q(s_{t+n}, a_{t+n}) ]    (n-step)
Hence, in expectation, every n-step return R_t^(n) equals Q(s_t, a_t), so any of them can serve as the update target.
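As a sanity check of this equivalence, the sketch below uses an assumed deterministic 4-step reward sequence with exact Q values and verifies numerically that every n-step target coincides with Q(s_t, a_t):

gamma = 0.9
rewards = [1.0, 1.0, 1.0, 1.0]        # r_{t+1}, ..., r_{t+4}; the episode ends after 4 steps

def exact_q(k):
    # exact Q(s_{t+k}, a_{t+k}) on this deterministic chain: discounted sum of remaining rewards
    return sum(gamma**i * rewards[k + i] for i in range(len(rewards) - k))

def n_step_target(n):
    # R_t^(n) = r_{t+1} + ... + gamma^(n-1)*r_{t+n} + gamma^n * Q(s_{t+n}, a_{t+n})
    return sum(gamma**i * rewards[i] for i in range(n)) + gamma**n * exact_q(n)

print([round(n_step_target(n), 6) for n in range(1, 5)], round(exact_q(0), 6))
# every n-step target equals the true Q(s_t, a_t) = 3.439 when Q is exact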
λ-return (trace-decay parameter)
Figure: the backup diagrams from 1-step through 2-step, 3-step, …, n-step to Monte Carlo, now weighted: the n-step return R_t^(n) receives weight (1−λ)λ^(n−1) ((1−λ) for the 1-step return, (1−λ)λ for the 2-step return, …), and the Monte Carlo return receives weight λ^(T−t−1).
λ-return
R_t^(λ) = (1−λ) Σ_{n=1}^{∞} λ^(n−1) R_t^(n)
        = (1−λ) Σ_{n=1}^{T−t−1} λ^(n−1) R_t^(n) + λ^(T−t−1) R_t
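A small sketch of the episodic λ-return formula above; the list convention (element n−1 holds R_t^(n), the last element is the Monte Carlo return R_t) is an assumption for illustration.

def lambda_return(n_step_returns, lam):
    # Episodic lambda-return:
    #   R_t^(lambda) = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) * R_t^(n) + lam^(T-t-1) * R_t
    # n_step_returns[n-1] holds R_t^(n); the last element is the full Monte Carlo return R_t.
    T_minus_t = len(n_step_returns)
    weighted = (1 - lam) * sum(lam**(n - 1) * n_step_returns[n - 1]
                               for n in range(1, T_minus_t))
    return weighted + lam**(T_minus_t - 1) * n_step_returns[-1]

# lam=0 recovers the 1-step return, lam=1 the Monte Carlo return
print(lambda_return([3.0, 3.2, 3.4], lam=0.0),   # 3.0
      lambda_return([3.0, 3.2, 3.4], lam=1.0))   # 3.4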
λ-return (trace-decay parameter)
Figure: example of the weight assigned to the 3-step return R_t^(3).
Eligibility trace and Replacing trace
Eligibility and replacing traces are useful for calculating the n-step return.
These traces record how recently, and how often, each state has been visited.
Eligibility (accumulating) trace:
e_t(s) = γλ e_{t−1}(s) + 1   if s = s_t
e_t(s) = γλ e_{t−1}(s)       otherwise
Replacing trace:
e_t(s) = 1                   if s = s_t
e_t(s) = γλ e_{t−1}(s)       otherwise
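The per-step trace bookkeeping could be sketched as follows; the dictionary-based representation and the parameter values are assumptions, not part of the slides.

def update_traces(e, visited, gamma=0.9, lam=0.8, replacing=True):
    # Decay every trace by gamma*lambda, then bump the trace of the visited pair:
    #   accumulating (eligibility) trace:  e(visited) <- gamma*lam*e(visited) + 1
    #   replacing trace:                   e(visited) <- 1
    for key in e:
        e[key] *= gamma * lam
    if replacing:
        e[visited] = 1.0
    else:
        e[visited] = e.get(visited, 0.0) + 1.0
    return e

e = update_traces({}, visited=("s1", "a1"))      # e[("s1","a1")] == 1.0
e = update_traces(e, visited=("s2", "a0"))       # old trace decays to 0.72, new one is 1.0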
Q(λ) algorithm
Q-learning updates only the most recently visited pair:
Q(s_t, a_t) ← Q(s_t, a_t) + α[ r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
In Q(λ), the trace of the visited pair is set to e_t(s_t, a_t) ← 1, and the same TD error δ is applied to every pair, weighted by its trace:
for all s, a:
Q(s, a) ← Q(s, a) + α δ e(s, a)
Q(λ) algorithm
Initialize Q(s,a) arbitrarily and e(s,a)=0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step):
        Take action a, observe r, s’
        Choose a’ from s’ using policy derived from Q (e.g., ε-greedy)
        a*←argmax_b Q(s’,b) (if a’ ties for the max, then a*←a’)
        δ←r+γQ(s’,a*)−Q(s,a)
        e(s,a)←1
        For all s, a:
            Q(s,a)←Q(s,a)+αδe(s,a)
            If a’=a*, then e(s,a)←γλe(s,a)
            else e(s,a)←0
        s←s’; a←a’
    until s is terminal
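For reference, a direct Python transcription of the listing above, as a non-authoritative sketch: the environment interface (env.reset(), env.step(a) returning (s’, r, done), env.actions) and the hyperparameter values are assumptions.

import random
from collections import defaultdict

def watkins_q_lambda(env, episodes=500, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
    # Watkins's Q(lambda), following the listing above, with a replacing trace.
    Q = defaultdict(float)                           # Q(s,a), arbitrarily (zero) initialized
    def epsilon_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda b: Q[(s, b)])
    for _ in range(episodes):
        e = defaultdict(float)                       # e(s,a) = 0 for all s, a
        s = env.reset()                              # initialize s, a
        a = epsilon_greedy(s)
        done = False
        while not done:                              # repeat for each step
            s_next, r, done = env.step(a)            # take action a, observe r, s'
            a_next = epsilon_greedy(s_next)          # choose a' from s' (epsilon-greedy)
            a_star = max(env.actions, key=lambda b: Q[(s_next, b)])
            if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
                a_star = a_next                      # if a' ties for the max, a* <- a'
            bootstrap = 0.0 if done else gamma * Q[(s_next, a_star)]
            delta = r + bootstrap - Q[(s, a)]        # delta = r + gamma*Q(s',a*) - Q(s,a)
            e[(s, a)] = 1.0                          # e(s,a) <- 1
            for key in list(e):                      # for all s, a with a nonzero trace
                Q[key] += alpha * delta * e[key]
                if a_next == a_star:
                    e[key] *= gamma * lam            # e <- gamma*lambda*e after a greedy a'
                else:
                    e[key] = 0.0                     # cut traces after an exploratory a'
            s, a = s_next, a_next                    # s <- s'; a <- a'
    return Q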