MAS-Lab7-QFA
- Lab 7 -
Q-Learning with Linear Value Function Approximation
Q-Learning Recap
● The action-value function explicitly stores the value of executing an action in a given state: q(s, a)

  $q_\pi(s, a) = E_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a\right] = E_\pi\left[\sum_{\tau = t+1}^{\infty} \gamma^{\tau - t - 1} R_\tau \,\middle|\, S_t = s, A_t = a\right]$
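  As a purely illustrative example: with $\gamma = 0.9$ and a trajectory that collects rewards $R_{t+1} = R_{t+2} = R_{t+3} = 1$ and then terminates, $q_\pi(s, a) = 1 + 0.9 + 0.81 = 2.71$.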
● Instance of model-free learning – i.e. the environment dynamics are unknown to the agent
● So far we have tackled environments where the number of states is small enough to use a tabular representation of the Q-function
Q-Learning Recap
● Many real-world problems have enormous state and/or action spaces (e.g. robotics control, self-driving)
● A tabular representation is no longer appropriate
● Idea: use a function to approximate the value
Q-Learning with Linear Value Function Approximation – General Formulation
● Use features to represent state and action: $x(s, a) = \big(x_1(s, a), x_2(s, a), \dots, x_n(s, a)\big)^T$
● Q-function represented as a weighted linear combination of features:

  $\hat{Q}(s, a, w) = x(s, a)^T w = \sum_{j=1}^{n} x_j(s, a)\, w_j$

● Learn the weights w through stochastic gradient descent updates on the objective

  $\nabla_w J(w) = \nabla_w E_\pi\left[\left(Q_\pi(s, a) - \hat{Q}(s, a, w)\right)^2\right]$
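Expanding this gradient for the linear case makes the resulting update explicit (the factor of 2 is conventionally absorbed into the learning rate α); a short derivation using only the definitions above:

  $\nabla_w J(w) = -2\, E_\pi\left[\left(Q_\pi(s, a) - \hat{Q}(s, a, w)\right) \nabla_w \hat{Q}(s, a, w)\right] = -2\, E_\pi\left[\left(Q_\pi(s, a) - \hat{Q}(s, a, w)\right) x(s, a)\right]$

so a single stochastic update moves the weights by

  $\Delta w = \alpha \left(Q_\pi(s, a) - \hat{Q}(s, a, w)\right) x(s, a)$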
Q-Learning with Linear Value Function Approximation – Simplified
● When the action space A is small and finite, consider a featurised representation of states only: $x(s) = \big(x_1(s), x_2(s), \dots, x_n(s)\big)^T$
● Q-function represented as a collection of weighted linear combinations of features – one model per action:

  $\hat{Q}_a(s, w) = x(s)^T w = \sum_{j=1}^{n} x_j(s)\, w_j, \quad \forall a \in A$

● Learn the weights w through stochastic gradient descent updates on the objective

  $\nabla_w J(w) = \nabla_w E_\pi\left[\left(Q_\pi(s, a) - \hat{Q}_a(s, w)\right)^2\right]$
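A minimal NumPy sketch of such a per-action linear estimator; the class name LinearQEstimator and its predict/update interface are assumptions chosen here to match the pseudocode used later, not code supplied with the lab.

import numpy as np

class LinearQEstimator:
    """One linear model per action: Q̂_a(s, w) = x(s)^T w_a."""

    def __init__(self, n_features, n_actions, lr=0.01):
        self.W = np.zeros((n_actions, n_features))  # one weight vector per action
        self.lr = lr

    def predict(self, x):
        # Returns [Q̂_a1(s), ..., Q̂_am(s)] for the featurised state x = x(s)
        return self.W @ x

    def update(self, x, a, target):
        # Semi-gradient SGD step on (target - Q̂_a(s, w))^2, touching only action a's weights
        td_error = target - self.W[a] @ x
        self.W[a] += self.lr * td_error * x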
Q-Learning with Linear Value Function Approximation – TD Target
● For the Q-function, instead of the actual return per episode under the current policy, $Q_\pi(s, a)$, use the TD target $r + \gamma \max_{a'} \hat{Q}(s', a', w)$
● Learn the weights w through stochastic gradient descent updates (the target is treated as a constant when differentiating, i.e. a semi-gradient update):

  $\nabla_w J(w) = \nabla_w \left(r + \gamma \max_{a'} \hat{Q}_{a'}(s', w) - \hat{Q}_a(s, w)\right)^2$

  $\Delta w = \alpha \left(r + \gamma \max_{a'} \hat{Q}_{a'}(s', w) - \hat{Q}_a(s, w)\right) x(s)$
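Using the (assumed) LinearQEstimator sketched earlier, a single semi-gradient Q-learning step then reads as follows, where x and x_next are the feature vectors of the current and next state, a is the action taken, r the reward and gamma the discount factor (all placeholder names for this sketch).

q_next = estimator.predict(x_next)       # [Q̂_a'(s', w) for every action a']
td_target = r + gamma * np.max(q_next)   # r + γ max_a' Q̂_a'(s', w)
estimator.update(x, a, td_target)        # Δw_a = α (td_target − Q̂_a(s, w)) x(s)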
Q-Learning, Linear Approximation, TD target
for each episode do
    s ← initial state, os ← initial observation
    while s not final state do
        pick action a using ε-Greedy(os, estimator, ε)
        execute a → get reward r, next state s' and observation os'
        x(s') = featurize(os')
        [q̂_a1(s'), …, q̂_am(s')] = estimator.predict(x(s'))
        td_target = r + γ max_a' q̂_a'(s')
        estimator.update(x(s), a, td_target)
        s ← s', os ← os'
    end while
end for

procedure ε-Greedy(os, estimator, ε)
    x(s) = featurize(os)
    with probability ε: return a random action from A
    otherwise: return argmax_a q̂_a(s), with [q̂_a1(s), …, q̂_am(s)] = estimator.predict(x(s))
end procedure

● Q-values are adjusted through temporal differences
● Learning is off-policy: the learning policy is greedy, while the play policy allows for exploration
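A possible Python rendering of this loop for any Gymnasium environment with a discrete action space. It assumes the LinearQEstimator sketch above and a featurize(obs) function returning a feature vector of length FEATURE_DIM (one possible featuriser is sketched in the Cartpole section below); the hyperparameter values are placeholders, not part of the lab skeleton.

import numpy as np
import gymnasium as gym

def epsilon_greedy(obs, estimator, epsilon, n_actions, rng):
    # With probability ε pick a random action, otherwise the greedy one
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(estimator.predict(featurize(obs))))

def train(env_id="CartPole-v1", n_episodes=500, gamma=0.99, epsilon=0.1):
    env = gym.make(env_id)
    rng = np.random.default_rng(0)
    n_actions = env.action_space.n
    estimator = LinearQEstimator(n_features=FEATURE_DIM, n_actions=n_actions)
    returns = []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return = False, 0.0
        while not done:
            a = epsilon_greedy(obs, estimator, epsilon, n_actions, rng)
            next_obs, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # TD target: bootstrap from the next state unless the episode terminated
            q_next = 0.0 if terminated else np.max(estimator.predict(featurize(next_obs)))
            estimator.update(featurize(obs), a, r + gamma * q_next)
            obs, ep_return = next_obs, ep_return + r
        returns.append(ep_return)
    return returns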
OpenAI Gym Cartpole Environment
● Cartpole-v1 environment in Gymnasium:
– Objective: keep a pendulum upright for as long as possible
– 2 actions: left (force = -1), right (force = +1)
– Reward: +1 for every time step that the pole remains upright
– Game ends when the pole is more than 15° from vertical OR the cart moves > 2.4 units from the center
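For reference, creating and stepping this environment with the Gymnasium API looks roughly like the snippet below; the comments describe Gymnasium's standard spaces for CartPole-v1.

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)   # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): 0 = push left, 1 = push right

obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
# reward is +1 per step; terminated becomes True once the pole tips too far or the cart leaves the track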
OpenAI Gym Cartpole Environment
● Cartpole-v1 environment in Gymnasium:
– Max number of steps per episode = 100
– Use one Q-function estimator per action: q̂_left, q̂_right
– Use an MLP-like feature extractor (a possible implementation is sketched after this list)
● Sample 4 sets of weights and 4 sets of biases:
– $w^k_{ij} \sim \sqrt{2 \gamma_k}\, N(0, 1)$, $k = 1..4$, $i = 1..100$, $j = 1..4$, with $\gamma_k \in \{5.0, 2.0, 1.0, 0.5\}$
– $b^k_i \sim \mathrm{uniform}(0, 2\pi)$, $k = 1..4$, $i = 1..100$
● Explore three different activation functions: cos(x), sigmoid(x), tanh(x)
● Explore different values of the SGD learning rate
● Explore different values of ε: ε = 0.0, ε = 0.1, ε = decay(init = 0.1, min = 0.001, factor = 0.99)
– Plot agent learning curves for each case
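One possible implementation of the sampled feature extractor described above (random-projection features with the cos activation, in 4 blocks of 100 features following the sampling scheme on this slide); the seed, names and the small ε-decay helper at the end are assumptions for this sketch, not part of the lab skeleton.

import numpy as np

rng = np.random.default_rng(42)          # seed chosen arbitrarily for the sketch
GAMMAS = [5.0, 2.0, 1.0, 0.5]            # gamma_k values from the slide
N_FEATURES, OBS_DIM = 100, 4             # i = 1..100 features per set, j = 1..4 observation dims

# One (W, b) pair per gamma_k: w^k_ij ~ sqrt(2 * gamma_k) * N(0, 1), b^k_i ~ uniform(0, 2*pi)
WEIGHTS = [np.sqrt(2.0 * g) * rng.standard_normal((N_FEATURES, OBS_DIM)) for g in GAMMAS]
BIASES = [rng.uniform(0.0, 2.0 * np.pi, size=N_FEATURES) for _ in GAMMAS]

def featurize(obs, activation=np.cos):
    # Concatenate the four blocks of 100 features: activation(W^k · obs + b^k)
    # Swap in np.tanh or a sigmoid to explore the other activation functions
    return np.concatenate([activation(W @ obs + b) for W, b in zip(WEIGHTS, BIASES)])

FEATURE_DIM = 4 * N_FEATURES             # 400 features in total

def decayed_epsilon(eps, minimum=0.001, factor=0.99):
    # Multiplicative decay applied once per episode, floored at `minimum`
    # (matches the decay(init, min, factor) schedule above)
    return max(minimum, eps * factor)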