
Multi Agent Systems

- Lab 7 -
Q-Learning with Linear Value Function Approximation
Q-Learning Recap


The Q-value function explicitly stores the value of executing an action in a given state, q(s, a):

$q_\pi(s, a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid S_t = s, A_t = a\right] = \mathbb{E}_\pi\left[\sum_{\tau = t+1}^{\infty} \gamma^{\tau - t - 1} R_\tau \mid S_t = s, A_t = a\right]$

Q-Learning is an instance of model-free learning, i.e. the environment dynamics are unknown to the agent.

So far we have tackled environments where the number of states is small enough to use a tabular representation of the Q-function.
Q-Learning Recap

Agent learns by observing the consequences of the actions it takes in the environment

Q-values are adjusted through temporal differences

Learning is off-policy:
– the learning policy is greedy
– the play policy allows for exploration

procedure Q-Learning (<S, A, γ>, ε)
    for all s in S, a in A do
        q(s, a) ← 0                        // set initial values to 0
    end for
    for all episodes do
        s ← initial state
        while s not final state do
            pick action a using ε-Greedy (s, q, ε)
            execute a → get reward r and next state s'
            q(s, a) ← q(s, a) + α(r + γ max_a' q(s', a') − q(s, a))
            s ← s'
        end while
    end for
    for all s in S do
        π(s) ← argmax_{a in A} q(s, a)
    end for
    return π
end

procedure ε-Greedy (s, q, ε)
    with prob ε:   return random(A)
    with prob 1−ε: return argmax_a q(s, a)
end
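A minimal Python sketch of the tabular algorithm above, assuming a Gymnasium-style environment with a discrete, hashable state space (the parameters num_episodes, alpha, gamma, epsilon are illustrative, not from the slides):

    import random
    from collections import defaultdict

    def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # q[s][a] starts at 0 for every (state, action) pair
        q = defaultdict(lambda: [0.0] * env.action_space.n)

        def epsilon_greedy(s):
            if random.random() < epsilon:
                return env.action_space.sample()                          # explore
            return max(range(env.action_space.n), key=lambda a: q[s][a])  # exploit

        for _ in range(num_episodes):
            s, _ = env.reset()
            done = False
            while not done:
                a = epsilon_greedy(s)
                s_next, r, terminated, truncated, _ = env.step(a)
                done = terminated or truncated
                # off-policy TD update: bootstrap with the greedy value at s'
                target = r if terminated else r + gamma * max(q[s_next])
                q[s][a] += alpha * (target - q[s][a])
                s = s_next

        # extract the greedy policy from the learned Q-table
        return {s: max(range(env.action_space.n), key=lambda a: q[s][a]) for s in q}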
Q-Learning in continuous state space


Many real-world problems have enormous state and/or action spaces (e.g. robotics control, self-driving).

A tabular representation is no longer appropriate.

Idea: use a function to represent the value.
Q-Learning with Linear Value Function Approximation – General Formulation

Use features to represent the state and action:
$x(s, a) = \big(x_1(s, a),\ x_2(s, a),\ \dots,\ x_n(s, a)\big)^T$

The Q-function is represented as a weighted linear combination of features:
$\hat{Q}(s, a, w) = x(s, a)^T w = \sum_{j=1}^{n} x_j(s, a)\, w_j$

Learn the weights w through stochastic gradient descent updates:
$\nabla_w J(w) = \nabla_w \mathbb{E}_\pi\big[(Q^\pi(s, a) - \hat{Q}(s, a, w))^2\big]$
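A minimal NumPy sketch of this general form, assuming the feature vector x(s, a) and the regression target come from elsewhere (the function names q_hat and sgd_step are illustrative, not part of the lab code):

    import numpy as np

    def q_hat(x_sa, w):
        # Q̂(s, a, w) = x(s, a)ᵀ w : a weighted linear combination of the features
        return x_sa @ w

    def sgd_step(w, x_sa, target, alpha=0.01):
        # gradient of (target − Q̂)² w.r.t. w is −2 (target − Q̂) x(s, a);
        # the constant factor is folded into the learning rate α
        error = target - q_hat(x_sa, w)
        return w + alpha * error * x_sa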
Q-Learning with Linear Value Function Approximation – Simplified

When the action space A is small and finite, consider a featurised representation of the states only:
$x(s) = \big(x_1(s),\ x_2(s),\ \dots,\ x_n(s)\big)^T$

The Q-function is represented as a collection of weighted linear combinations of features, one model per action:
$\hat{Q}_a(s, w_a) = x(s)^T w_a = \sum_{j=1}^{n} x_j(s)\, w_{a, j}, \quad \forall a \in A$

Learn the weights w through stochastic gradient descent updates:
$\nabla_w J(w) = \nabla_w \mathbb{E}_\pi\big[(Q^\pi(s, a) - \hat{Q}_a(s, w))^2\big]$
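A sketch of this per-action linear model, assuming one weight vector per action stored as a row of a matrix (the class name LinearQEstimator and its method signatures are illustrative):

    import numpy as np

    class LinearQEstimator:
        # One linear model per action: Q̂_a(s) = x(s)ᵀ w_a, with w_a stored as row a of W.

        def __init__(self, n_features, n_actions):
            self.W = np.zeros((n_actions, n_features))    # weights initialised to zero

        def predict(self, x_s):
            # vector of estimates [Q̂_a(s) for every action a]
            return self.W @ x_s

        def update(self, x_s, a, target, lr=0.01):
            # semi-gradient step towards `target`, touching only the chosen action's weights
            td_error = target - self.W[a] @ x_s
            self.W[a] += lr * td_error * x_s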
Q-Learning with Linear Value Function Approximation – TD Target

For the Q-function, instead of the actual return per episode under the current policy, $Q^\pi(s, a)$, use the TD target $r + \gamma \max_{a'} \hat{Q}_{a'}(s', w)$.

Learn the weights w through stochastic gradient descent updates:
$\nabla_w J(w) = \nabla_w \big(r + \gamma \max_{a'} \hat{Q}_{a'}(s', w) - \hat{Q}_a(s, w)\big)^2$
$\Delta w = \alpha \big(r + \gamma \max_{a'} \hat{Q}_{a'}(s', w) - \hat{Q}_a(s, w)\big) \nabla_w \hat{Q}_a(s, w)$
$\Delta w = \alpha \big(r + \gamma \max_{a'} \hat{Q}_{a'}(s', w) - \hat{Q}_a(s, w)\big)\, x(s)$
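A minimal sketch of one such update step on the per-action weight matrix W from the earlier sketch (the function name td_update and the `done` flag handling are assumptions, not part of the slide):

    import numpy as np

    def td_update(W, x_s, a, r, x_s_next, alpha=0.01, gamma=0.99, done=False):
        # TD target: r + γ max_a' Q̂_a'(s'); no bootstrapping from terminal states
        target = r if done else r + gamma * np.max(W @ x_s_next)
        td_error = target - W[a] @ x_s          # δ = target − Q̂_a(s)
        W[a] += alpha * td_error * x_s          # Δw_a = α · δ · x(s)
        return W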
Q-Learning, Linear Approximation, TD target

Agent learns by observing the consequences of the actions it takes in the environment

Q-values are adjusted through temporal differences

Learning is off-policy:
– the learning policy is greedy
– the play policy allows for exploration

procedure Q-Learning (<S, A, γ>, ε, estimator)
    for all episodes do
        s ← initial state (observation os)
        while s not final state do
            pick action a using ε-Greedy (os, estimator, ε)
            execute a → get reward r and next state s' (observation os')
            x(s') = featurize(os')
            [q̂_a1(s'), …, q̂_am(s')] = estimator.predict(x(s'))
            td_target = r + γ max_a' q̂_a'(s')
            estimator.update(x(s), a, td_target)
            s ← s'
        end while
    end for
    return π, with π(s) = argmax_{a in A} q̂_a(s)    // greedy policy induced by the estimator
end

procedure ε-Greedy (os, estimator, ε)
    x(s) = featurize(os)
    q̂ = estimator.predict(x(s))
    with prob ε:   return random(A)
    with prob 1−ε: return argmax_a q̂_a(s)
end
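A Python sketch of this training loop under the same assumptions as the earlier snippets: it reuses the hypothetical LinearQEstimator and a featurize callable, and env follows the Gymnasium API.

    import random
    import numpy as np

    def q_learning_fa(env, featurize, estimator, num_episodes=300,
                      alpha=0.01, gamma=0.99, epsilon=0.1):
        # Semi-gradient Q-learning with a per-action linear estimator.
        returns = []                                   # per-episode return, for learning curves

        def epsilon_greedy(x_s):
            if random.random() < epsilon:
                return env.action_space.sample()                  # explore
            return int(np.argmax(estimator.predict(x_s)))         # exploit (greedy)

        for _ in range(num_episodes):
            obs, _ = env.reset()
            x_s, done, total = featurize(obs), False, 0.0
            while not done:
                a = epsilon_greedy(x_s)
                obs_next, r, terminated, truncated, _ = env.step(a)
                done = terminated or truncated
                x_s_next = featurize(obs_next)
                # TD target bootstraps from the estimator's own prediction at s'
                td_target = r if terminated else r + gamma * np.max(estimator.predict(x_s_next))
                estimator.update(x_s, a, td_target, lr=alpha)
                x_s, total = x_s_next, total + r
            returns.append(total)
        return returns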
OpenAI Gym Cartpole Environment


CartPole-v1 environment in Gymnasium:
– Objective: keep the pendulum upright for as long as possible
– 2 actions: push the cart left (action 0) or right (action 1)
– Reward: +1 for every time step that the pole remains upright
– The episode ends when the pole is more than 12° from vertical OR the cart moves more than 2.4 units from the center
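A short Gymnasium usage sketch for this environment, running a random policy just to show the interaction loop (the 4-dimensional observation is [cart position, cart velocity, pole angle, pole angular velocity]):

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=0)

    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()             # random action: 0 = push left, 1 = push right
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward                         # +1 per step while the pole stays up
        done = terminated or truncated

    print(f"Random policy return: {total_reward}")
    env.close()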
OpenAI Gym Cartpole Environment


CartPole-v1 environment in Gymnasium:
– Max number of steps per episode = 100
– Use one Q-function estimator per action: q̂_left, q̂_right
– Use an MLP-like feature extractor (see the sketch after this list):
    • Sample 4 sets of weights and 4 sets of biases:
      $w^k_{ij} \sim \sqrt{2\gamma_k}\,\mathcal{N}(0, 1)$, with k = 1..4, i = 1..100, j = 1..4 and $\gamma_k \in \{5.0, 2.0, 1.0, 0.5\}$
      $b^k_i \sim \mathrm{Uniform}(0, 2\pi)$, with k = 1..4, i = 1..100
    • Explore three different activation functions: cos(x), sigmoid(x), tanh(x)
– Explore different values of the SGD learning rate
– Explore different values of ε: ε = 0.0, ε = 0.1, ε = decay(init = 0.1, min = 0.001, factor = 0.99)
– Plot agent learning curves for each case
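A sketch of such a feature extractor and of the resulting experiment setup, under the stated assumptions: the class name RandomFeatureExtractor is illustrative, and it is wired to the hypothetical LinearQEstimator and q_learning_fa from the earlier sketches.

    import numpy as np
    import gymnasium as gym

    class RandomFeatureExtractor:
        # 4 blocks of 100 random projections of the 4-dimensional observation,
        # one block per bandwidth γ_k, followed by a nonlinearity.
        def __init__(self, obs_dim=4, n_components=100,
                     gammas=(5.0, 2.0, 1.0, 0.5), activation=np.cos, seed=0):
            rng = np.random.default_rng(seed)
            # W[k]: (100, 4) with entries ~ sqrt(2 γ_k) · N(0, 1)
            self.W = [np.sqrt(2.0 * g) * rng.standard_normal((n_components, obs_dim))
                      for g in gammas]
            # b[k]: (100,) with entries ~ Uniform(0, 2π)
            self.b = [rng.uniform(0.0, 2.0 * np.pi, n_components) for _ in gammas]
            self.activation = activation

        def __call__(self, obs):
            obs = np.asarray(obs, dtype=np.float64)
            # concatenate the 4 blocks into a single 400-dimensional feature vector x(s)
            return np.concatenate([self.activation(W @ obs + b)
                                   for W, b in zip(self.W, self.b)])

    # Illustrative experiment setup: 100 steps per episode, cos features, ε = 0.1.
    env = gym.make("CartPole-v1", max_episode_steps=100)
    featurize = RandomFeatureExtractor(activation=np.cos)
    estimator = LinearQEstimator(n_features=400, n_actions=env.action_space.n)
    returns = q_learning_fa(env, featurize, estimator, epsilon=0.1)
    # `returns` holds the per-episode return; plot it to obtain the learning curve.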
