
CS 234: Assignment #2

Due date: April 26, 2024 at 6:00 PM (18:00) PST

These questions require thought but do not require long answers. Please be as concise as possible.

We encourage students to discuss in groups for assignments. However, each student must finish
the problem set and programming assignment individually, and must turn in her/his
assignment. We ask that you abide by the university Honor Code and that of the Computer Science
department, and make sure that all of your submitted work is done by yourself. If you have discussed
the problems with others, please include a statement saying who you discussed problems with. Failure
to follow these instructions will be reported to the Office of Community Standards. We reserve the
right to run fraud-detection software on your code.

Please review any additional instructions posted on the assignment page at
http://web.stanford.edu/class/cs234/assignments.html. When you are ready to submit, please
follow the instructions on the course website.

1 Deep Q-Networks (DQN) (8 pts writeup)


All questions in this section pertain to DQN. The pseudocode for DQN is provided below.

Algorithm 1 Deep Q-Network (DQN)


1: Initialize replay buffer D
2: Initialize action-value function Q with random weights θ
3: Initialize target action-value function Q̂ with weights θ− = θ
4: for episode = 1, M do
5: Receive initial state s1
6: for t = 1, T do
7: With probability ϵ select a random action at
8: otherwise select at = argmaxa Q(st , a; θ)
9: Execute action at and observe reward rt and state st+1
10: Store transition (st , at , rt , st+1 ) in D
11: Sample random minibatch B of transitions from D
12: for each transition (sj , aj , rj , sj+1 ) in B do
13: if sj+1 is terminal then
14: Set yj = rj
15: else
16: Set yj = rj + γ maxa′ Q̂(sj+1 , a′ ; θ− )
17: end if
18: Perform gradient descent step on (yj − Q(sj , aj ; θ))² with respect to network parameters θ
19: end for
20: Every C steps reset Q̂ = Q by setting θ− = θ
21: end for
22: end for

In this pseudocode:


• D is the replay memory which stores transitions.

• θ are the weights of the Q-network, which are adjusted during training.

• θ− are the weights of the target network, which are periodically updated to match θ.

• M is the number of episodes over which the training occurs.

• T is the maximum number of steps in each episode.

• ϵ is the exploration rate, which is typically decayed over time.

• γ is the discount factor, used to weigh future rewards.

• C is the frequency with which to update the target network’s weights.
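
To make the minibatch target computation and gradient step (lines 11–20 of Algorithm 1) concrete, here is a minimal PyTorch sketch. It is illustrative only and not part of the assignment starter code; the function name dqn_update and the batch layout are assumptions.

# Illustrative sketch of one DQN minibatch update (lines 11-20 of Algorithm 1),
# assuming PyTorch; `dqn_update` and the batch layout are hypothetical, not starter code.
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on (y_j - Q(s_j, a_j; theta))^2 for a sampled minibatch."""
    states, actions, rewards, next_states, done = batch  # actions: int64 [B], done: bool [B]

    # Q(s_j, a_j; theta) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = r_j if s_{j+1} is terminal, else r_j + gamma * max_a' Q_hat(s_{j+1}, a'; theta^-).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - done.float()) * next_q

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Every C environment steps, reset Q_hat = Q by copying the online weights:
#   target_net.load_state_dict(q_net.state_dict())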

1.1 Written Questions (8 pts)


(a) (3 pts) (written) What are three key differences between the DQN and Q-learning algorithms?

(b) (2 pts) (written) When using DQN with a deep neural network, which of the above components would you
hypothesize contributes most to performance gains? Justify your answer.

(c) (3 pts) (written) In DQN, the choice of target network update frequency is important. What might happen
if the target network is updated every 10^15 steps for an agent learning to play a simple Atari game like
Pong?


2 Policy Gradient Methods (54 pts coding + 26 pts writeup)


The goal of this problem is to experiment with policy gradient and its variants, including variance reduction
and off-policy methods. Your goals will be to set up policy gradient for both continuous and discrete
environments, use a neural network baseline for variance reduction, and implement the off-policy Proximal
Policy Optimization algorithm. The starter code has detailed instructions for each coding task and includes
a README with instructions to set up your environment. Below, we provide an overview of the key steps
of the algorithm.

2.1 REINFORCE
Recall the policy gradient theorem,

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ]

REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns G_t
as unbiased estimates of Q^{π_θ}(s, a). The REINFORCE estimator can be expressed as the gradient of the
following objective function:

J(θ) = (1 / Σ_i T_i) Σ_{i=1}^{|D|} Σ_{t=1}^{T_i} log π_θ(a^i_t | s^i_t) G^i_t

where D is the set of all trajectories collected by policy π_θ, and τ^i = (s^i_0, a^i_0, r^i_0, s^i_1, . . . , s^i_{T_i}, a^i_{T_i}, r^i_{T_i}) is
trajectory i.
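
As a point of reference, the objective above amounts to averaging log π_θ(a_t|s_t) · G_t over all flattened timesteps. The sketch below is a simplified, hypothetical version (not the starter-code interface), assuming PyTorch tensors of log-probabilities and returns:

# Minimal sketch of the REINFORCE objective, assuming flattened PyTorch tensors
# of shape [sum_i T_i]; this is not the starter-code interface.
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # Negate because optimizers minimize; stepping on this loss performs
    # gradient ascent on J(theta).
    return -(log_probs * returns).mean()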

2.2 Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled return(s) G_t
can have high variance. To reduce variance, we subtract a baseline b_ϕ(s) from the estimated returns when
computing the policy gradient. A good baseline is the state value function, V^{π_θ}(s), which requires a training
update to ϕ to minimize the following mean-squared error loss:

L_MSE(ϕ) = (1 / Σ_i T_i) Σ_{i=1}^{|D|} Σ_{t=1}^{T_i} (b_ϕ(s^i_t) − G^i_t)²
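
A sketch of this regression step, assuming a hypothetical value network baseline_net that maps states to scalar values (PyTorch, not the starter-code API):

import torch
import torch.nn.functional as F

def baseline_loss(baseline_net, states, returns):
    # b_phi(s_t) for every flattened timestep; squeeze the trailing value dimension.
    values = baseline_net(states).squeeze(-1)
    return F.mse_loss(values, returns)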

2.3 Advantage Normalization


After subtracting the baseline, we get the following new objective function:

J(θ) = (1 / Σ_i T_i) Σ_{i=1}^{|D|} Σ_{t=1}^{T_i} log π_θ(a^i_t | s^i_t) Â^i_t

where

Â^i_t = G^i_t − b_ϕ(s^i_t)


A second variance reduction technique is to normalize the computed advantages, Âit , so that they have mean
0 and standard deviation 1. From a theoretical perspective, we can consider centering the advantages to be
simply adjusting the advantages by a constant baseline, which does not change the policy gradient. Likewise,
rescaling the advantages effectively changes the learning rate by a factor of 1/σ, where σ is the standard
deviation of the empirical advantages.
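
A one-line sketch of this normalization (PyTorch; the small epsilon is an assumption here, guarding against division by zero for near-constant advantages):

import torch

def normalize_advantages(advantages: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Center to mean 0 and rescale to (approximately) standard deviation 1.
    return (advantages - advantages.mean()) / (advantages.std() + eps)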


2.4 Proximal Policy Optimization


One might notice that the REINFORCE algorithm above (with or without a baseline function) is an on-policy
algorithm; that is, we collect some number of trajectories under the current policy network parameters, use
that data to perform a single batched policy gradient update, and then proceed to discard that data and
repeat the same steps using the newly updated policy parameters. This is in stark contrast to an algorithm
like DQN which stores all experiences collected over several past episodes. One might imagine that it could be
useful to have a policy gradient algorithm “squeeze” a little more information out of each batch of trajectories
sampled from the environment. Unfortunately, while the Q-learning update immediately allows for this, our
derived REINFORCE estimator does not in its standard form.
Ideally, an off-policy policy gradient algorithm will allow us to do multiple parameter updates on the same
batch of trajectory data. To get a suitable objective function that allows for this, we need to correct for
the mismatch between the policy under which the data was collected and the policy being optimized with
that data. Proximal Policy Optimization (PPO) restricts the magnitude of each update to the policy (i.e.,
through gradient descent) by ensuring the ratio of the current and former policies on the current batch is
not too different. In doing so, PPO tries to prevent updates that are “too large” due to the off-policy data,
which may lead to performance degradation. This technique is related to the idea of importance sampling
which we will examine in detail later in the course. Consider the following ratio zθ , which measures the
probability ratio between a current policy πθ (the “actor”) and an old policy πθold :

z_θ(s^i_t, a^i_t) = π_θ(a^i_t | s^i_t) / π_{θ_old}(a^i_t | s^i_t)
To do so, we introduce the clipped PPO loss function, shown below, where clip(x, a, b) outputs x if a ≤ x ≤ b,
a if x < a, and b if x > b:

J_clip(θ) = (1 / Σ_i T_i) Σ_{i=1}^{|D|} Σ_{t=1}^{T_i} min( z_θ(s^i_t, a^i_t) Â^i_t , clip(z_θ(s^i_t, a^i_t), 1 − ϵ, 1 + ϵ) Â^i_t )

where Â^i_t = G^i_t − V_ϕ(s^i_t). Note that in this context, we will refer to V_ϕ(s^i_t) as a “critic”; we will train this
like the baseline network described above.
To train the policy, we collect data in the environment using π_{θ_old} and apply gradient ascent on J_clip(θ) for
each update. After every K updates to the parameters [θ, ϕ], we update the old policy π_{θ_old} to equal π_θ.
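
For illustration, the clipped objective can be computed from cached old log-probabilities, since z_θ = exp(log π_θ − log π_{θ_old}). A minimal PyTorch sketch follows (a hypothetical helper, not the ppo.py interface, with ϵ = 0.2 as an assumed default):

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    # z_theta(s, a) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Advantages are treated as constants; negate so that minimizing the loss
    # performs gradient ascent on J_clip(theta).
    return -torch.min(ratio * advantages, clipped * advantages).mean()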

2.5 Coding Questions (50 pts)


The functions that you need to implement in network_utils.py, policy.py, policy_gradient.py,
and baseline_network.py are enumerated here. Detailed instructions for each function can be found in
the comments in each of these files.
Note: The “batch size” for all the arguments is Σ_i T_i since we already flattened out all the episode observa-
tions, actions, and rewards for you.
In network_utils.py, you need to implement:

• build_mlp

In policy.py, you need to implement:

• BasePolicy.act

• CategoricalPolicy.action_distribution

• GaussianPolicy.__init__

• GaussianPolicy.std


• GaussianPolicy.action_distribution

In policy_gradient.py, you need to implement:

• PolicyGradient.init_policy

• PolicyGradient.get_returns

• PolicyGradient.normalize_advantage

• PolicyGradient.update_policy

In baseline_network.py, you need to implement:

• BaselineNetwork.__init__

• BaselineNetwork.forward

• BaselineNetwork.calculate_advantage

• BaselineNetwork.update_baseline

In ppo.py, you need to implement:

• PPO.update_policy

2.6 Debugging
To help debug and verify that your implementation is correct, we provide a set of sanity checks below that
pass with a correct implementation. Note that these are not exhaustive (i.e., passing them does not guarantee
that your implementation is correct) and that you may notice oscillation of the average reward across training.
Across most seeds:

• Policy gradient (without baseline) on Pendulum should achieve around an average reward of 100 by
iteration 10.

• Policy gradient (with baseline) on Pendulum should achieve around an average reward of 700 by
iteration 20.

• PPO on Pendulum should achieve an average reward of 200 by iteration 20.

• All methods should reach an average reward of 200 on Cartpole, 1000 on Pendulum, and 200 on
Cheetah at some point.

2.7 Writeup Questions (26 pts)


(a) (3 pts) To compute the REINFORCE estimator, you will need to calculate the values {G_t}_{t=1}^T (we drop the
trajectory index i for simplicity), where

G_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}

Naively, computing all these values takes O(T²) time. Describe how to compute them in O(T) time.

(b) (3 pts) Consider the cases in which the gradient of the clipped PPO loss function equals 0. Express these cases
mathematically and explain why PPO behaves in this manner.


(c) (3 pts) Notice that the method which samples actions from the policy also returns the log-probability with
which the sampled action was taken. Why does REINFORCE not need to cache this information while
PPO does? Suppose this log-probability information had not been collected during the rollout. How
would that affect the implementation (that is, change the code you would write) of the PPO update?

(d) (12 pts) The general form for running your policy gradient implementation is as follows:

python main.py --env-name ENV --seed SEED --METHOD

ENV should be cartpole, pendulum, or cheetah, METHOD should be either baseline, no-baseline,
or ppo, and SEED should be a positive integer.
For the cartpole and pendulum environments, we will consider 3 seeds (seed = 1, 2, 3). For
cheetah, we will only require one seed (seed = 1) since it’s more computationally expensive, but
we strongly encourage you to run multiple seeds if you are able to. Run each of the algorithms we
implemented (PPO, PG with baseline, PG without baseline) across each seed and environment. In
total, you should end up with at least 21 runs.
Plot the results using:

python plot.py --env-name ENV --seeds SEEDS

where SEEDS should be a comma-separated list of seeds which you want to plot (e.g. --seeds
1,2,3). Please include the plots (one for each environment) in your writeup, and comment
on the performance of each method.
We have the following expectations about performance to receive full credit:

• cartpole: Should reach the max reward of 200 (although it may not stay there)
• pendulum: Should reach the max reward of 1000 (although it may not stay there)
• cheetah: Should reach at least 200 (could be as large as 900)


3 Distributions induced by a policy (13 pts)


Suppose we have a single MDP and two policies for that MDP, π and π′. Naturally, we are often interested
in the performance of these policies in the MDP, quantified by V^π and V^{π′}, respectively. If the reward
function and transition dynamics of the underlying MDP are known to us, we can use standard methods for
policy evaluation. There are many scenarios, however, where the underlying MDP model is not known and
we must try to infer something about the performance of policy π′ solely based on data obtained through
executing policy π within the environment. In this problem, we will explore a classic result for quantifying
the gap in performance between two policies that only requires access to data sampled from one of the
policies.

Consider an infinite-horizon MDP M = ⟨S, A, R, P, γ⟩ and stochastic policies of the form π : S → ∆(A)¹.
Specifically, π(a|s) refers to the probability of taking action a in state s, and Σ_a π(a|s) = 1, ∀s. For
simplicity, we’ll assume that this decision process has a single, fixed starting state s₀ ∈ S.

(a) (3 pts) (written) Consider a fixed stochastic policy and imagine running several rollouts of this policy within
the environment. Naturally, depending on the stochasticity of the MDP M and the policy itself,
some trajectories are more likely than others. Write down an expression for ρ^π(τ), the probability of
sampling a trajectory τ = (s₀, a₀, s₁, a₁, . . .) from running π in M. To put this distribution in context,
recall that V^π(s₀) = E_{τ∼ρ^π}[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s₀ ].

(b) (1 pt) (written) What is p^π(s_t = s), where p^π(s_t = s) denotes the probability of being in state s at timestep
t while following policy π? (Provide an equation.)

(c) (5 pts) (written) Just as ρ^π captures the distribution over trajectories induced by π, we can also examine the
distribution over states induced by π. In particular, define the discounted, stationary state distribution
of a policy π as

d^π(s) = (1 − γ) Σ_{t=0}^∞ γ^t p^π(s_t = s),

where p^π(s_t = s) denotes the probability of being in state s at timestep t while following policy π; your
answer to the previous part should help you reason about how you might compute this value.
The value function of a policy π can be expressed using this distribution d^π(s, a) = d^π(s) π(a | s) over
states and actions, which will shortly be quite useful.
Consider an arbitrary function f : S × A → R. Prove the following identity:

E_{τ∼ρ^π}[ Σ_{t=0}^∞ γ^t f(s_t, a_t) ] = (1 / (1 − γ)) E_{s∼d^π}[ E_{a∼π(s)}[ f(s, a) ] ].

Hint: You may find it helpful to first consider how things work out for f(s, a) = 1, ∀(s, a) ∈ S × A.

(d) (5 pts) (written) For any policy π, we define the following function:

A^π(s, a) = Q^π(s, a) − V^π(s).

A^π(s, a) is known as the advantage function and shows up in a lot of policy gradient based RL al-
gorithms, which we shall see later in the class. Intuitively, it is the additional benefit one gets from
first taking action a and then following π, instead of always following π. Prove that the following
statement holds for all policies π, π′:

V^π(s₀) − V^{π′}(s₀) = (1 / (1 − γ)) E_{s∼d^π}[ E_{a∼π(s)}[ A^{π′}(s, a) ] ].

¹ For a finite set X, ∆(X) refers to the set of categorical distributions with support on X or, equivalently, the
probability simplex ∆^{|X|−1}.



Hint 1: Try adding and subtracting a term that will let you bring A^{π′}(s, a) into the equation. What
happens on adding and subtracting Σ_{t=0}^∞ γ^{t+1} V^{π′}(s_{t+1}) on the LHS?

Hint 2: Recall the tower property of expectation, which says that E[X] = E[E[X | Y]].

After proving this result, you might already begin to appreciate why this represents a useful theoretical
contribution. We are often interested in being able to control the gap between two value functions and this
result provides a new mechanism for doing exactly that, when the value functions in question belong to two
particular policies of the MDP.
Additionally, to see how this result is of practical importance as well, suppose the data-generating policy in
the above identity π is some current policy we have in hand and π ′ represents some next policy we would
like to optimize for; concretely, this scenario happens quite often when π is a neural network and π ′ denotes
the same network with updated parameters. As is often the case with function approximation, there are
sources of instability and, sometimes, even small parameter updates can lead to drastic changes in policy
performance, potentially degrading (instead of improving) the performance of the current policy π. These
realities of deep learning motivate a desire to occasionally be conservative in our updates and attempt to
reach a new policy π ′ that provides only a modest improvement over π. Practical approaches can leverage
the above identity to strike the right balance between making progress and maintaining stability.


4 Ethical concerns with Policy Gradients (5 pts)


In this assignment, we focus on policy gradients, an extremely popular and useful model-free technique for
RL. However, policy gradients collect data from the environment with a potentially suboptimal policy during
the learning process. While this is acceptable in simulators like Mujoco or Atari, such exploration in real
world settings such as healthcare and education presents challenges.
Consider a case study of a Stanford CS course that is considering introducing an RL-based chatbot for office hours.
For each assignment, some students will be given 100% human CA office hours; others 100% chatbot; others
a mix of both. The reward signal is the student grades on each assignment. Since the AI chatbot will learn
through experience, at any given point in the quarter, the help given by the chatbot might be better or
worse than the help given by a randomly selected human CA.
If students are randomly assigned to a condition each time, some students will be assigned more chatbot
hours and others fewer. In addition, some students will be assigned more chatbot hours at the beginning of
the term (when the chatbot has had fewer interactions and may have lower effectiveness) and fewer at the
end, and vice versa. All students will be graded according to the same standards, regardless of which type
of help they have received.
Researchers who experiment on human subjects are morally responsible for ensuring their well-being and
protecting them from being harmed by the study. A foundational document in research ethics, the Belmont
Report, identifies three core principles of responsible research:

1. Respect for persons: individuals are capable of making choices about their own lives on the basis of
their personal goals. Research participants should be informed about the study they are considering
undergoing, asked for their consent, and not coerced into giving it. Individuals who are less capable of
giving informed consent, such as young children, should be protected in other ways.

2. Beneficence: the principle of beneficence describes an obligation to ensure the well-being of subjects.
It has been summarized as “do not harm” or “maximize possible benefits and minimize possible harms.”

3. Justice: the principle of justice requires treating all people equally and distributing benefits and harms
to them equitably.

(a) (4 pts) In 4-6 sentences, describe two experimental design or research choices that researchers planning
the above experiment ought to make in order to respect these principles. Justify the importance of
these choices using one of the three ethical principles above and indicating which principle you have
chosen. For example, “Researchers ought to ensure that students advised by the chatbot are able to
revise their assignments after submission with the benefit of human advice if needed. If they did not
take this precaution, the principle of justice would be violated because the risk of harm from poor
advice from the AI chatbot would be distributed unevenly.”

At universities, research experiments that involve human subjects are subject by federal law to Institutional
Review Board (IRB) approval. The purpose of IRB is to protect human subjects of research: to “assure,
both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of
humans participating as subjects in the research” (reference). The IRB process was established in response
to abuses of human subjects in the name of medical research performed during WWII (reference). The
IRB is primarily intended to address the responsibilities of the researcher towards the subjects. Familiarize
yourself with Stanford’s IRB Research Compliance process at this link.

(b) (1 pt) If you were conducting the above experiment, what process would you need to follow at Stanford
(who would you email/ where would you upload a research protocol) to get clearance?
