CS 234: Assignment #2
These questions require thought but do not require long answers. Please be as concise as possible.
We encourage students to discuss in groups for assignments. However, each student must finish
the problem set and programming assignment individually, and must turn in her/his
assignment. We ask that you abide by the university Honor Code and that of the Computer Science
department, and make sure that all of your submitted work is done by yourself. If you have discussed
the problems with others, please include a statement saying who you discussed problems with. Failure
to follow these instructions will be reported to the Office of Community Standards. We reserve the
right to run fraud-detection software on your code.
In this pseudocode:
• θ are the weights of the Q-network, which are adjusted during training.
• θ− are the weights of the target network, which are periodically updated to match θ.
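For concreteness, here is a minimal PyTorch sketch of where θ and θ⁻ enter the DQN update. It is illustrative only: the networks, layer sizes, and names such as q_net, target_net, and sync_target are assumptions, not the assignment's starter code.

import torch
import torch.nn as nn

# q_net holds theta; target_net holds the periodically copied theta^-.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # start with theta^- = theta

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    # The Bellman target is computed with the frozen weights theta^-.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    return nn.functional.mse_loss(q_sa, target)

def sync_target():
    # Called every C gradient steps: theta^- <- theta.
    target_net.load_state_dict(q_net.state_dict())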
(b) (2 pts) (written) When using DQN with a deep neural network, which of the above components would you
hypothesize contributes most to performance gains? Justify your answer.
(c) (3 pts) (written) In DQN, the choice of target network update frequency is important. What might happen
if the target network is updated every $10^{15}$ steps for an agent learning to play a simple Atari game like
Pong?
2.1 REINFORCE
Recall the policy gradient theorem,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right].$$
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$
as unbiased estimates of $Q^{\pi_\theta}(s, a)$. The REINFORCE estimator can be expressed as the gradient of the
following objective function:
$$J(\theta) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log \pi_\theta(a^i_t \mid s^i_t)\, G^i_t$$
where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau^i = (s^i_0, a^i_0, r^i_0, s^i_1, \ldots, s^i_{T_i}, a^i_{T_i}, r^i_{T_i})$ is trajectory $i$.
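As a reference point, a minimal PyTorch sketch of this estimator follows; it is not the starter-code interface (the flattened log_probs and returns tensors and the function name are assumptions).

import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs[k] = log pi_theta(a^i_t | s^i_t) and returns[k] = G^i_t, flattened
    # over every timestep of every trajectory in D, so .mean() divides by sum_i T_i.
    # We return -J(theta) so that a gradient-descent optimizer performs ascent on J.
    return -(log_probs * returns).mean()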
2.2 Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled returns $G_t$
can have high variance. To reduce variance, we subtract a baseline $b_\phi(s)$ from the estimated returns when
computing the policy gradient. A good baseline is the state value function, $V^{\pi_\theta}(s)$, which requires a training
update to $\phi$ to minimize the following mean-squared error loss:
$$L_{\mathrm{MSE}}(\phi) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \left(b_\phi(s^i_t) - G^i_t\right)^2$$
With the baseline in place, the policy gradient objective becomes
$$J(\theta) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log \pi_\theta(a^i_t \mid s^i_t)\, \hat{A}^i_t$$
where $\hat{A}^i_t = G^i_t - b_\phi(s^i_t)$ is the advantage estimate.
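For illustration, a hedged PyTorch sketch of such a baseline update follows; the network architecture, optimizer settings, and names (baseline_net, update_baseline, advantages) are assumptions rather than the starter-code API.

import torch
import torch.nn as nn

baseline_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(baseline_net.parameters(), lr=1e-3)

def update_baseline(states: torch.Tensor, returns: torch.Tensor) -> None:
    # One gradient step on L_MSE(phi): mean of (b_phi(s^i_t) - G^i_t)^2 over all timesteps.
    values = baseline_net(states).squeeze(-1)
    loss = nn.functional.mse_loss(values, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def advantages(states: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # A_hat^i_t = G^i_t - b_phi(s^i_t); no_grad keeps policy gradients out of the baseline.
    with torch.no_grad():
        return returns - baseline_net(states).squeeze(-1)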
PPO limits how far an updated policy $\pi_\theta$ can move from the old policy $\pi_{\theta_{\mathrm{old}}}$ that collected the data, as measured by the probability ratio
$$z_\theta(s^i_t, a^i_t) = \frac{\pi_\theta(a^i_t \mid s^i_t)}{\pi_{\theta_{\mathrm{old}}}(a^i_t \mid s^i_t)}.$$
To do so, we introduce the clipped PPO loss function, shown below, where clip(x, a, b) outputs x if a ≤ x ≤ b,
a if x < a, and b if x > b:
$$J_{\mathrm{clip}}(\theta) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \min\!\left(z_\theta(s^i_t, a^i_t)\,\hat{A}^i_t,\; \mathrm{clip}\!\left(z_\theta(s^i_t, a^i_t),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^i_t\right)$$
where $\hat{A}^i_t = G^i_t - V_\phi(s^i_t)$. Note that in this context, we will refer to $V_\phi(s^i_t)$ as a “critic”; we will train this
like the baseline network described above.
To train the policy, we collect data in the environment using $\pi_{\theta_{\mathrm{old}}}$ and apply gradient ascent on $J_{\mathrm{clip}}(\theta)$ for
each update. After every $K$ updates to the parameters $[\theta, \phi]$, we update the old policy $\pi_{\theta_{\mathrm{old}}}$ to equal $\pi_\theta$.
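A minimal PyTorch sketch of this clipped loss follows, assuming the log-probabilities under $\pi_{\theta_{\mathrm{old}}}$ were cached at rollout time; the function name and tensor layout are illustrative assumptions, not the starter-code API.

import torch

def ppo_clip_loss(log_probs: torch.Tensor,      # log pi_theta(a|s) under the current policy
                  old_log_probs: torch.Tensor,  # log pi_theta_old(a|s), cached during the rollout
                  advantages: torch.Tensor,     # A_hat^i_t = G^i_t - V_phi(s^i_t)
                  eps: float = 0.2) -> torch.Tensor:
    # Returns -J_clip(theta) so a standard optimizer can minimize it.
    z = torch.exp(log_probs - old_log_probs)            # ratio z_theta(s, a)
    unclipped = z * advantages
    clipped = torch.clamp(z, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()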
You will implement the following functions and methods:
• build_mlp
• BasePolicy.act
• CategoricalPolicy.action_distribution
• GaussianPolicy.__init__
• GaussianPolicy.std
• GaussianPolicy.action_distribution
• PolicyGradient.init_policy
• PolicyGradient.get_returns
• PolicyGradient.normalize_advantage
• PolicyGradient.update_policy
• BaselineNetwork.__init__
• BaselineNetwork.forward
• BaselineNetwork.calculate_advantage
• BaselineNetwork.update_baseline
• PPO.update_policy
2.6 Debugging
To help debug and verify that your implementation is correct, we provide a set of sanity checks below that
pass with a correct implementation. Note that these are not exhaustive (i.e., passing them does not guarantee
that your implementation is correct), and you may notice oscillation of the average reward across training.
Across most seeds:
• Policy gradient (without baseline) on Pendulum should achieve around an average reward of 100 by
iteration 10.
• Policy gradient (with baseline) on Pendulum should achieve around an average reward of 700 by
iteration 20.
• All methods should reach an average reward of 200 on Cartpole, 1000 on Pendulum, and 200 on
Cheetah at some point.
Naively, computing all these values takes $O(T^2)$ time. Describe how to compute them in $O(T)$ time.
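(As a reference for the intended shape of such a computation: assuming the values in question are per-timestep discounted returns $G_t$, one standard $O(T)$ pattern is a backward recursion, sketched below in Python; the function name is illustrative.)

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed from the last timestep backwards,
    # so every G_t is obtained in a single O(T) pass.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns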
(b) (3 pts) Consider the cases in which the gradient of the clipped PPO loss function equals 0. Express these cases
mathematically and explain why PPO behaves in this manner.
(c) (3 pts) Notice that the method which samples actions from the policy also returns the log-probability with
which the sampled action was taken. Why does REINFORCE not need to cache this information while
PPO does? Suppose this log-probability information had not been collected during the rollout. How
would that affect the implementation (that is, change the code you would write) of the PPO update?
(d) (12 pts) The general form for running your policy gradient implementation is as follows:
ENV should be cartpole, pendulum, or cheetah, METHOD should be either baseline, no-baseline,
or ppo, and SEED should be a positive integer.
For the cartpole and pendulum environments, we will consider 3 seeds (seed = 1, 2, 3). For
cheetah, we will only require one seed (seed = 1) since it’s more computationally expensive, but
we strongly encourage you to run multiple seeds if you are able to. Run each of the algorithms we
implemented (PPO, PG with baseline, PG without baseline) across each seed and environment. In
total, you should end up with at least 21 runs.
Plot the results using:
where SEEDS should be a comma-separated list of seeds which you want to plot (e.g. --seeds
1,2,3). Please include the plots (one for each environment) in your writeup, and comment
on the performance of each method.
We have the following expectations about performance to receive full credit:
• cartpole: Should reach the max reward of 200 (although it may not stay there)
• pendulum: Should reach the max reward of 1000 (although it may not stay there)
• cheetah: Should reach at least 200 (could be as large as 900)
(a) (3 pts) (written) Consider a fixed stochastic policy and imagine running several rollouts of this policy within
the environment. Naturally, depending on the stochasticity of the MDP M and the policy itself,
some trajectories are more likely than others. Write down an expression for $\rho^\pi(\tau)$, the probability of
sampling a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ from running $\pi$ in $\mathcal{M}$. To put this distribution in context,
recall that
$$V^\pi(s_0) = \mathbb{E}_{\tau \sim \rho^\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0\right].$$
(b) (1 pt) (written) What is $p^\pi(s_t = s)$, where $p^\pi(s_t = s)$ denotes the probability of being in state $s$ at timestep
$t$ while following policy $\pi$?
(c) (5 pts) (written) Just as ρπ captures the distribution over trajectories induced by π, we can also examine the
distribution over states induced by π. In particular, define the discounted, stationary state distribution
of a policy π as
$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, p^\pi(s_t = s),$$
where $p^\pi(s_t = s)$ denotes the probability of being in state $s$ at timestep $t$ while following policy $\pi$; your
answer to the previous part should help you reason about how you might compute this value.
The value function of a policy $\pi$ can be expressed using this distribution $d^\pi(s, a) = d^\pi(s)\,\pi(a \mid s)$ over
states and actions, which will shortly be quite useful.
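To build intuition for $d^\pi$ (this is commentary, not part of the assignment), here is a small Python sketch of how one could approximately sample a state from $d^\pi$; reset, step, and policy are assumed callables for a continuing environment.

import random

def sample_state_from_d_pi(reset, step, policy, gamma):
    # Continue the rollout with probability gamma at every step, so timestep t
    # is selected with probability (1 - gamma) * gamma^t, matching d^pi.
    s = reset()
    while random.random() < gamma:
        s = step(s, policy(s))  # one environment transition under pi
    return s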
Consider an arbitrary function $f : S \times A \to \mathbb{R}$. Prove the following identity:
$$\mathbb{E}_{\tau \sim \rho^\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}\!\left[\mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right]\right].$$
Hint: You may find it helpful to first consider how things work out for f (s, a) = 1, ∀(s, a) ∈ S × A.
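(For instance, carrying out the hint for the constant function $f(s, a) = 1$ gives a quick consistency check, not a proof:
$$\mathbb{E}_{\tau \sim \rho^\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \cdot 1\right] = \sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma},
\qquad
\frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}\!\left[\mathbb{E}_{a \sim \pi(s)}[1]\right] = \frac{1}{1-\gamma},$$
so both sides agree; the $(1-\gamma)$ factor in $d^\pi$ is exactly the normalization that makes it a probability distribution.)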
(d) (5 pts) (written) For any policy $\pi$, we define the following function: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
$A^\pi(s, a)$ is known as the advantage function and shows up in many policy-gradient-based RL algorithms,
which we shall see later in the class. Intuitively, it is the additional benefit one gets from first taking
action $a$ and then following $\pi$, instead of always following $\pi$. Prove that the following
statement holds for all policies π, π ′ :
$$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}\!\left[\mathbb{E}_{a \sim \pi(s)}\left[A^{\pi'}(s, a)\right]\right].$$
¹ For a finite set $\mathcal{X}$, $\Delta(\mathcal{X})$ refers to the set of categorical distributions with support on $\mathcal{X}$ or, equivalently, the $\Delta^{|\mathcal{X}|-1}$
probability simplex.
Hint 1: Try adding and subtracting a term that will let you bring $A^{\pi'}(s, a)$ into the equation. What
happens on adding and subtracting $\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi'}(s_{t+1})$ on the LHS?
Hint 2: Recall the tower property of expectation, which says that $\mathbb{E}[X] = \mathbb{E}\!\left[\mathbb{E}[X \mid Y]\right]$.
After proving this result, you might already begin to appreciate why this represents a useful theoretical
contribution. We are often interested in being able to control the gap between two value functions and this
result provides a new mechanism for doing exactly that, when the value functions in question belong to two
particular policies of the MDP.
To see how this result is also of practical importance, suppose the data-generating policy $\pi$ in
the above identity is some current policy we have in hand and $\pi'$ represents some next policy we would
like to optimize for; concretely, this scenario happens quite often when π is a neural network and π ′ denotes
the same network with updated parameters. As is often the case with function approximation, there are
sources of instability and, sometimes, even small parameter updates can lead to drastic changes in policy
performance, potentially degrading (instead of improving) the performance of the current policy π. These
realities of deep learning motivate a desire to occasionally be conservative in our updates and attempt to
reach a new policy π ′ that provides only a modest improvement over π. Practical approaches can leverage
the above identity to strike the right balance between making progress and maintaining stability.
1. Respect for persons: individuals are capable of making choices about their own lives on the basis of
their personal goals. Research participants should be informed about the study they are considering
undergoing, asked for their consent, and not coerced into giving it. Individuals who are less capable of
giving informed consent, such as young children, should be protected in other ways.
2. Beneficence: the principle of beneficence describes an obligation to ensure the well-being of subjects.
It has been summarized as “do no harm” or “maximize possible benefits and minimize possible harms.”
3. Justice: the principle of justice requires treating all people equally and distributing benefits and harms
to them equitably.
(a) (4 pts) In 4-6 sentences, describe two experimental design or research choices that researchers planning
the above experiment ought to make in order to respect these principles. Justify the importance of
these choices using one of the three ethical principles above, indicating which principle you have
chosen. For example, “Researchers ought to ensure that students advised by the chatbot are able to
revise their assignments after submission with the benefit of human advice if needed. If they did not
take this precaution, the principle of justice would be violated because the risk of harm from poor
advice from the AI chatbot would be distributed unevenly.”
At universities, research experiments that involve human subjects are subject by federal law to Institutional
Review Board (IRB) approval. The purpose of the IRB is to protect human subjects of research: to “assure,
both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of
humans participating as subjects in the research” (reference). The IRB process was established in response
to abuses of human subjects in the name of medical research performed during WWII (reference). The
IRB is primarily intended to address the responsibilities of the researcher towards the subjects. Familiarize
yourself with Stanford’s IRB Research Compliance process at this link.
(b) (1 pt) If you were conducting the above experiment, what process would you need to follow at Stanford
(who would you email/ where would you upload a research protocol) to get clearance?