CS 234: Assignment #2
These questions require thought but do not require long answers. Please be as concise as possible.
We encourage students to discuss in groups for assignments. However, each student must finish
the problem set and programming assignment individually, and must turn in her/his
assignment. We ask that you abide by the university Honor Code and that of the Computer Science
department, and make sure that all of your submitted work is done by yourself. If you have discussed
the problems with others, please include a statement saying who you discussed problems with. Failure
to follow these instructions will be reported to the Office of Community Standards. We reserve the
right to run fraud-detection software on your code.
In this pseudocode:
• θ are the weights of the Q-network, which are adjusted during training.
• θ− are the weights of the target network, which are periodically updated to match θ.
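For concreteness, here is a minimal PyTorch sketch of where θ and θ⁻ enter the DQN update. It is illustrative only: the networks, layer sizes, and names such as q_net, target_net, and sync_target are assumptions, not the assignment's starter code.

import torch
import torch.nn as nn

# q_net holds theta; target_net holds the periodically copied theta^-.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net.load_state_dict(q_net.state_dict())  # start with theta^- = theta

def dqn_loss(s, a, r, s_next, done, gamma=0.99):
    # The Bellman target is computed with the frozen weights theta^-.
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)  # Q(s, a; theta)
    return nn.functional.mse_loss(q_sa, target)

def sync_target():
    # Called every C gradient steps: theta^- <- theta.
    target_net.load_state_dict(q_net.state_dict())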
(b) (2 pts) (written) When using DQN with a deep neural network, which of the above components would you
hypothesize contributes most to performance gains? Justify your answer.
(c) (3 pts) (written) In DQN, the choice of target network update frequency is important. What might happen
if the target network is updated every $10^{15}$ steps for an agent learning to play a simple Atari game like
Pong?
2.1 REINFORCE
Recall the policy gradient theorem,
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right].$$
REINFORCE is a Monte Carlo policy gradient algorithm, so we will be using the sampled returns $G_t$
as unbiased estimates of $Q^{\pi_\theta}(s, a)$. The REINFORCE estimator can be expressed as the gradient of the
following objective function:
$$J(\theta) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log \pi_\theta(a^i_t \mid s^i_t)\, G^i_t$$
where $D$ is the set of all trajectories collected by policy $\pi_\theta$, and $\tau^i = (s^i_0, a^i_0, r^i_0, s^i_1, \ldots, s^i_{T_i}, a^i_{T_i}, r^i_{T_i})$ is trajectory $i$.
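As a reference point, a minimal PyTorch sketch of this estimator follows; it is not the starter-code interface (the flattened log_probs and returns tensors and the function name are assumptions).

import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # log_probs[k] = log pi_theta(a^i_t | s^i_t) and returns[k] = G^i_t, flattened
    # over every timestep of every trajectory in D, so .mean() divides by sum_i T_i.
    # We return -J(theta) so that a gradient-descent optimizer performs ascent on J.
    return -(log_probs * returns).mean()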
2.2 Baseline
One difficulty of training with the REINFORCE algorithm is that the Monte Carlo sampled returns $G_t$
can have high variance. To reduce variance, we subtract a baseline $b_\phi(s)$ from the estimated returns when
computing the policy gradient. A good baseline is the state value function, $V^{\pi_\theta}(s)$, which requires a training
update to $\phi$ to minimize the following mean-squared error loss:
$$L_{\mathrm{MSE}}(\phi) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \left(b_\phi(s^i_t) - G^i_t\right)^2$$
With the baseline in place, the policy gradient objective becomes
$$J(\theta) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \log \pi_\theta(a^i_t \mid s^i_t)\, \hat{A}^i_t$$
where $\hat{A}^i_t = G^i_t - b_\phi(s^i_t)$ is the advantage estimate.
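For illustration, a hedged PyTorch sketch of such a baseline update follows; the network architecture, optimizer settings, and names (baseline_net, update_baseline, advantages) are assumptions rather than the starter-code API.

import torch
import torch.nn as nn

baseline_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(baseline_net.parameters(), lr=1e-3)

def update_baseline(states: torch.Tensor, returns: torch.Tensor) -> None:
    # One gradient step on L_MSE(phi): mean of (b_phi(s^i_t) - G^i_t)^2 over all timesteps.
    values = baseline_net(states).squeeze(-1)
    loss = nn.functional.mse_loss(values, returns)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def advantages(states: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    # A_hat^i_t = G^i_t - b_phi(s^i_t); no_grad keeps policy gradients out of the baseline.
    with torch.no_grad():
        return returns - baseline_net(states).squeeze(-1)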
PPO limits how far an updated policy $\pi_\theta$ can move from the old policy $\pi_{\theta_{\mathrm{old}}}$ that collected the data, as measured by the probability ratio
$$z_\theta(s^i_t, a^i_t) = \frac{\pi_\theta(a^i_t \mid s^i_t)}{\pi_{\theta_{\mathrm{old}}}(a^i_t \mid s^i_t)}.$$
To do so, we introduce the clipped PPO loss function, shown below, where clip(x, a, b) outputs x if a ≤ x ≤ b,
a if x < a, and b if x > b:
$$J_{\mathrm{clip}}(\theta) = \frac{1}{\sum_i T_i} \sum_{i=1}^{|D|} \sum_{t=1}^{T_i} \min\!\left(z_\theta(s^i_t, a^i_t)\,\hat{A}^i_t,\; \mathrm{clip}\!\left(z_\theta(s^i_t, a^i_t),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}^i_t\right)$$
where $\hat{A}^i_t = G^i_t - V_\phi(s^i_t)$. Note that in this context, we will refer to $V_\phi(s^i_t)$ as a “critic”; we will train this
like the baseline network described above.
To train the policy, we collect data in the environment using $\pi_{\theta_{\mathrm{old}}}$ and apply gradient ascent on $J_{\mathrm{clip}}(\theta)$ for
each update. After every $K$ updates to the parameters $[\theta, \phi]$, we update the old policy $\pi_{\theta_{\mathrm{old}}}$ to equal $\pi_\theta$.
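A minimal PyTorch sketch of this clipped loss follows, assuming the log-probabilities under $\pi_{\theta_{\mathrm{old}}}$ were cached at rollout time; the function name and tensor layout are illustrative assumptions, not the starter-code API.

import torch

def ppo_clip_loss(log_probs: torch.Tensor,      # log pi_theta(a|s) under the current policy
                  old_log_probs: torch.Tensor,  # log pi_theta_old(a|s), cached during the rollout
                  advantages: torch.Tensor,     # A_hat^i_t = G^i_t - V_phi(s^i_t)
                  eps: float = 0.2) -> torch.Tensor:
    # Returns -J_clip(theta) so a standard optimizer can minimize it.
    z = torch.exp(log_probs - old_log_probs)            # ratio z_theta(s, a)
    unclipped = z * advantages
    clipped = torch.clamp(z, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()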
You will implement the following functions and methods:
• build_mlp
• BasePolicy.act
• CategoricalPolicy.action_distribution
• GaussianPolicy.__init__
• GaussianPolicy.std
• GaussianPolicy.action_distribution
• PolicyGradient.init_policy
• PolicyGradient.get_returns
• PolicyGradient.normalize_advantage
• PolicyGradient.update_policy
• BaselineNetwork.__init__
• BaselineNetwork.forward
• BaselineNetwork.calculate_advantage
• BaselineNetwork.update_baseline
• PPO.update_policy
2.6 Debugging
To help debug and verify that your implementation is correct, we provide a set of sanity checks below that
pass with a correct implementation. Note that these are not exhaustive (i.e., passing them does not guarantee
that your implementation is correct), and you may notice oscillation of the average reward across training.
Across most seeds:
• Policy gradient (without baseline) on Pendulum should achieve around an average reward of 100 by
iteration 10.
• Policy gradient (with baseline) on Pendulum should achieve around an average reward of 700 by
iteration 20.
• All methods should reach an average reward of 200 on Cartpole, 1000 on Pendulum, and 200 on
Cheetah at some point.
Naively, computing all these values takes $O(T^2)$ time. Describe how to compute them in $O(T)$ time.
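(As a reference for the intended shape of such a computation: assuming the values in question are per-timestep discounted returns $G_t$, one standard $O(T)$ pattern is a backward recursion, sketched below in Python; the function name is illustrative.)

def discounted_returns(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed from the last timestep backwards,
    # so every G_t is obtained in a single O(T) pass.
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns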
(b) (3 pts) Consider the cases in which the gradient of the clipped PPO loss function equals 0. Express these cases
mathematically and explain why PPO behaves in this manner.
(c) (3 pts) Notice that the method which samples actions from the policy also returns the log-probability with
which the sampled action was taken. Why does REINFORCE not need to cache this information while
PPO does? Suppose this log-probability information had not been collected during the rollout. How
would that affect the implementation (that is, change the code you would write) of the PPO update?
(d) (12 pts) The general form for running your policy gradient implementation is as follows:
ENV should be cartpole, pendulum, or cheetah, METHOD should be either baseline, no-baseline,
or ppo, and SEED should be a positive integer.
For the cartpole and pendulum environments, we will consider 3 seeds (seed = 1, 2, 3). For
cheetah, we will only require one seed (seed = 1) since it’s more computationally expensive, but
we strongly encourage you to run multiple seeds if you are able to. Run each of the algorithms we
implemented (PPO, PG with baseline, PG without baseline) across each seed and environment. In
total, you should end up with at least 21 runs.
Plot the results using:
where SEEDS should be a comma-separated list of seeds which you want to plot (e.g. --seeds
1,2,3). Please include the plots (one for each environment) in your writeup, and comment
on the performance of each method.
We have the following expectations about performance to receive full credit:
• cartpole: Should reach the max reward of 200 (although it may not stay there)
• pendulum: Should reach the max reward of 1000 (although it may not stay there)
• cheetah: Should reach at least 200 (could be as large as 900)
(a) (3 pts) (written) Consider a fixed stochastic policy and imagine running several rollouts of this policy within
the environment. Naturally, depending on the stochasticity of the MDP M and the policy itself,
some trajectories are more likely than others. Write down an expression for $\rho^\pi(\tau)$, the probability of
sampling a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ from running $\pi$ in $\mathcal{M}$. To put this distribution in context,
recall that
$$V^\pi(s_0) = \mathbb{E}_{\tau \sim \rho^\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\middle|\, s_0\right].$$
(b) (1 pt) (written) What is $p^\pi(s_t = s)$, where $p^\pi(s_t = s)$ denotes the probability of being in state $s$ at timestep
$t$ while following policy $\pi$?
(c) (5 pts) (written) Just as ρπ captures the distribution over trajectories induced by π, we can also examine the
distribution over states induced by π. In particular, define the discounted, stationary state distribution
of a policy π as
$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t\, p^\pi(s_t = s),$$
where $p^\pi(s_t = s)$ denotes the probability of being in state $s$ at timestep $t$ while following policy $\pi$; your
answer to the previous part should help you reason about how you might compute this value.
The value function of a policy $\pi$ can be expressed using this distribution $d^\pi(s, a) = d^\pi(s)\,\pi(a \mid s)$ over
states and actions, which will shortly be quite useful.
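To build intuition for $d^\pi$ (this is commentary, not part of the assignment), here is a small Python sketch of how one could approximately sample a state from $d^\pi$; reset, step, and policy are assumed callables for a continuing environment.

import random

def sample_state_from_d_pi(reset, step, policy, gamma):
    # Continue the rollout with probability gamma at every step, so timestep t
    # is selected with probability (1 - gamma) * gamma^t, matching d^pi.
    s = reset()
    while random.random() < gamma:
        s = step(s, policy(s))  # one environment transition under pi
    return s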
Consider an arbitrary function $f : S \times A \to \mathbb{R}$. Prove the following identity:
$$\mathbb{E}_{\tau \sim \rho^\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t f(s_t, a_t)\right] = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}\!\left[\mathbb{E}_{a \sim \pi(s)}\left[f(s, a)\right]\right].$$
Hint: You may find it helpful to first consider how things work out for f (s, a) = 1, ∀(s, a) ∈ S × A.
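(For instance, carrying out the hint for the constant function $f(s, a) = 1$ gives a quick consistency check, not a proof:
$$\mathbb{E}_{\tau \sim \rho^\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \cdot 1\right] = \sum_{t=0}^{\infty} \gamma^t = \frac{1}{1-\gamma},
\qquad
\frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}\!\left[\mathbb{E}_{a \sim \pi(s)}[1]\right] = \frac{1}{1-\gamma},$$
so both sides agree; the $(1-\gamma)$ factor in $d^\pi$ is exactly the normalization that makes it a probability distribution.)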
(d) (5 pts) (written) For any policy $\pi$, we define the following function: $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
$A^\pi(s, a)$ is known as the advantage function and shows up in many policy-gradient-based RL algorithms,
which we shall see later in the class. Intuitively, it is the additional benefit one gets from first taking
action $a$ and then following $\pi$, instead of always following $\pi$. Prove that the following
statement holds for all policies π, π ′ :
$$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim d^\pi}\!\left[\mathbb{E}_{a \sim \pi(s)}\left[A^{\pi'}(s, a)\right]\right].$$
¹ For a finite set $\mathcal{X}$, $\Delta(\mathcal{X})$ refers to the set of categorical distributions with support on $\mathcal{X}$ or, equivalently, the $\Delta^{|\mathcal{X}|-1}$
probability simplex.
Hint 1: Try adding and subtracting a term that will let you bring $A^{\pi'}(s, a)$ into the equation. What
happens on adding and subtracting $\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi'}(s_{t+1})$ on the LHS?
Hint 2: Recall the tower property of expectation, which says that $\mathbb{E}[X] = \mathbb{E}\!\left[\mathbb{E}[X \mid Y]\right]$.
After proving this result, you might already begin to appreciate why this represents a useful theoretical
contribution. We are often interested in being able to control the gap between two value functions and this
result provides a new mechanism for doing exactly that, when the value functions in question belong to two
particular policies of the MDP.
To see how this result is also of practical importance, suppose the data-generating policy $\pi$ in
the above identity is some current policy we have in hand and $\pi'$ represents some next policy we would
like to optimize for; concretely, this scenario happens quite often when π is a neural network and π ′ denotes
the same network with updated parameters. As is often the case with function approximation, there are
sources of instability and, sometimes, even small parameter updates can lead to drastic changes in policy
performance, potentially degrading (instead of improving) the performance of the current policy π. These
realities of deep learning motivate a desire to occasionally be conservative in our updates and attempt to
reach a new policy π ′ that provides only a modest improvement over π. Practical approaches can leverage
the above identity to strike the right balance between making progress and maintaining stability.
1. Respect for persons: individuals are capable of making choices about their own lives on the basis of
their personal goals. Research participants should be informed about the study they are considering
undergoing, asked for their consent, and not coerced into giving it. Individuals who are less capable of
giving informed consent, such as young children, should be protected in other ways.
2. Beneficence: the principle of beneficence describes an obligation to ensure the well-being of subjects.
It has been summarized as “do no harm” or “maximize possible benefits and minimize possible harms.”
3. Justice: the principle of justice requires treating all people equally and distributing benefits and harms
to them equitably.
(a) (4 pts) In 4-6 sentences, describe two experimental design or research choices that researchers planning
the above experiment ought to make in order to respect these principles. Justify the importance of
these choices using one of the three ethical principles above, indicating which principle you have
chosen. For example, “Researchers ought to ensure that students advised by the chatbot are able to
revise their assignments after submission with the benefit of human advice if needed. If they did not
take this precaution, the principle of justice would be violated because the risk of harm from poor
advice from the AI chatbot would be distributed unevenly.”
At universities, research experiments that involve human subjects are subject by federal law to Institutional
Review Board (IRB) approval. The purpose of the IRB is to protect human subjects of research: to “assure,
both in advance and by periodic review, that appropriate steps are taken to protect the rights and welfare of
humans participating as subjects in the research” (reference). The IRB process was established in response
to abuses of human subjects in the name of medical research performed during WWII (reference). The
IRB is primarily intended to address the responsibilities of the researcher towards the subjects. Familiarize
yourself with Stanford’s IRB Research Compliance process at this link.
(b) (1 pt) If you were conducting the above experiment, what process would you need to follow at Stanford
(who would you email/ where would you upload a research protocol) to get clearance?