CS230 Midterm Fall 2022
Section                                   Points
Multiple Choice                               14
Short Answer                                  30
Feed-Forward Neural Network                   15
Backpropagation                               19
Discrete Functions in Neural Networks         11
Debugging Code                                18
Total                                         88
• This exam is open book, but collaboration with anyone else, either in person or online,
is strictly forbidden pursuant to The Stanford Honor Code.
• In all cases, and especially if you’re stuck or unsure of your answers, explain your
work, including showing your calculations and derivations! We’ll give partial
credit for good explanations of what you were trying to do.
Name:
SUNETID: @stanford.edu
Signature:
Multiple Choice (14 points)
For each of the following questions, circle the letter of your choice. Each question has AT
LEAST one correct option unless explicitly mentioned. No explanation is required.
(a) (2 points) Imagine you are tasked with building a model to diagnose COVID-19 using
chest CT images. You are provided with 100,000 chest CT images, 1,000 of which are
labeled. Which learning technique has the best chance of succeeding on this task?
(SELECT ONLY ONE)
(i) Transfer Learning from a ResNet50 that was pre-trained on chest CT images to
detect tumors
(ii) Train a GAN to generate synthetic labeled data and train your model on all the
ground truth and synthetic data
(iii) Supervised Learning directly on the 1,000 labeled images
(iv) Augment the labeled data using random cropping and train using supervised
learning
(b) (2 points) Imagine you are tasked with training a lane detection system that can
distinguish between two different types of lanes: lanes in the same direction as the car
moves and lanes in the opposite direction. Assume all the images are taken on common
two-way streets in California from a car's front-view camera. Which of the following
data augmentation techniques can be used for your task?
(c) (2 points) You are training a binary classifier and are dissatisfied with the F1-score
as a metric for combining precision and recall into a single number. You are considering
alternatives to the F1-score. Which of the following would be reasonable candidate
metric(s)?
(d) (2 points) Dropout can be considered a form of ensembling over variants of a neural
network. Consider a neural network with N nodes, each of which can be dropped
during training independently with a probability 0 < p < 1. What is the total number
of unique models that can be realized by applying dropout? (A mask-sampling sketch
follows the options below.)
(i) ⌊N × p⌋
(ii) (⌊N × p⌋)^N
(iii) 2^⌊N × p⌋
(iv) 2^N
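
As a reading aid (not part of the original exam), a minimal sketch of how a dropout
mask over N nodes is sampled; every distinct mask realizes one sub-network of the
full model. All names and values here are illustrative.

    import numpy as np

    N, p = 5, 0.3                                  # 5 nodes, drop probability 0.3
    mask = (np.random.rand(N) >= p).astype(float)  # 1 = node kept, 0 = node dropped
    print(mask)                                    # e.g. [1. 0. 1. 1. 1.]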
(e) (2 points) In practice when using Early Stopping, one needs to set a “buffer” hy-
perparameter, which determines the number of epochs model training continues when
no improvement in validation performance is observed before training is terminated.
After training is terminated, the model with the best validation performance is used.
What is the benefit of setting the buffer parameter to a value of k = 5 epochs instead
of 0?
(f) (2 points) Suppose that you are training a deep neural network and observe that
the training curve contains a lot of oscillations, especially at early stages of training.
Which of the following techniques can help stabilize training?
(g) (2 points) You have a 2-layer MLP with Sigmoid activations in the hidden layers
that you want to train with SGD. Your network weights are initialized from N(10, 1).
From the very first epoch, you observe that some weights in the first layer are not
getting updated or are updated very slowly compared to the second layer. Which of
the following can fix this issue?
Short Answer (30 points)

The questions in this section can be answered in fewer than three sentences each.
Please be concise in your responses.
(a) Imagine that you are building an app to optimize wait times in US emergency rooms
while prioritizing severe cases. You build a deep learning-based app that works as
follows:

[Figure: the app's workflow]

You trained and tested your model using 3 months' worth of data from hospitals in the
US, before deploying it to several hospitals in the San Francisco Bay Area.
(i) (2 points) Now you want to deploy your app internationally. Do you think your
app will work well? Why or why not?
You noticed that the app tends to rank African American and Hispanic patients
lower than patients from other ethnic backgrounds, even if those patients came
into the emergency department with more severe cases.
(ii) (1 point) Why is this a problem?
(iii) (2 points) What may have caused this problem?
Hint: Think about how the model was trained and the input data that was provided.
(iv) (2 points) How can we fix this problem?
(b) Graph Neural Networks (GNNs) are a family of neural networks that can operate on
graph-structured data. Here, we describe a basic 2-layer GNN. Consider a graph with
k nodes labeled {1, 2, ..., k}. For simplicity, assume that each node i is associated
with a scalar input x_i. The first layer of our GNN, parameterized by scalar parameters
w^[1] and b^[1], performs the following operation to compute a_i^[1] at each node i:

    a_i^[1] = ReLU( x_i + w^[1] ∑_{n ∈ N(i)} x_n + b^[1] )        (1)

where N(i) is the set of neighbors of node i in the graph (i.e., all nodes that share an
edge with node i). The second layer, parameterized by scalar parameters w^[2] and b^[2],
analogously computes a_i^[2] for each node i:

    a_i^[2] = ReLU( a_i^[1] + w^[2] ∑_{n ∈ N(i)} a_n^[1] + b^[2] )        (2)
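
As a reading aid (not part of the original exam), a minimal sketch of this forward
pass, assuming the graph is stored as an adjacency list; gnn_forward and all names
below are illustrative, not from the exam.

    def gnn_forward(x, neighbors, w1, b1, w2, b2):
        """x: dict node -> scalar input; neighbors: dict node -> list of neighbors."""
        relu = lambda v: max(v, 0.0)
        # Equation (1): a_i^[1] = ReLU(x_i + w1 * sum of neighboring x_n + b1)
        a1 = {i: relu(x[i] + w1 * sum(x[n] for n in neighbors[i]) + b1) for i in x}
        # Equation (2): a_i^[2] = ReLU(a_i^[1] + w2 * sum of neighboring a_n^[1] + b2)
        a2 = {i: relu(a1[i] + w2 * sum(a1[n] for n in neighbors[i]) + b2) for i in x}
        return a2

For the exam's graph one would fill in neighbors from the figure, e.g.
neighbors = {1: [2, 3], ...} (the edges here are a placeholder, since the figure
did not survive extraction).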
Answer the following questions for the graph in the figure below, with labels as shown
in the nodes.

[Figure: graph with nodes labeled 1 through 6]

(i) (2 points) What is ∂a_1^[2]/∂x_6?
(ii) (2 points) You are allowed to add one additional node (suppose this is node 7)
and accompanying edges such that the value of ∂a_1^[2]/∂x_6 changes from the value
computed in part (i). Describe how you would do this with the fewest number of
edges accompanying node 7.
(c) Consider the graph in the figure below representing the training procedure of a GAN.
The figure shows the cost function of the generator plotted against the output of the
discriminator when given a generated image G(z). Concerning the discriminator's
output, we consider that 0 means the discriminator thinks the input "has been
generated by G", whereas 1 means the discriminator thinks the input "comes from the
real data".

[Figure 1: two candidate generator cost functions plotted against D(G(z))]
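
The figure itself did not survive extraction. For orientation only, the two generator
costs most commonly contrasted in this setting, offered as an assumption about what
Figure 1 plotted rather than a statement of it, are the minimax cost log(1 − D(G(z)))
and the non-saturating cost −log D(G(z)):

    import numpy as np

    d = np.linspace(0.01, 0.99, 99)   # possible discriminator outputs D(G(z))
    minimax_cost = np.log(1.0 - d)    # saturating: flat (small gradient) when d is near 0
    non_saturating_cost = -np.log(d)  # steep (large gradient) when d is near 0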
(i) (2 points) After one round of training the generator and discriminator, is the
value of D(G(z)) closer to 0 or closer to 1? Explain.
(ii) (2 points) Two cost functions are presented in Figure 1 above. Which one would
you choose to train your GAN? Justify your answer.
(iii) (2 points) True or false. Your GAN is finished training when D(G(z)) is close
to 1. Please explain your answer for full credit.
(d) We would like to train a self-supervised generative model that can learn encodings z of
a given input image X by reconstructing the same input image as X̂. For our example,
let's say our input images are MNIST digits. Consider the architecture shown below:
[Figure: encoder/decoder architecture x → q(z | x) → z → p(x | z) → x̂,
where z is the latent-space representation]
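
As a reading aid (not part of the exam), a minimal deterministic sketch of the
pictured pipeline with a 2-D latent z; the affine encoder/decoder maps and all
shapes below are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(2, 784))   # encoder map q: 784-pixel image -> 2-D latent
    W_dec = rng.normal(size=(784, 2))   # decoder map p: 2-D latent -> reconstruction

    x = rng.normal(size=784)            # stand-in for a flattened 28x28 MNIST digit
    z = W_enc @ x                       # latent-space representation
    x_hat = W_dec @ z                   # reconstructed image
    recon_loss = np.mean((x - x_hat) ** 2)  # one common reconstruction objective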
(i) (3 points) In plain English, intuitively, explain what each loss function is trying
to optimize.
(ii) (3 points) Say we choose the dimension of z to be 2 so we can plot the z’s on a
graph. Consider the three graphs below where each of the two axes is a dimension
of z. The different colours indicate different MNIST digits as indicated by the
legend. The plots are numbered left to right as (1), (2) and (3).
Match each graph to Alice, Bob and Carol (draw lines connecting the two columns
if you printed the midterm) and explain your reasoning for each.
Alice (1)
Bob (2)
Carol (3)
[Figure 3: Plotted graphs for the different loss functions, shown in the 2-D latent
space. Plots are numbered left to right as (1), (2) and (3); axes span roughly
−8 to 8 in (1) and −4 to 4 in (2) and (3).]
Backpropagation (19 points)

Consider the following neural network with arbitrary dimensions (i.e., x is not necessarily
5-dimensional, etc.):

[Figure: network diagram]

where σ is the sigmoid activation function, ⊙ is the operator for element-wise products,
and y is a k-dimensional vector of 1's and 0's. Note that y_i represents the i-th element of
vector y, and likewise for ŷ_i.
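
The network diagram did not survive extraction. Judging from the quantities named in
parts (i)-(ix) below (z^[1], h, z^[2], ŷ, and the ⊙ operator), one plausible reading,
offered purely as an assumption, is a 2-layer sigmoid network trained with an
element-wise binary cross-entropy loss; all shapes below are illustrative.

    import numpy as np

    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)                         # input (dimension is arbitrary)
    W1, b1 = rng.normal(size=(4, 5)), np.zeros(4)  # first-layer parameters
    W2, b2 = rng.normal(size=(3, 4)), np.zeros(3)  # second-layer parameters
    y = np.array([1.0, 0.0, 1.0])                  # k-dimensional vector of 1's and 0's

    z1 = W1 @ x + b1     # z^[1]
    h = sigmoid(z1)      # hidden activation
    z2 = W2 @ h + b2     # z^[2]
    y_hat = sigmoid(z2)  # y-hat
    # Binary cross-entropy, written with element-wise products:
    L = -np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))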
(i) (3 points) What is ∂L/∂ŷ_i? You must write the most reduced form to get full credit.

(ii) (2 points) What is ∂L/∂ŷ? Refer to this result as ∂ŷ. Please write your answer
according to the shape convention, i.e., your result should be the same shape as ŷ.

(iii) (2 points) What is ∂L/∂z^[2]? Refer to this result as ∂z^[2]. To receive full credit,
your answer must include ∂ŷ and your answer must be in the most reduced form.

(iv) (2 points) What is ∂L/∂W^[2]? Please refer to this result as ∂W^[2]. Please include
∂z^[2] in your answer.

(v) (2 points) What is ∂L/∂b^[2]? Please refer to this result as ∂b^[2]. Please include
∂z^[2] in your answer.
(vi) (2 points) What is ∂L/∂h? Please refer to this result as ∂h. Please include ∂z^[2]
in your answer.

(vii) (2 points) What is ∂L/∂z^[1]? Refer to this result as ∂z^[1]. Please include ∂h in
your answer.

(viii) (2 points) What is ∂L/∂W^[1]? Please refer to this result as ∂W^[1]. Please include
∂z^[1] in your answer.

(ix) (2 points) What is ∂L/∂b^[1]? Please refer to this result as ∂b^[1]. Please include
∂z^[1] in your answer.
Discrete Functions in Neural Networks (11 points)
In this problem, we will explore training neural networks with discrete functions. Consider
a neural network encoder z = softmax[fθ (X)]. You can think of fθ as an MLP for this
example. z is the softmax output and we want to discretize this output into a one-hot
representation before we pass it into the next layer. Consider the operation one_hot where
one_hot(z) returns a one-hot vector where the 1 is at the argmax location. For example,
one_hot([0.1, 0.5, 0.4]) = [0, 1, 0]. Say we want to pass this output to another FC layer gϕ to
get a final output y.
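
A minimal sketch of the one_hot operation as defined above (the implementation is
ours; the exam only specifies its behavior):

    import numpy as np

    def one_hot(z):
        out = np.zeros_like(z, dtype=float)  # same length as z, all zeros
        out[np.argmax(z)] = 1.0              # 1 at the argmax location
        return out

    print(one_hot(np.array([0.1, 0.5, 0.4])))  # -> [0. 1. 0.]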
(i) (1 point) Is there a problem with the neural network defined below?

    y = gϕ(one_hot(softmax(fθ(X))))
(ii) (2 points) Consider a temperature-scaled softmax S_τ(z) = softmax(z/τ), whose i-th
element is S_τ(z)_i = exp(z_i/τ) / ∑_j exp(z_j/τ). Here dividing by τ means every element
in the vector is divided by τ. Obviously, when τ = 1, this is exactly the same as the
regular softmax function. What happens when τ → ∞? What happens when τ → 0?

Hint: You don't need to prove these limits; just showing a trend and justifying it is good
enough.
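
A quick numerical probe of these limits (a study aid, not part of the exam;
softmax_tau is our name):

    import numpy as np

    def softmax_tau(z, tau):
        z = np.asarray(z, dtype=float) / tau
        e = np.exp(z - z.max())       # subtract the max for numerical stability
        return e / e.sum()

    z = [1.0, 2.0, 0.5]
    print(softmax_tau(z, 1.0))    # tau = 1: the ordinary softmax
    print(softmax_tau(z, 100.0))  # tau -> infinity: entries flatten toward uniform
    print(softmax_tau(z, 0.01))   # tau -> 0: mass concentrates on the argmax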
(iii) (4 points) Assume f(X) = w⊤X where w is a weight vector. What is the derivative
of S_τ(f(X))_i with respect to w for a fixed τ? In other words, what is ∂S_τ(w⊤X)_i/∂w,
the derivative of the i-th element of S_τ(w⊤X) with respect to w? You must write your
answer in the most reduced form to receive full credit.
(iv) (2 points) How can we use this modified softmax function S_τ to get discrete vectors
in our neural networks? Perhaps we cannot get perfect one-hot vectors, but can we get
close?
(v) (2 points) What problems could arise by setting τ to very low values?
Debugging Code (18 points)
Consider the pseudocode below for an MLP model that performs regression. The model takes
an input of dimension 10, has a hidden layer of size 20 with ReLU activations, and outputs
a real number. There are biases in both layers. Weights are initialized from a random
normal distribution and biases to 0.

Point out the errors in the code, referencing line numbers, and suggest fixes for them.
Your fixes should suggest code changes and not just English descriptions. Functions and
classes that are not implemented completely can be assumed to be correctly written and
to have no errors in them.
 1  import numpy as np
 2
 3  def mse_loss(predictions, targets):
 4      """
 5      Returns the Mean Squared Error Loss given the
 6      predictions and targets
 7
 8      Args:
 9          predictions (np.ndarray): Model predictions
10          targets (np.ndarray): True outputs
11
12      Returns:
13          Mean squared error loss between predictions and targets
14      """
15      return 0.5 * \
16          (predictions.reshape(-1) - targets.reshape(-1))**2
17
18
19  def dropout(x, p=0.1):
20      """
21      Applies dropout on the input x with a drop
22      probability of p
23
24      Args:
25          x (np.ndarray): 2D array input
26          p (float): dropout probability
27
28      Returns:
29          Array with values dropped out
30      """
31      ind = np.random.choice(x.shape[1]*x.shape[0], replace=False,
32                             size=int(x.shape[1]*x.shape[0]*p))
33      x[np.unravel_index(indices, x.shape)] = 0
34      return x / p
35
36
37  def get_grads(loss, w1, b1, w2, b2):
38      """
39      This function takes the loss and returns the gradients
40      for the weights and biases
41      YOU MAY ASSUME THIS FUNCTION HAS NO ERRORS
42      """
43      ...
44      return dw1, db1, dw2, db2
45
46  def sample_batches(data, batchsize):
47      """
48      This function samples batches of size `batchsize`
49      from the training data.
50      YOU MAY ASSUME THIS FUNCTION HAS NO ERRORS
51      """
52      ...
53      return x, y
54
55  class Adam:
56      """
57      The class for the Adam optimizer that
58      accepts the parameters and updates them.
59      YOU MAY ASSUME THIS CLASS AND ITS METHODS HAVE
60      NO ERRORS
61      """
62      def __init__(self, w1, b1, w2, b2):
63          ...
64
65      def update(self):
66          """
67          Updates the params according to the
68          Adam update rule
69          """
70          ...
71
72  class MLP:
73      """
74      MLP Model to perform regression
75      """
76      def __init__(self):
77          super().__init__()
78          self.w1 = np.random.randn(10, 20)
79          self.b1 = np.zeros(10)
END OF PAPER