SS 2020
Chair of Visual Computing & Artificial Intelligence
Department of Informatics
Technical University of Munich
Endterm
Working instructions
• This exam consists of 20 pages with a total of 8 problems.
Please make sure now that you received a complete copy of the exam.
• If you need additional space for a question, use the additional pages in the back and properly note that
you are using additional space in the question’s solution box.
Problem 1 Multiple Choice Questions: (18 credits)
• For all multiple choice questions any number of answers, i.e. either zero (!), one, all or multiple answers
can be correct.
• For each question, you’ll receive 2 points if all boxes are answered correctly (i.e. correct answers are
checked, wrong answers are not checked) and 0 otherwise.
• If you change your mind again, please place a cross to the left side of the box (interpreted as checked).
a) Which of the following statements regarding successful ImageNet-classification architectures are correct?
VGG16 uses skip connections.
ResNet18 has 11 million parameters more than VGG16.
AlexNet uses filters of different kernel sizes.
b) You train a neural network and the train loss diverges. What are reasonable things to do? (check all that apply)
Decrease the learning rate.
Add dropout.
c) What is the correct order of operations for an optimization with gradient descent?
(a) Update the network weights to minimize the loss.
(b) Calculate the difference between the predicted and target value.
(c) Iteratively repeat the procedure until convergence.
(d) Compute a forward pass.
(e) Initialize the neural network weights.
bcdea
ebadc
eadbc
edbac
d) Consider a simple convolutional neural network with a single convolutional layer. Which of the following
statements is true about this network?
It is rotation invariant.
It is translation equivariant.
It is scale-invariant.
e) Which of the following activation functions can lead to vanishing gradients?
Tanh.
ReLU.
Sigmoid.
Leaky ReLU.
g) A sigmoid layer
cannot be used during backpropagation.
maps surjectively to values in (-1, 1), i.e., hits all values in that interval.
Bad initialization.
i) Which of the following have trainable parameters? (check all that apply)
Leaky ReLU
Batch normalization
Dropout
Max pooling
Problem 2 Activation Functions and Weight Initialization (8 credits)
For your first job, you have to set up a neural network, but you have some issues with its weight initialization. You
remember from your I2DL lecture that you can sample the weights from a zero-centered normal distribution,
but you can’t remember which variance to use. Therefore, you set up a small network and try some numbers.
You initialize the weights one time with Var(w) = 0.02 and one time with Var(w) = 1.0:
[Figure: a small network that combines the inputs i1, i2, i3 with weights w1, w2, w3 in a linear layer followed by tanh]
Inputs:
• i1 = 2, i2 = −4, i3 = 1
Var(w) = 0.02:
a) Compute a forward pass for each set of weights and draw the results of the linear layer in the figure of the tanh plot. You don’t need to compute the tanh.
[Figure: plot of tanh(x) for x ∈ [−2, 2], y ∈ [−1, 1]]
b) Using the results above, explain what problems can arise during backpropagation in deep neural networks when initializing the weights with too small or too large a variance. Also, explain the root of these problems.
c) Which initialization scheme did you learn in the lecture that tackles these problems? What does this initialization try to achieve in the activations of deep layers of the neural network?
d) After switching from tanh to ReLU activation functions, one of your initial problems occurs again. Why does this happen? How can you modify the initialization scheme proposed in c) to adjust it for this new non-linearity?
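To make the experiment above concrete, here is a minimal NumPy sketch of the setup (the layer size, helper name, and seed are illustrative assumptions, not part of the exam):

    import numpy as np

    def linear_outputs(var_w, inputs, n_out=3, seed=0):
        """Sample weights from N(0, var_w) and return the linear layer's outputs."""
        rng = np.random.default_rng(seed)
        w = rng.normal(0.0, np.sqrt(var_w), size=(n_out, inputs.shape[0]))
        return w @ inputs  # pre-activations that would be fed into tanh

    x = np.array([2.0, -4.0, 1.0])      # i1, i2, i3 from the problem statement
    print(linear_outputs(0.02, x))      # small variance: outputs cluster near 0
    print(linear_outputs(1.0, x))       # large variance: outputs reach tanh's saturated range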
Problem 3 Batch Normalization and Computation Graphs (6 credits)
For an input vector x as well as variables γ and β, the general formula of batch normalization is given by

x̂ = (x − E[x]) / √(Var[x]) ,
y = γ · x̂ + β .
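For reference, a minimal NumPy sketch of this formula applied over a batch (the ε term and all names are illustrative additions for numerical stability, not part of the exam statement):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the batch dimension, then scale and shift."""
        mean = x.mean(axis=0)                  # E[x] per feature
        var = x.var(axis=0)                    # Var[x] per feature
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    x = np.random.randn(8, 4)                  # batch of 8 samples, 4 features
    y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
    print(y.mean(axis=0), y.var(axis=0))       # approximately 0 and 1 per feature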
c) How is a batch normalization layer applied at training (1p) and at test (1p) time?
d) Computational graph of a batch normalization layer. Fill out the nodes (circles) of the following computational graph. Each node can consist of one of the following operations: +, −, ∗, (·)², √·, 1/(·).
Problem 4 Convolutional Neural Networks and Receptive Field (12 credits)
A friend of yours asked for a quick review of convolutional neural networks. As he has some background in
computer graphics, you start by explaining previous uses of convolutional layers.
a) You are given a two-dimensional input (e.g., a grayscale image). Consider the following convolutional kernels

C1 = (1/9) · [[1, 1, 1], [1, 1, 1], [1, 1, 1]] ,    C2 = [[−1, 2], [1, −1]] .

What are the effects of the filter kernels C1 and C2 when applied to the image?
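For intuition, a small sketch applying the averaging kernel with SciPy (the input image here is a made-up example; only C1 is taken from the problem, and C2 would be applied the same way):

    import numpy as np
    from scipy.signal import convolve2d

    C1 = np.ones((3, 3)) / 9.0          # the 3x3 averaging kernel from the problem
    image = np.random.rand(8, 8)        # made-up grayscale image for illustration

    smoothed = convolve2d(image, C1, mode="valid")
    print(image.std(), smoothed.std())  # the averaged output varies less than the input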
After showing him some results of a trained network, he immediately wants to use them and starts building a model in PyTorch. However, he is unsure about the layer sizes, so you quickly help him out.
b) Given a convolutional layer in a network with 5 filters, a filter size of 7, a stride of 3, and a padding of 1: for an input feature map of 26 × 26 × 26, what is the output dimensionality after applying the convolutional layer to the input?
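A quick way to sanity-check such output shapes in PyTorch (assuming the 26 channels are the depth dimension of the feature map; variable names are illustrative):

    import torch
    import torch.nn as nn

    # 26 input channels, 5 filters, kernel size 7, stride 3, padding 1, as in the question
    conv = nn.Conv2d(in_channels=26, out_channels=5, kernel_size=7, stride=3, padding=1)
    x = torch.randn(1, 26, 26, 26)      # (batch, channels, height, width)
    print(conv(x).shape)                # spatial size follows (H + 2*padding - kernel) // stride + 1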
c) You are given a convolutional layer with 4 filters, kernel size 5, stride 1, and no padding that operates on an RGB image.
1. What is the shape of its weight tensor?
2. Name all dimensions of your weight tensor.
Now that he knows how to combine convolutional layers, he wonders how deep his network should be. After some thinking, you illustrate the concept of the receptive field with the following two examples. For both questions, consider a grayscale 224 × 224 image as the network input.
d) A convolutional neural network consists of 3 consecutive 3 × 3 convolutional layers with stride 1 and no padding. How large is the receptive field of a feature in the last layer of this network?
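A generic helper for this kind of bookkeeping, sketched under the standard receptive-field recurrence (not part of the exam):

    def receptive_field(layers):
        """layers: list of (kernel_size, stride) tuples, first layer first."""
        rf, jump = 1, 1
        for kernel, stride in layers:
            rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) * current jump
            jump *= stride
        return rf

    print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # three 3x3 stride-1 convolutions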
e) Consider a network consisting of a single layer.
Blindly, he stacks 10 convolutional layers together to solve his task. However, the gradients seem to vanish and he can’t train the network. You remember from your lecture that ResNet blocks were designed for exactly this purpose.

[Figure: a block mapping an input x to an output H(x)]
f) Draw a ResNet block in the image above (1p) containing two linear layers, which you can represent by l1 and l2. For simplicity, you don’t need to draw any non-linearities. Why does such a block improve the vanishing gradient problem in deep neural networks (1p)?
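For reference, a minimal PyTorch sketch of such a residual block (the module and the dimension are illustrative assumptions):

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """H(x) = x + l2(l1(x)): a skip connection around two linear layers."""
        def __init__(self, dim):
            super().__init__()
            self.l1 = nn.Linear(dim, dim)
            self.l2 = nn.Linear(dim, dim)

        def forward(self, x):
            return x + self.l2(self.l1(x))  # the identity path lets gradients flow through unchanged

    block = ResidualBlock(16)
    print(block(torch.randn(4, 16)).shape)  # torch.Size([4, 16])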
g) For your above drawing, given the partial derivative of the residual block R(x) = l2(l1(x)) as ∂R(x)/∂x = r, calculate ∂H(x)/∂x.
Problem 5 Training a Neural Network (15 credits)
A team of architects approaches you for your deep learning expertise. They have collected nearly 5,000
hand-labeled RGB images and want to build a model to classify the buildings into their different architectural
styles. Now they want to classify images of architectures into 3 classes depending on their style:
a) How would you split your dataset? Give meaningful percentages as your answer.
b) After visually inspecting the different splits of the dataset, you realize that the training set only contains pictures taken during the day, whereas the validation set only has pictures taken at night. Explain what the issue is and how you would correct it.
c) As you train your model, you realize that you do not have enough data. Unfortunately, the architects are unable to collect more data, so you have to make the most of the data you have. Provide 4 data augmentation techniques that can be used to overcome the shortage of data.
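For illustration, a few common augmentations expressed with torchvision transforms (one possible selection, not an official answer key):

    from torchvision import transforms

    # Each transform yields a modified view of an image, effectively enlarging the training set.
    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(p=0.5),                      # mirror the building left/right
        transforms.RandomRotation(degrees=10),                       # small rotations
        transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),    # random crops and rescaling
        transforms.ColorJitter(brightness=0.3, contrast=0.3),        # lighting changes
    ])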
d) What is a saddle point, and what is the problem with GD at such a point?
e) While training your classifier, you experience that the loss converges only slowly and always plateaus, independent of the learning rate used. Now you want to use Stochastic Gradient Descent (SGD) instead of Gradient Descent (GD). What is an advantage of SGD compared to GD in dealing with saddle points?
i) There exists a whole zoo of different optimizers. Name an optimizer that uses both first- and second-order moments.
j)
1. Name a problem that will result from using a learning rate that is too high (1p).
2. Name a problem that will arise from using a learning rate that is too low (1p).
k) Finally, you plot the loss curves with a suitable learning rate for both the training data and the validation data. What is the issue in period 2 called? Name a possible action that you could take, without changing the number of parameters in your network, to counteract this problem.

[Figure: loss curves for the training set and the validation set, plotted over periods 1 and 2]
Problem 6 Recurrent Neural Networks and Backpropagation (9 credits)
Consider a vanilla RNN cell of the form ht = tanh(V · ht−1 + W · xt + b). The figure below shows the input sequence x1, x2, and x3.
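A minimal NumPy sketch of this cell (the dimensions below match subquestion a); the helper name and random values are illustrative):

    import numpy as np

    def rnn_step(h_prev, x, V, W, b):
        """One step of the vanilla RNN cell: h_t = tanh(V h_{t-1} + W x_t + b)."""
        return np.tanh(V @ h_prev + W @ x + b)

    V = np.random.randn(5, 5)   # hidden-to-hidden, since ht has 5 entries
    W = np.random.randn(5, 3)   # input-to-hidden, since xt has 3 entries
    b = np.zeros(5)
    h = np.zeros(5)
    for x in [np.random.randn(3) for _ in range(3)]:  # x1, x2, x3
        h = rnn_step(h, x, V, W, b)
    print(h.shape)  # (5,)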
a) Given the dimensions xt ∈ R^3 and ht ∈ R^5, what is the number of parameters in the RNN cell? (Calculate the final number.)
b) If xt and b are the 0 vector, then ht = ht−1 for any value of ht. Discuss whether this statement is correct.

V = −3, W = 3, h0 = 0, x1 = 2, x2 = 3 and x3 = 1.
d) Calculate the derivatives ∂h3/∂V, ∂h3/∂W, and ∂h3/∂x1 for the forward pass of the ReLU-RNN where

V = −2, W = 1, h0 = 2, x1 = 2, x2 = 3/2 and x3 = 4,

with the forward outputs

h1 = 0, h2 = 3/2, h3 = 1.

Use that (∂/∂x) ReLU(x) |x=0 = 0.
e) A Long Short-Term Memory (LSTM) unit is defined as

g1 = σ(W1 · xt + U1 · ht−1) ,
g2 = σ(W2 · xt + U2 · ht−1) ,
g3 = σ(W3 · xt + U3 · ht−1) ,
c̃t = tanh(Wc · xt + Uc · ht−1) ,
ct = g2 ◦ ct−1 + g3 ◦ c̃t ,
ht = g1 ◦ ct .
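A direct NumPy transcription of these equations, with σ as the logistic sigmoid (all shapes and helper names are assumptions for the demo):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, U, Wc, Uc):
        """One step of the LSTM unit defined above; W and U stack W1..W3 and U1..U3."""
        g1 = sigmoid(W[0] @ x_t + U[0] @ h_prev)
        g2 = sigmoid(W[1] @ x_t + U[1] @ h_prev)
        g3 = sigmoid(W[2] @ x_t + U[2] @ h_prev)
        c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev)
        c_t = g2 * c_prev + g3 * c_tilde   # '◦' is the element-wise product
        h_t = g1 * c_t
        return h_t, c_t

    d_in, d_h = 3, 4                       # made-up dimensions for the demo
    W = np.random.randn(3, d_h, d_in); U = np.random.randn(3, d_h, d_h)
    Wc = np.random.randn(d_h, d_in); Uc = np.random.randn(d_h, d_h)
    h, c = lstm_step(np.random.randn(d_in), np.zeros(d_h), np.zeros(d_h), W, U, Wc, Uc)
    print(h.shape, c.shape)                # (4,) (4,)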
Problem 7 Autoencoder and Network Transfer (11 credits)
You are given a dataset containing 10,000 RGB images with height H and width W of single coins without
any labels or additional information.
To work with the image dataset you build an autoencoder as depicted in the figure below:
[Figure: an autoencoder; the encoder maps the H × W × 3 input image to a latent code z, and the decoder maps z back to an H × W × 3 output]
The input of the encoder is an image of dimension (H × W × 3), which is transformed into a one-dimensional real vector with z entries. The latent code is used to decode the input image with the same dimension (H × W × 3). Both encoder and decoder are neural networks; the combined network is trainable and uses the L2 loss as its optimization objective.
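A compact PyTorch sketch of such an autoencoder trained with an L2 (MSE) reconstruction loss (the fully-connected layout and all sizes are illustrative assumptions, not the exam's architecture):

    import torch
    import torch.nn as nn

    class Autoencoder(nn.Module):
        def __init__(self, h, w, z):
            super().__init__()
            self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(h * w * 3, z), nn.ReLU())
            self.decoder = nn.Sequential(nn.Linear(z, h * w * 3), nn.Unflatten(1, (3, h, w)))

        def forward(self, x):
            return self.decoder(self.encoder(x))

    model = Autoencoder(h=32, w=32, z=64)
    x = torch.randn(8, 3, 32, 32)          # a batch of images
    loss = nn.MSELoss()(model(x), x)       # L2 reconstruction loss
    loss.backward()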
b) As the data gets scaled down from the original dimension to a lower-dimensional bottleneck, an autoencoder can be used for data compression. How does an autoencoder as described above differ from linear methods to reduce the dimensionality of the data, such as PCA (principal component analysis)?
c) For an autoencoder we can vary the size of the bottleneck. Discuss briefly what may happen if the bottleneck is chosen too small or too large.
d) Now, you want to generate a random image of a coin. To do so, can you just randomly sample a vector from the latent space to generate a new coin image?
e) Now, someone gives you 1,000 images that are annotated for semantic segmentation of coin and background, as shown in the image above. How would you change the architecture of the discussed autoencoder network to perform semantic segmentation?
f) If you wanted to train the new semantic segmentation network, what loss function would you use and how?
g) How would you leverage your pretrained autoencoder for training a new segmentation network efficiently?
h) Why do you expect the pretrained autoencoder variant to generalize better than a randomly initialized network?
Problem 8 Unsorted Short Questions (11 credits)
b) You are solving the binary classification task of classifying images as cars vs. persons. You design a CNN with a single output neuron. Let the output of this neuron be z. The final output of your network, ŷ, is given by

ŷ = σ(ReLU(z)) ,

where σ denotes the sigmoid function. You classify all inputs with a final value ŷ ≥ 0.5 as car images. What problem are you going to encounter?
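A quick numeric check of this setup (purely illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def y_hat(z):
        return sigmoid(np.maximum(z, 0.0))  # sigma(ReLU(z)) as defined in the question

    for z in [-5.0, -0.1, 0.0, 0.1, 5.0]:
        print(z, y_hat(z))                  # note the smallest value y_hat can take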
c) Suggest a method to solve exploding gradients when training fully-connected neural networks.
d)
e) Why do we often refer to L2-regularization as “weight decay”? Derive the mathematical expression that includes the weights W, the learning rate η, and the L2-regularization hyperparameter λ to explain your point.
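As a numerical illustration of the effect (a sketch assuming plain gradient descent on a loss plus the penalty λ‖W‖²; the data-loss gradient is set to zero so only the regularizer acts):

    import numpy as np

    eta, lam = 0.1, 0.01
    W = np.array([1.0, -2.0, 3.0])
    grad_L = np.zeros_like(W)                 # pretend the data-loss gradient is zero
    # Gradient step on L(W) + lam * ||W||^2; the penalty contributes 2 * lam * W to the gradient
    W_new = W - eta * (grad_L + 2 * lam * W)
    print(W_new, (1 - 2 * eta * lam) * W)     # identical: the weights are scaled ("decayed") each step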
f) You are given input samples x = (x1, ..., xn) for which each component xj is drawn from a distribution with zero mean. For an input vector x, the output s = (s1, ..., sn) is given by

si = Σ_{j=1}^{n} wij · xj ,

where your weights w are initialized from a uniform random distribution U(−α, α).

How do you have to choose α such that the variance of the input data and the output is identical, hence Var(s) = Var(x)?
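To test a candidate α empirically (a sketch; it uses the standard fact that Var[U(−α, α)] = α²/3, and the sizes are illustrative):

    import numpy as np

    n = 512
    alpha = np.sqrt(3.0 / n)                  # candidate: Var(w) = alpha**2 / 3 = 1 / n
    rng = np.random.default_rng(0)

    x = rng.normal(0.0, 1.0, size=(n, 10000)) # zero-mean inputs, many samples per component
    w = rng.uniform(-alpha, alpha, size=(n, n))
    s = w @ x                                  # s_i = sum_j w_ij * x_j
    print(x.var(), s.var())                    # both close to 1 if alpha is chosen well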
Additional space for solutions. Clearly mark the (sub)problem your answers are related to and strike out invalid solutions.