
Chair of Visual Computing & Artificial Intelligence
Department of Informatics
Technical University of Munich

Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

[Place student sticker here]
Introduction to Deep Learning


Exam: IN2346 / Endterm Date: Tuesday 11th August, 2020
Examiner: Prof. Leal-Taixé and Prof. Nießner Time: 08:00 – 09:30

P1 P2 P3 P4 P5 P6 P7 P8


Left room from to

from to

Early submission at

Notes
Chair of Visual Computing & Artificial Intelligence
Department of Informatics
Technical University of Munich
Endterm

Introduction to Deep Learning

Prof. Leal-Taixé and Prof. Nießner


Chair of Visual Computing & Artificial Intelligence
Department of Informatics
Technical University of Munich

Tuesday 11th August, 2020


08:00 – 09:30

Working instructions
• This exam consists of 20 pages with a total of 8 problems.
Please make sure now that you received a complete copy of the exam.

• The total amount of achievable credits in this exam is 90 credits.


• Detaching pages from the exam is prohibited.
• Allowed resources: none

• Do not write with red or green colors nor use pencils.


• Physically turn off all electronic devices, put them into your bag and close the bag.

• If you need additional space for a question, use the additional pages in the back and properly note that
you are using additional space in the question’s solution box.

Problem 1 Multiple Choice Questions: (18 credits)

• For all multiple choice questions, any number of answers can be correct, i.e., zero (!), one, multiple, or all of them.
• For each question, you’ll receive 2 points if all boxes are answered correctly (i.e. correct answers are
checked, wrong answers are not checked) and 0 otherwise.

How to Check a Box:

• Please cross the respective box (interpreted as checked).

• If you change your mind, please fill the box (interpreted as not checked).

• If you change your mind again, please place a cross to the left side of the box (interpreted as checked).

a) Which of the following statements regarding successful ImageNet-classification architectures are correct?

VGG16 uses Skip Connections.

ResNet18 has 11 million parameters more than VGG16.

AlexNet uses filters of different kernel sizes.

InceptionV3 uses filters of different kernel sizes.

VGG16 only uses convolutional layers.

b) You train a neural network and the train loss diverges. What are reasonable things to do? (check all that apply)
Decrease the learning rate.

Add dropout.

Increase the learning rate.


Try a different optimizer.

c) What is the correct order of operations for an optimization with gradient descent?
(a) Update the network weights to minimize the loss.
(b) Calculate the difference between the predicted and target value.
(c) Iteratively repeat the procedure until convergence.
(d) Compute a forward pass.
(e) Initialize the neural network weights.

bcdea

ebadc

eadbc

edbac

d) Consider a simple convolutional neural network with a single convolutional layer. Which of the following
statements is true about this network?
It is rotation invariant.

It is translation equivariant.

All input nodes are connected to all output nodes.

It is scale-invariant.

e) Which of the following activation functions can lead to vanishing gradients?
Tanh.

ReLU.

Sigmoid.

Leaky ReLU.

f) Logistic regression (check all that apply).


Is a linear function.
Is a supervised learning algorithm.

Uses a type of cross-entropy loss.

Allows to perform binary classification.

g) A sigmoid layer
cannot be used during backpropagation.

has a learnable parameter.

maps surjectively to values in (-1, 1), i.e., hits all values in that interval.

is continuous and differentiable everywhere.

h) Your training loss does not decrease. What could be wrong?


Learning rate is too high.

Too much regularization.

Dropout probability not high enough.

Bad initialization.

i) Which of the following have trainable parameters? (check all that apply)
Leaky ReLU

Batch normalization

Dropout

Max pooling

Problem 2 Activation Functions and Weight Initialization (8 credits)
For your first job, you have to set up a neural network, but you run into some issues with its weight initialization. You remember from your I2DL lecture that you can sample the weights from a zero-centered normal distribution, but you can't remember which variance to use. Therefore, you set up a small network and try some numbers. You initialize the weights once with Var(w) = 0.02 and once with Var(w) = 1.0:

[Figure: three inputs i1, i2, i3 connected by weights w1, w2, w3 to a single neuron]
Inputs:

• i1 = 2, i2 = −4, i3 = 1
Var(w) = 0.02:

• w1 = 0.05, w2 = 0.025, w3 = −0.03


Var(w) = 1.0:
• w1 = 1.0, w2 = 0.5, w3 = 1.5

a) Compute a forward pass for each set of weights and draw the results of the linear layer in the figure of the tanh plot. You don't need to compute the tanh.

[Figure: plot of tanh(x) over x ∈ [−2, 2], with the y-axis ranging from −1 to 1]
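As a quick cross-check, here is a minimal sketch of the linear pre-activation for both weight settings (assuming the single neuron computes s = w1·i1 + w2·i2 + w3·i3 before the tanh):

```python
import numpy as np

# Inputs and the two weight settings from the problem statement
i = np.array([2.0, -4.0, 1.0])
w_small = np.array([0.05, 0.025, -0.03])   # sampled with Var(w) = 0.02
w_large = np.array([1.0, 0.5, 1.5])        # sampled with Var(w) = 1.0

for name, w in [("Var(w)=0.02", w_small), ("Var(w)=1.0", w_large)]:
    s = float(np.dot(w, i))                # output of the linear layer
    print(f"{name}: linear output s = {s}, tanh(s) = {np.tanh(s):.3f}")
```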

b) Using the results above, explain what problems can arise during backpropagation of deep neural networks when initializing the weights with too small and too large variance. Also, explain the root of these problems.

c) Which initialization scheme did you learn in the lecture that tackles these problems? What does this initialization try to achieve in the activations of deep layers of the neural network?

d) After switching from tanh to ReLU activation functions, one of your initial problems occurs again. Why does this happen? How can you modify the initialization scheme proposed in c) to adjust it for this new non-linearity?

Problem 3 Batch Normalization and Computation Graphs (6 credits)
For an input vector x as well as variables γ and β the general formula of batch normalization is given by

x̂ = (x − E[x]) / √(Var[x]),
y = γ · x̂ + β.
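As an illustration of the formula, a minimal sketch of the forward pass at training time (the small ε for numerical stability is an addition found in common implementations, not part of the formula above):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features); statistics are computed per feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to zero mean and unit variance
    return gamma * x_hat + beta               # scale and shift with the learnable parameters

x = np.random.randn(8, 4) * 3.0 + 2.0
y = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.var(axis=0))          # approximately 0 and 1 per feature
```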

a) Why would one want to apply batch normalization in a neural network?

b) Why are γ and β needed in the batch normalization formula?

c) How is a batch normalization layer applied at training (1p) and at test (1p) time?

d) Computational graph of a batch normalization layer. Fill out the nodes (circles) of the following computational graph. Each node can consist of one of the following operations: +, −, ∗, (·)², √(·), 1/(·).

[Figure: computational graph of the batch normalization layer with empty nodes to fill in]

Problem 4 Convolutional Neural Networks and Receptive Field (12 credits)
A friend of yours asked for a quick review of convolutional neural networks. As he has some background in
computer graphics, you start by explaining previous uses of convolutional layers.

a) You are given a two-dimensional input (e.g., a grayscale image). Consider the following convolutional kernels

C1 = (1/9) · [[1, 1, 1], [1, 1, 1], [1, 1, 1]],    C2 = [[−1, 2], [1, −1]].

What are the effects of the filter kernels C1 and C2 when applied to the image?

After showing him some results of a trained network, he immediately wants to use them and starts building a
model in PyTorch. However, he is unsure about the layer sizes, so you quickly help him out.

b) Given a convolution layer in a network with 5 filters, a filter size of 7, a stride of 3, and a padding of 1: for an input feature map of 26 × 26 × 26, what is the output dimensionality after applying the convolution layer to the input?
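A quick way to check such sizes is the formula out = ⌊(N + 2P − F) / S⌋ + 1 per spatial dimension; the sketch below verifies it with PyTorch (assuming the 26 channels of the feature map act as the input channels of the layer):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=26, out_channels=5, kernel_size=7, stride=3, padding=1)
x = torch.randn(1, 26, 26, 26)   # (batch, channels, height, width)
print(conv(x).shape)             # spatial size per dimension: (26 + 2*1 - 7) // 3 + 1 = 8
```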

c) You are given a convolutional layer with 4 filters, kernel size 5, stride 1, and no padding that operates on an RGB image.

1. What is the shape of its weight tensor?
2. Name all dimensions of your weight tensor.

Now that he knows how to combine convolutional layers, he wonders how deep his network should be. After some thinking, you illustrate the concept of the receptive field to him with the following two examples. For the following two questions, consider a grayscale 224×224 image as network input.

d) A convolutional neural network consists of 3 consecutive 3 × 3 convolutional layers with stride 1 and no padding. How large is the receptive field of a feature in the last layer of this network?
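For reference, a small helper that accumulates the receptive field layer by layer (a sketch using the usual recurrence: the field grows by (k − 1) times the current jump, and the jump is multiplied by the stride):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, applied in order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump   # each layer widens the field by (k - 1) steps in input space
        jump *= s              # distance between adjacent features, measured in input pixels
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))   # three 3x3 convolutions with stride 1
```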

e) Consider a network consisting of a single layer.

1. What layer choice has a receptive field of 1?

2. What layer has a receptive field of the full image input?

Blindly, he stacks 10 convolutional layers together to solve his task. However, the gradients seem to vanish and he can't manage to train the network. You remember from your lecture that ResNet blocks were designed for exactly this purpose.

[Figure: input x on the left, output H(x) on the right, with empty space in between to draw the block]

f) Draw a ResNet block in the image above (1p) containing two linear layers, which you can represent by l1 and l2. For simplicity, you don't need to draw any non-linearities. Why does such a block improve the vanishing gradient problem in deep neural networks (1p)?
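A minimal PyTorch sketch of such a block (assuming the standard residual form H(x) = x + l2(l1(x)) and omitting non-linearities, as the question allows):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.l1 = nn.Linear(dim, dim)
        self.l2 = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.l2(self.l1(x))   # skip connection: identity path plus residual

block = ResidualBlock(16)
print(block(torch.randn(2, 16)).shape)   # torch.Size([2, 16])
```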

g) For your drawing above, given the partial derivative of the residual block R(x) = l2(l1(x)) as ∂R(x)/∂x = r, calculate ∂H(x)/∂x.

Problem 5 Training a Neural Network (15 credits)
A team of architects approaches you for your deep learning expertise. They have collected nearly 5,000 hand-labeled RGB images and want to build a model that classifies the buildings into 3 classes depending on their architectural style:

[Example images: Islamic, Baroque, Soochow]

a) How would you split your dataset? Give a meaningful percentage as an answer.

b) After visually inspecting the different splits in the dataset, you realize that the training set only contains pictures taken during the day, whereas the validation set only has pictures taken at night. Explain what the issue is and how you would correct it.

c) As you train your model, you realize that you do not have enough data. Unfortunately, the architects are unable to collect more data, so you have to make do with the data you have. Provide 4 data augmentation techniques that can be used to overcome the shortage of data.
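For instance, a torchvision sketch with a few common image augmentations (the specific transforms and parameter values are illustrative choices, not the only valid answer):

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                      # mirror the building
    transforms.RandomRotation(degrees=10),                       # small random rotations
    transforms.ColorJitter(brightness=0.3, contrast=0.3),        # lighting variation
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),    # random crop / zoom
    transforms.ToTensor(),
])
```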

d) What is a saddle point, and what is the problem with GD?

e) While training your classifier you experience that the loss only converges slowly and always plateaus, independent of the used learning rate. Now you want to use Stochastic Gradient Descent (SGD) instead of Gradient Descent (GD). What is an advantage of SGD compared to GD in dealing with saddle points?

f) Explain the concept behind momentum in SGD.

g) Why would one want to use larger mini-batches in SGD?

h) Why do we usually use small mini-batches in practice?

i) There exists a whole zoo of different optimizers. Name an optimizer that uses both first- and second-order moments.

j) Choosing a reasonable learning rate is not easy.

1. Name a problem that will result from using a learning rate that is too high (1p).

2. Name a problem that will arise from using a learning rate that is too low (1p).

k) Finally, you plot the loss curves with a suitable learning rate for both training data and validation data. What is the issue in period 2 called? Name a possible action that you could take, without changing the number of parameters in your network, to counteract this problem.

[Figure: loss over training for the training set and the validation set, with periods 1 and 2 marked on the x-axis]

Problem 6 Recurrent Neural Networks and Backpropagation (9 credits)
Consider a vanilla RNN cell of the form h_t = tanh(V · h_{t−1} + W · x_t + b). The figure below shows the input sequence x_1, x_2, and x_3.

[Figure: the RNN unrolled over the inputs x_1, x_2, and x_3]
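A minimal sketch of this cell (assuming x_t ∈ R³ and h_t ∈ R⁵ as in part a), so that V is 5×5, W is 5×3, and b has 5 entries):

```python
import numpy as np

hidden_dim, input_dim = 5, 3
V = np.random.randn(hidden_dim, hidden_dim)   # hidden-to-hidden weights: 25 parameters
W = np.random.randn(hidden_dim, input_dim)    # input-to-hidden weights: 15 parameters
b = np.zeros(hidden_dim)                      # bias: 5 parameters

def rnn_step(h_prev, x):
    return np.tanh(V @ h_prev + W @ x + b)

h = np.zeros(hidden_dim)
for x in [np.random.randn(input_dim) for _ in range(3)]:   # x_1, x_2, x_3
    h = rnn_step(h, x)
print(V.size + W.size + b.size)                            # total parameter count
```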

a) Given the dimensions x_t ∈ R³ and h_t ∈ R⁵, what is the number of parameters in the RNN cell? (Calculate the final number.)

b) If x_t and b are the 0 vector, then h_t = h_{t−1} for any value of h_{t−1}. Discuss whether this statement is correct.

Now consider the following one-dimensional ReLU-RNN cell without bias b:

h_t = ReLU(V · h_{t−1} + W · x_t)

(Hidden state, input, and weights are scalars.)

c) Calculate h_2 and h_3, where

V = −3, W = 3, h_0 = 0, x_1 = 2, x_2 = 3, and x_3 = 1.

d) Calculate the derivatives ∂h_3/∂V, ∂h_3/∂W, and ∂h_3/∂x_1 for the forward pass of the ReLU-RNN, where

V = −2, W = 1, h_0 = 2, x_1 = 2, x_2 = 3/2, and x_3 = 4,

for the forward outputs

h_1 = 0, h_2 = 3/2, h_3 = 1.

Use that (∂/∂x) ReLU(x) |_{x=0} = 0.

e) A Long Short-Term Memory (LSTM) unit is defined as

g_1 = σ(W_1 · x_t + U_1 · h_{t−1}),
g_2 = σ(W_2 · x_t + U_2 · h_{t−1}),
g_3 = σ(W_3 · x_t + U_3 · h_{t−1}),
c̃_t = tanh(W_c · x_t + u_c · h_{t−1}),
c_t = g_2 ◦ c_{t−1} + g_3 ◦ c̃_t,
h_t = g_1 ◦ c_t,

where g_1, g_2, and g_3 are the gates of the LSTM cell.

1) Assign these gates correctly to the forget f, update u, and output o gates. (1p)

2) What does the value c_t represent in an LSTM? (1p)
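A small sketch that implements exactly the cell equations above, with scalar inputs and weights for simplicity (σ is the sigmoid and ◦ is element-wise multiplication):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    W1, U1, W2, U2, W3, U3, Wc, uc = params
    g1 = sigmoid(W1 * x_t + U1 * h_prev)        # gate applied to the cell state to form h_t
    g2 = sigmoid(W2 * x_t + U2 * h_prev)        # gate applied to the previous cell state
    g3 = sigmoid(W3 * x_t + U3 * h_prev)        # gate applied to the candidate cell state
    c_tilde = np.tanh(Wc * x_t + uc * h_prev)   # candidate cell state
    c_t = g2 * c_prev + g3 * c_tilde
    h_t = g1 * c_t
    return h_t, c_t

h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, params=(0.5,) * 8)
print(h, c)
```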

Problem 7 Autoencoder and Network Transfer (11 credits)
You are given a dataset containing 10,000 RGB images with height H and width W of single coins without
any labels or additional information.

To work with the image dataset you build an autoencoder as depicted in the figure below:
[Figure: an encoder maps the H × W × 3 input image to a latent code z, and a decoder reconstructs an H × W × 3 image from that code]

The input of the encoder is an image of dimension (H × W × 3), which is transformed into a one-dimensional real vector with z entries. This latent code is used to decode an image with the same dimension (H × W × 3). Both encoder and decoder are neural networks, and the combined network is trainable and uses the L2 loss as its optimization objective.
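A PyTorch skeleton of this setup (the fully-connected layers, their sizes, and the latent dimension z = 64 are illustrative assumptions, not given in the problem):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, h, w, z=64):
        super().__init__()
        d = h * w * 3   # flattened H x W x 3 image
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, z))
        self.decoder = nn.Sequential(nn.Linear(z, 256), nn.ReLU(), nn.Linear(256, d))

    def forward(self, x):
        latent = self.encoder(x)         # one-dimensional code with z entries
        out = self.decoder(latent)
        return out.view(x.shape)         # reshape back to the input image dimensions

model = Autoencoder(h=32, w=32)
x = torch.randn(4, 3, 32, 32)
loss = nn.MSELoss()(model(x), x)         # L2 reconstruction loss, no labels required
```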

a) Is an autoencoder an example of unsupervised learning or supervised learning?

b) As the data gets scaled down from the original dimension to a lower-dimensional bottleneck, an autoencoder can be used for data compression. How does an autoencoder as described above differ from linear methods to reduce the dimensionality of the data, such as PCA (principal component analysis)?

c) For an autoencoder we can vary the size of the bottleneck. Discuss briefly what may happen if

i) the latent space is too small (1p).

ii) the latent space is too big (1p).

d) Now, you want to generate a random image of a coin. To do so, can you just randomly sample a vector from the latent space to generate a new coin image?

e) Now, someone gives you 1,000 images that are annotated for semantic segmentation of coin and background, as shown in the image above. How would you change the architecture of the discussed autoencoder network to perform semantic segmentation?

f) If you wanted to train the new semantic segmentation network, what loss function would you use, and how?

g) How would you leverage your pretrained autoencoder for training a new segmentation network efficiently?

h) Why do you expect the pretrained autoencoder variant to generalize better than a randomly initialized network?

Problem 8 Unsorted Short Questions (11 credits)

a) Why do we need activation functions in our neural networks?

b) You are solving the binary classification task of classifying images as cars vs. persons. You design a CNN with a single output neuron. Let the output of this neuron be z. The final output of your network, ŷ, is given by

ŷ = σ(ReLU(z)),

where σ denotes the sigmoid function. You classify all inputs with a final value ŷ ≥ 0.5 as car images. What problem are you going to encounter?
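A quick numeric sketch of ŷ = σ(ReLU(z)) for a few values of z:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for z in [-5.0, -0.1, 0.0, 0.1, 5.0]:
    y_hat = sigmoid(max(z, 0.0))   # ReLU clips z to be >= 0 before the sigmoid
    print(z, y_hat)
```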

c) Suggest a method to solve exploding gradients when training fully-connected neural networks.

d) Was a badly phrased question. Removed.

e) Why do we often refer to L2-regularization as “weight decay”? Derive the mathematical expression that includes the weights W, the learning rate η, and the L2-regularization hyperparameter λ to explain your point.
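For reference, a short sketch of the derivation (assuming a plain gradient-descent update on an L2-regularized loss L(W) + (λ/2)‖W‖²; other conventions for the regularization constant change only the factor in front of λ):

```latex
W_{t+1} = W_t - \eta \,\nabla_W \Big( L(W_t) + \tfrac{\lambda}{2}\lVert W_t \rVert^2 \Big)
        = W_t - \eta \,\nabla_W L(W_t) - \eta\lambda W_t
        = (1 - \eta\lambda)\, W_t - \eta \,\nabla_W L(W_t)
```

The factor (1 − ηλ) shrinks the weights a little at every step, which is where the name comes from.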

f) You are given input samples x = (x_1, ..., x_n) for which each component x_j is drawn from a distribution with zero mean. For an input vector x the output s = (s_1, ..., s_n) is given by

s_i = Σ_{j=1}^{n} w_ij · x_j,

where your weights w are initialized by a uniform random distribution U(−α, α).

How do you have to choose α such that the variance of the input data and the output is identical, hence Var(s) = Var(x)?

Hints: For two statistically independent variables X and Y it holds that

Var(X · Y) = E(X)² Var(Y) + E(Y)² Var(X) + Var(X) Var(Y).

Furthermore, the PDF of a uniform distribution U(a, b) is

f(x) = 1/(b − a) for x ∈ [a, b], and 0 otherwise.

The variance of a continuous distribution is calculated as

Var(X) = ∫_R x² f(x) dx − µ²,

where µ is the expected value of X.
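A numeric sanity check of the result one should arrive at (a sketch assuming the derivation yields Var(w) = 1/n and hence α = √(3/n), since Var(U(−α, α)) = α²/3):

```python
import numpy as np

n, samples = 100, 50_000
alpha = np.sqrt(3.0 / n)            # assumed solution: alpha^2 / 3 = Var(w) = 1/n

x = np.random.randn(samples, n)     # zero-mean inputs with Var(x) = 1
w = np.random.uniform(-alpha, alpha, size=(samples, n))
s = (w * x).sum(axis=1)             # s_i = sum_j w_ij * x_j

print(np.var(x), np.var(s))         # both approximately 1
```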

Bonus question: Too complex.

Additional space for solutions. Clearly mark the (sub)problem your answers are related to and strike out invalid solutions.

