Week 1 Sol Merged
1. The table below shows the temperature and humidity data for two cities. Is the data linearly
separable?
(a) Yes
(b) No
(c) Cannot be determined from the given information
(d) NOT
Answer: (c) XOR
Solution: Perceptrons can only implement linearly separable functions. XOR is not linearly
separable, hence cannot be implemented by a single-layer Perceptron.
5. We are given 4 points in R^2: x1 = (0, 1), x2 = (−1, −1), x3 = (2, 3), x4 = (4, −5). The labels
of x1, x2, x3, x4 are given to be −1, 1, −1, 1 respectively. We initialize the perceptron algorithm with an
initial weight w0 = (0, 0) on this data. What will be the value of w0 after the algorithm
converges? (Take the points in sequential order from x1 to x4; an update happens when the value of
the weight changes.)
a)(0, 0)
b)(−2, −2)
c)(−2, −3)
d)(1, 1)
Answer: c)
Solution: First misclassified point is x3, hence w0 changes to:
w0 − x3 = (0, 0) − (2, 3) = (−2, −3). All the points are correctly classified after this update
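As a quick numerical check of this answer, a minimal numpy sketch verifying that w = (−2, −3) puts every point on the correct side:

```python
import numpy as np

X = np.array([[0, 1], [-1, -1], [2, 3], [4, -5]], dtype=float)
y = np.array([-1, 1, -1, 1])

w = np.array([-2.0, -3.0])           # weight after the single update w <- w - x3
scores = X @ w                       # dot products w.x for each point
print(scores)                        # [-3.  5. -13.  7.]
print(np.all(np.sign(scores) == y))  # True: every point lies on the correct side
```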
6. We are given the following data:
x1 x2 y
2 4 1
3 -1 -1
5 6 -1
2 0 1
-1 0 1
-2 -2 1
Can you classify every label correctly by training a perceptron algorithm? (assume bias to be
0 while training)
a)Yes
b)No
Answer: b) No
Solution: By plotting x1 and x2 on graph paper we can observe that 1 and −1 can’t be
separated using a line passing through the origin. Hence perceptron will fail to classify all the
points correctly.
7. Suppose we have a Boolean function that takes 5 inputs x1, x2, x3, x4, x5. We have an MP
neuron with parameter θ = 1. For how many inputs will this MP neuron give output y = 1?
a)21
b)31
c)30
d)32
Answer: b)
Solution: The total number of possible Boolean inputs is 2^5 = 32. The only input that gives output
y = 0 is (0, 0, 0, 0, 0). Hence the required answer is 32 − 1 = 31.
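A quick enumeration confirming this count (a minimal sketch using only the standard library):

```python
from itertools import product

theta = 1
# Count 5-bit Boolean inputs whose sum meets the MP-neuron threshold.
fires = sum(1 for bits in product([0, 1], repeat=5) if sum(bits) >= theta)
print(fires)  # 31: every input except (0, 0, 0, 0, 0)
```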
8. Which of the following best represents the meaning of the term "Artificial Intelligence"?
a) The ability of a machine to perform tasks that normally require human intelligence
b) The ability of a machine to perform simple, repetitive tasks
c) The ability of a machine to follow a set of pre-defined rules
d) The ability of a machine to communicate with other machines
Answer: a) The ability of a machine to perform tasks that normally require human
intelligence.
9. Which of the following statements is true about error surfaces in deep learning?
(a) They are always convex functions.
(b) They can have multiple local minima.
(c) They are never continuous.
(d) They are always linear functions.
Answer: (b) They can have multiple local minima
Solution: Error surfaces in deep learning can have multiple local minima due to the non-convex nature of the optimization problem.
10. What is the output of the following MP neuron for the AND Boolean function?
y = 1 if x1 + x2 + x3 ≥ 1, and y = 0 otherwise
Answer: (a),(c)
Solution: The MP neuron in the question computes the sum of its inputs, and returns 1 if
the sum is greater than or equal to 1, and 0 otherwise.
DEEP LEARNING WEEK 2
1. What is the range of the sigmoid function σ(x) = 1/(1 + e^(−x))?
(a) (−1, 1)
(b) (0, 1)
(c) (−∞, ∞)
(d) (0, ∞)
Answer: (b)
Solution: The sigmoid function is commonly used to map any input value to a value between
0 and 1, which makes it a popular choice for modeling probability in neural networks.
2. What happens to the output of the sigmoid function as |x| becomes very small?
Answer: (a)
Solution: As |x| becomes very small (that is, as x approaches 0), the sigmoid function
approaches 0.5.
3. Which of the following theorem states that a neural network with a single hidden layer
containing a finite number of neurons can approximate any continuous function?
Answer: d)
Solution: The universal approximation theorem states that a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact input domain.
4. We have a function that we want to approximate using 150 rectangles (towers). How many
neurons are required to construct the required network?
a)301
b)451
c)150
d)500
Answer: a)
Solution: To approximate one rectangle we need 2 neurons. Hence to create 150 towers we
will need 300 neurons. One extra neuron is required for aggregation
5. A neural network has two hidden layers with 5 neurons in each layer, and an output layer
with 3 neurons, and an input layer with 2 neurons. How many weights are there in total?
(Don't assume any bias terms in the network.)
Answer: 50
Solution: The number of weights is 2·5 + 5·5 + 5·3 = 50.
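The same count, computed layer by layer (a minimal sketch; no bias terms, as the question states):

```python
layer_sizes = [2, 5, 5, 3]  # input, hidden 1, hidden 2, output

# Each pair of adjacent layers is fully connected, contributing in_size * out_size weights.
num_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
print(num_weights)  # 2*5 + 5*5 + 5*3 = 50
```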
6. What is the derivative of the ReLU activation function with respect to its input at 0?
a) 0
b) 1
c) −1
d) Not differentiable
Answer: d) Not differentiable
Solution: The derivative of ReLU is 0 when its input is negative and 1 when its input is
positive. However, at the point where its input is 0, the derivative is undefined, as the
function is not differentiable at that point. Therefore, the correct answer is d) Not
differentiable.
7. Consider the function f(x) = x^3 − 3x^2 + 2. What is the updated value of x after the 3rd iteration
of the gradient descent update, if the learning rate is 0.1 and the initial value of x is 4?
Answer: 1.9
Solution: The gradient of the function is f'(x) = 3x^2 − 6x = 3x(x − 2).
After the first update: x = 4 − 0.1·(24) = 1.6
After the second update: x = 1.6 − 0.1·(−1.92) = 1.792
After the third update: x = 1.792 − 0.1·(−1.118) ≈ 1.90
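The three updates, reproduced numerically (a minimal sketch):

```python
def grad(x):
    return 3 * x**2 - 6 * x  # derivative of x^3 - 3x^2 + 2

x, lr = 4.0, 0.1
for step in range(3):
    x = x - lr * grad(x)
    print(step + 1, round(x, 3))  # 1.6, 1.792, ~1.904
```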
8. Which of the following statements is true about the representation power of a multilayer
network of sigmoid neurons?
(a) A multilayer network of sigmoid neurons can represent any Boolean function.
(b) A multilayer network of sigmoid neurons can represent any continuous function.
(c) A multilayer network of sigmoid neurons can represent any function.
(d) A multilayer network of sigmoid neurons can represent any linear function.
Answer: (b)
Solution: A multilayer network of sigmoid neurons with a sufficient number of hidden units
can approximate any continuous function arbitrarily well. However, it may not be able to
represent certain functions, such as those that require discontinuities, or those that require a
high degree of precision with a limited number of hidden units.
9. How many boolean functions can be designed for 3 inputs?
a)65,536
b)8
c)256
d)64
Answer: c)
Solution: The number of Boolean functions of 3 inputs is 2^(2^3) = 256.
10. How many neurons do you need in the hidden layer of a perceptron to learn any boolean
function with 6 inputs? (Only one hidden layer is allowed)
a)16
b)64
c)16
d)32
Answer: b)
Solution: The number of neurons needed in the hidden layer to represent all Boolean functions of n inputs is 2^n; for n = 6 this is 64.
DEEP LEARNING WEEK 3
5. Let p and q be two probability distributions. Under what conditions will the cross entropy
between p and q be minimized?
a) p=q
b) All the values in p are lower than corresponding values in q
c) All the values in p are lower than corresponding values in q
d) p = 0 [0 is a vector]
Answer: a
Solution: Cross entropy is lowest when both distributions are the same.
6. Which of the following is false about cross-entropy loss between two probability distributions?
a) It is always in the range (0,1)
b) It can be negative.
c) It is always positive.
d) It can be 1.
Answer: a,b
Solution: Cross-entropy loss cannot be negative; its range is [0, ∞).
7. The probability of all the events x1, x2, x3, ..., xn in a system is equal (n > 1). What can you
say about the entropy H(X) of that system? (The base of the log is 2.)
a)H(X) ≤ 1
b)H(X) = 1
c)H(X) ≥ 1
d)We can’t say anything conclusive with the provided information.
Answer: c)
Solution: Since all events are equally likely, p_i = 1/n, and the entropy is
H(X) = Σ_{i=1}^{n} −p_i·log2(p_i) = n·(1/n)·log2(n) = log2(n) ≥ 1, since n ≥ 2.
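A quick numerical check of this result, computing the entropy of a uniform distribution for a few values of n (a minimal numpy sketch):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability events."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

for n in [2, 4, 10]:
    uniform = np.full(n, 1.0 / n)
    print(n, entropy(uniform))  # prints log2(n): 1.0, 2.0, ~3.32
```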
8. Suppose we have a problem where data x and label y are related by y = x4 + 1. Which of the
following is not a good choice for the activation function in the hidden layer if the activation
function at the output layer is linear?
a)Linear
b)Relu
c)Sigmoid
d) tan^(−1)(x)
Answer: a)
Solution: If we choose a linear activation function, the output of the neural network will
be a linear function of the input, since the network is then just composing affine transformations
at every layer; hence we won't be able to learn the non-linear relationship.
9. We are given that the probability of Event A happening is 0.95 and the probability of Event
B happening is 0.05. Which of the following statements is True? (MSQ)
Answer: c),d)
Solution: Events with high probability have low information content while events with low
probability have high information content.
10. Which of the following activation functions can only give positive outputs greater than 0?
(a) Sigmoid
(b) ReLU
(c) Tanh
(d) Linear
Answer: (a)
Solution: The range of the sigmoid is (0, 1); it can never output 0 for any input.
DEEP LEARNING WEEK 4
1. Which step does Nesterov accelerated gradient descent perform before finding the update
size?
a) Increase the momentum
b) Estimate the next position of the parameters
c) Adjust the learning rate
d) Decrease the step size
Answer: b) Estimate the next position of the parameters
Solution: Nesterov gradient descent estimates the next position of the parameter and
calculates the gradient of parameters at that position. The new position is determined using
this gradient and the gradient at the original step.
2. Which parameter of vanilla gradient descent controls the step size in the direction of the
gradient?
a) Learning rate
b) Momentum
c) Gamma
d) None of the above
Answer: a) Learning rate
Solution: Learning rate determines the step size in vanilla gradient descent. Momentum is
not used in the normal gradient descent.
3. What does the distance between two contour lines on a contour map represent?
a) The change in the output of the function
b) The direction of the function
c) The rate of change of the function
d) None of the above
Answer: c) The rate of change of the function
Solution: The spacing between contour lines indicates the rate of change (steepness) of the
function: closely spaced contours mean a steep region, widely spaced contours mean a flat one.
4. Which of the following represents the contour plot of the function f(x, y) = x^2 − y?
[Options a)–d) are contour plots over x, y ∈ [−4, 4]; the plot images are not reproduced here.]
Answer: b)
Solution: The level curves of f(x, y) = x^2 − y satisfy y = x^2 − c, i.e., they are upward-opening parabolas.
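For reference, a minimal matplotlib sketch (assuming numpy and matplotlib are available) that draws the contour plot of f(x, y) = x^2 − y over the same range; the level curves come out as upward-opening parabolas:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 200)
y = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(x, y)
Z = X**2 - Y                      # f(x, y) = x^2 - y

cs = plt.contour(X, Y, Z, levels=10)
plt.clabel(cs, inline=True)       # label each level curve with its value
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```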
5. What is the main advantage of using Adagrad over other optimization algorithms?
a)It converges faster than other optimization algorithms.
b)It is less sensitive to the choice of hyperparameters(learning rate).
c)It is more memory-efficient than other optimization algorithms.
d)It is less likely to get stuck in local optima than other optimization algorithms.
Answer: b) The main advantage of using Adagrad over other optimization algorithms is
that it is less sensitive to the choice of hyperparameters.
Solution: Adagrad automatically adapts the learning rate for each weight based on the
gradient history, which makes it less sensitive to the choice of hyperparameters than other
optimization algorithms. This can be especially useful when dealing with high-dimensional
datasets or complex models where the manual tuning of hyperparameters can be
time-consuming and error-prone.
6. We are training a neural network using the vanilla gradient descent algorithm. We observe
that the change in weights is small in successive iterations. What are the possible causes for
the following phenomenon? (MSQ)
a)η is large
b)∇w is small
c)∇w is large
d)η is small
Answer: (b), (d)
Solution: A small change in the weights means that the quantity η∇w is small. This can happen if
∇w or η is small.
7. You are given labeled data, which we call X, where rows are data points and columns are features.
One column has most of its values as 0. Which algorithm should we use here for faster
convergence to the optimal value of the loss function?
a)NAG
b)Adam
c)Stochastic gradient descent
d)Momentum-based gradient descent
Answer: b)
Solution: The gradient with respect to the weight of the mostly-zero feature column is sparse, so Adam, which adapts the learning rate per parameter based on the gradient history, works best here.
Note that the moving averages used in Adam are initialized to zero, which can result in
biased estimates of the first and second moments of the gradient. To address this, Adam
applies bias-correction terms to the moving averages, which corrects for the initial bias and
leads to more accurate estimates of the moments.
8. What is the update rule for the ADAM optimizer?
a) w_t = w_{t−1} − lr·(m_t/(√v_t + ϵ))
b) w_t = w_{t−1} − lr·m_t
c) w_t = w_{t−1} − lr·(m_t/(v_t + ϵ))
d) w_t = w_{t−1} − lr·(v_t/(m_t + ϵ))
Answer: a)
Solution: The update rule for the ADAM optimizer is w = w - lr * (m / (sqrt(v) + eps)),
where w is the weight, lr is the learning rate, m is the first-moment estimate, v is the
second-moment estimate, and eps is a small constant to prevent division by zero.
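A minimal numpy sketch of one Adam step implementing this rule; the hyperparameter values are the commonly used defaults and are assumptions, not part of the question:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m, v are the running first/second moment estimates, t >= 1."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment estimate
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # update rule from option a)
    return w, m, v

w = np.array([0.5, -0.3])
m = np.zeros_like(w)
v = np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, m, v = adam_step(w, grad, m, v, t=1)
print(w)
```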
9. What is the advantage of using mini-batch gradient descent over batch gradient descent?
a) Mini-batch gradient descent is more computationally efficient than batch gradient descent.
b) Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch
gradient descent.
c) Mini batch gradient descent gives us a better solution.
d) Mini-batch gradient descent can converge faster than batch gradient descent.
Answer: a) and d).
Solution: The advantage of using mini-batch gradient descent over batch gradient descent is
that it is more computationally efficient, allows for parallel processing of the training
examples, and can converge faster than batch gradient descent.
10. Which of the following is a variant of gradient descent that uses an estimate of the next
gradient to update the current position of the parameters?
a) Momentum optimization
b) Stochastic gradient descent
c) Nesterov accelerated gradient descent
d) Adagrad
Answer: c) Nesterov accelerated gradient descent
Solution: Nesterov gradient descent estimates the next position of the parameter and
calculates the gradient of parameters at that position. The new position is determined using
this gradient and the gradient at the original step.
DEEP LEARNING WEEK 5
[Questions 1–4, the data points x1, x2, x3 used in this block, and the option vectors for this question are not reproduced here.]
Answer: c)
Solution: The mean of the data points is x̄ = (x1 + x2 + x3)/3.
5. The covariance matrix C = (1/n) Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T is given by: (x̄ is the mean of the data points)
a) [0.33 −0.33; −0.33 0.33]
b) [1 −1; −1 1]
c) [0 0; 0 0]
d) [0.67 −0.67; −0.67 0.67]
Answer: d)
Solution: With x'_i = x_i − x̄, C = (1/3)(x'_1 x'_1^T + x'_2 x'_2^T + x'_3 x'_3^T) = (1/3)[2 −2; −2 2] = [0.67 −0.67; −0.67 0.67].
6. The maximum eigenvalue of the covariance matrix C is:
a) 1/3
b) 4/3
c) 1/6
d) 1/2
Answer: b)
Solution: With A = C = (1/3)[2 −2; −2 2], if v is an eigenvector then Av = λv ⟹ (A − λI)v = 0 ⟹ |A − λI| = 0 ⟹ λ^2 − (4/3)λ = 0 ⟹ λ(λ − 4/3) = 0.
Hence λ = 4/3 or 0, and the maximum eigenvalue is 4/3.
7. The eigenvector corresponding to the maximum eigenvalue of the given matrix C is:
a) (0.7, 0.7)
b) (−0.7, 0.7)
c) (−1, 0)
d) (1, 1)
Answer: b)
Solution: Using the eigenvalue λ = 4/3 found above, we solve (A − (4/3)I)v = 0. The unit vector in the null space of (A − (4/3)I), i.e., the solution v, is (−0.7, 0.7).
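A quick numerical check of questions 5–7 (a minimal numpy sketch, starting from the covariance matrix derived in the solution to question 5):

```python
import numpy as np

# Covariance matrix from question 5: C = (1/3) * [[2, -2], [-2, 2]]
C = np.array([[2.0, -2.0], [-2.0, 2.0]]) / 3.0

# Eigen-decomposition (eigh is appropriate for symmetric matrices; eigenvalues ascending)
eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)         # [0.0, 1.333...]  -> maximum eigenvalue 4/3
print(eigvecs[:, -1])  # eigenvector for 4/3: ~[-0.707, 0.707] (up to sign)
```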
8. The data points x1, x2, x3 are projected on the eigenvector calculated above. After projection, what will be the new coordinate of x2? (Hint: (−0.7, 0.7) = (1/√2)(−1, 1))
a) (−1, 1)
b) (0, 2)
c) (0, 0)
d) (1, −1)
Answer: a)
Solution: The required projection is given by (x^T w)w = ((0, 2)·(1/√2)(−1, 1))·(1/√2)(−1, 1) = √2·(1/√2)(−1, 1) = (−1, 1).
9. What is the covariance between height and weight in the given dataset? (Use the formula cov(X, Y) = (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ).)
a) 121.2
b) 89.6
c) 62.6
d) 74
Answer: c)
Solution: The formula for covariance between two variables X and Y with n observations is cov(X, Y) = (1/n) Σ_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ).
cov(Height, Weight) = (1/5)[(70 − 69)(155 − 163) + (65 − 69)(130 − 163) + (72 − 69)(180 − 163) + (68 − 69)(160 − 163) + …] = 62.6
10. What is the correlation between height and weight in the given dataset?
a)0.7
b)1
c)0.96
d)0.59
Answer: c)
DEEP LEARNING WEEK 6
c) Autoencoders may overfit the training data and generalize poorly to new data.
d) Autoencoders are unable to handle linear relationships between data.
Answer: a), c)
Solution: Autoencoders can be computationally expensive and may require more training
data than PCA. They are also more prone to overfitting the training data if not properly
regularized.
6. What is the primary objective of sparse autoencoders that distinguishes it from vanilla
autoencoder?
a) They learn a low-dimensional representation of the input data
b) They minimize the reconstruction error between the input and the output
c) They capture only the important variations/features in the data
d) They maximize the mutual information between the input and the output
Answer: c)
Solution: Sparse autoencoders are designed to promote sparsity in the hidden layer by
adding a sparsity constraint to the objective function. The goal is to encourage the model to
learn a compact representation of the input data that captures only the most salient features.
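To make the sparsity constraint concrete, here is a minimal numpy sketch of one common form of the sparse-autoencoder objective: reconstruction error plus an L1 penalty on the hidden activations (the penalty weight lam and the toy values are illustrative assumptions):

```python
import numpy as np

def sparse_autoencoder_loss(x, x_hat, h, lam=1e-3):
    """Reconstruction error plus an L1 sparsity penalty on the hidden activations h."""
    reconstruction = np.mean((x - x_hat) ** 2)  # how well the input is reproduced
    sparsity = lam * np.sum(np.abs(h))          # pushes most hidden activations toward 0
    return reconstruction + sparsity

# Toy usage with random stand-in values
rng = np.random.default_rng(0)
x, x_hat, h = rng.normal(size=8), rng.normal(size=8), rng.normal(size=4)
print(sparse_autoencoder_loss(x, x_hat, h))
```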
7. Which of the following networks represents an autoencoder?
c) [Network diagram: input layer x1–x4, hidden layer h1, h2, output layer ŷ1–ŷ4 (a 4–2–4 network)]
d) [Network diagram: input layer x1, x2, hidden layer h1, h2, output layer ŷ1–ŷ3 (a 2–2–3 network)]
Answer: c)
Solution: An autoencoder learns a representation of the input data and reconstructs the input,
so the output layer must have the same size as the input layer in order to compute the
reconstruction error.
8. If the dimension of the hidden layer representation is more than the dimension of the input
layer, then what kind of autoencoder do we have?
a)Complete autoencoder
b)Under-complete autoencoder
c)Overcomplete autoencoder
d)Sparse autoencoder
Answer:c)
Solution: If dim(h_i) > dim(x_i), then the given autoencoder is an overcomplete autoencoder.
9. Suppose for one data point we have features x1 , x2 , x3 , x4 , x5 as −2, 12, 4.2, 7.6, 0 then, which
of the following function should we use on the output layer(decoder)?
a)Logistic
b)Relu
c)Tanh
d)Linear
Answer: d)
Solution: Since the features take arbitrary real values rather than values in (−1, 1) or (0, 1),
a linear activation at the output works best.
10. If the dimension of the input layer in an under-complete autoencoder is 6, what is the
possible dimension of the hidden layer?
a)6
b)2
c)8
d)0
Answer: b)
Solution: The dimension of the hidden layer is less than the input layer in the
under-complete autoencoder.
DEEP LEARNING WEEK 7
1. Which of the following statements is true about the bias-variance tradeoff in deep learning?
A) Increasing the learning rate reduces bias
B) Increasing the learning rate reduces variance
C) None of These
Answer: C) None of These
Solution: The learning rate does not determine the capacity of the model, which is the main cause of
bias or variance.
2. Which of the following statements is true about the bias-variance tradeoff in deep learning?
A) Increasing the size of the training dataset reduces bias
B) Increasing the size of the training dataset reduces variance
C) Decreasing the size of the training dataset reduces bias
D) Decreasing the size of the training dataset reduces variance
Answer: B)
Solution: Increasing the size of the training dataset can help reduce variance in deep learning
models by providing more examples for the model to learn from, which can help reduce the
impact of noise in the training data. Decreasing the size of the training dataset can lead to
overfitting, which increases variance. Therefore, increasing the size of the training dataset
can help reduce variance in deep learning models.
3. What is the effect of high bias on a model’s performance?
a. The model will overfit the training data.
b. The model will underfit the training data.
c. The model will be unable to learn anything from the training data.
d. The model’s performance will be unaffected by bias.
Answer: b
Solution: High bias occurs when a model is too simple and is unable to capture the
complexity of the underlying problem. In this case, the model will underfit the training data,
meaning that it will not be able to generalize well to new data.
4. What is the usual relationship between train error and test error?
a) Train error is usually higher than test error
b) Train error is usually lower than test error
c) Train error and test error are usually the same
d) Train error and test error are unrelated
Answer: b)
Solution: In deep learning, the model is trained on a set of data and then tested on a
separate set of data to measure its performance. The training error is calculated using the
same data that was used to train the model, while the test error is calculated using new,
unseen data. Since the model is optimized to fit the training data, it is expected to have a
lower error on the training data than on new, unseen data. Therefore, the training error is
always lower than the test error.
5. What is overfitting in deep learning? (MSQ)
a) When the model performs well on the training data but poorly on new, unseen data
b) When the model performs poorly on the training data and on new, unseen data
c) When the model has a high test error and a low train error
d) When the model has a low test error and a high train error
Answer: a),c) When the model performs well on the training data but poorly on new,
unseen data
Solution: Overfitting occurs when the model is too complex and is able to fit the training
data too closely. This results in the model performing well on the training data but poorly
on new, unseen data. This is because the model has essentially memorized the training data
and is not able to generalize to new data.
6. How can overfitting be prevented in deep learning?
a) By increasing the complexity of the model
b) By decreasing the size of the training data
c) By adding more layers to the model
d) By using regularization techniques such as dropout
Answer: d) By using regularization techniques such as dropout
Solution: Regularization techniques such as dropout can be used to prevent overfitting in
deep learning.
7. Which of the following statements is true about L2 regularization?
A. It adds a penalty term to the loss function that is proportional to the absolute value of
the weights.
B. It adds a penalty term to the loss function that is proportional to the square of the
weights.
C. It gives us sparse solutions for w.
D. It is equivalent to adding gaussian noise to the weights.
Answer: B,D
8. Which of the following regularization techniques is likely to produce a sparse weight vector?
A. L1 regularization
B. L2 regularization
C. Dropout
D. Data augmentation
Answer: A
Solution: L1 regularization is likely to produce a sparse weight vector because the penalty
term it adds to the loss function encourages some weights to be exactly zero. In contrast, L2
regularization encourages the weight vector to be small overall, but it does not necessarily
lead to sparsity.
9. We trained different models on data and then we used the bagging technique. We observe
that our test error reduces drastically after using bagging. Choose the correct options.
(MSQ)
a) All models had the same hyperparameters and were trained on the same features
b) All the models were correlated.
c) All the models were uncorrelated(independent).
d) All of these.
Answer: c)
Solution: If the models were correlated, the covariance of their test errors would not be 0, and the
test error would not reduce drastically. If all the models have the same hyperparameters and are
trained on the same set of data, then they are correlated.
10. Which of the following is an example of how to add Gaussian noise to input data x in Deep
Learning? (MSQ)
a) y = x + ϵ, where ϵ ∼ N (0, σ)
b) y = x ∗ ϵ, where ϵ ∼ N (0, σ)
c) y = x − ϵ, where ϵ ∼ N (0, σ)
d) y = x/ϵ, where ϵ ∼ N (0, σ)
Answer: a), c)
Solution: To add Gaussian noise to input data x in deep learning, we can use the
equation y = x + ϵ or y = x − ϵ, where ϵ ∼ N(0, σ) represents a small amount of noise sampled
from a Gaussian distribution with zero mean and a predefined standard deviation σ. The
resulting output y will be a noisy version of the input data x, and this can be used during
training to improve the model's generalization performance.
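A minimal numpy sketch of this augmentation (the value of sigma is an illustrative assumption, not specified in the question):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_gaussian_noise(x, sigma=0.1):
    """Return a noisy copy of x: y = x + eps, with eps ~ N(0, sigma)."""
    eps = rng.normal(loc=0.0, scale=sigma, size=x.shape)
    return x + eps

x = np.array([1.0, 2.0, 3.0])
print(add_gaussian_noise(x))  # x perturbed by zero-mean Gaussian noise
```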
DEEP LEARNING WEEK 8
1. Which of the following best describes the concept of saturation in deep learning?
a) When the activation function output approaches either 0 or 1 and the gradient is close to
zero.
b) When the activation function output is very small and the gradient is close to zero.
c) When the activation function output is very large and the gradient is close to zero.
d) None of the above.
Answer: a), b), c)
Solution: Saturation happens when the activation function's output lies in a flat region of the function, so the gradient is close to 0.
2. Which of the following methods can help to avoid saturation in deep learning?
a) Using a different activation function.
b) Increasing the learning rate.
c) Increasing the model complexity
d) All of the above.
Answer: a)
Solution: Using a different activation function such as ReLU can avoid saturation.
3. Which of the following is true about the role of unsupervised pre-training in deep learning?
a. It is used to replace the need for labeled data
b. It is used to initialize the weights of a deep neural network
c. It is used to fine-tune a pre-trained model
d. It is only useful for small datasets
Answer: b
Solution: Unsupervised pre-training is used to initialize the weights of a deep neural network
before being fine-tuned on labeled data. It is not used to replace the need for labeled data.
This technique is not limited to small datasets and can be used for large datasets as well.
4. Which of the following is an advantage of unsupervised pre-training in deep learning?
a. It helps in reducing overfitting
b. Pre-trained models converge faster
c. It improves the accuracy of the model
d. It requires fewer computational resources
Answer: b,c
Solution: Unsupervised pre-training helps in reducing overfitting in deep neural networks by
initializing the weights in a better way. This technique requires more computational
resources than supervised learning, but it can improve the accuracy of the model.
Additionally, the pre-trained model is shown to converge faster than non-pre-trained models
5. What is the main cause of the Dead ReLU problem in deep learning?
a) High variance
b) High negative bias
c) Overfitting
d) Underfitting
Answer: b) High negative bias
Solution: The Dead ReLU problem arises when the bias term is a large negative value. This
makes the pre-activation negative for most inputs, which in turn leads to a
large number of dead neurons (i.e., neurons that output zero regardless of the input). This
can significantly reduce the expressive power of the network and make it difficult to learn
from the data.
6. How can you tell if your network is suffering from the Dead ReLU problem?
a) The loss function is not decreasing during training
b) The accuracy of the network is not improving
c) A large number of neurons have zero output
d) The network is overfitting to the training data
Answer: c) A large number of neurons have zero output
Solution: The Dead ReLU problem can be detected by checking the output of each neuron
in the network. If a large number of neurons always have zero output, then the network may be
suffering from the Dead ReLU problem; this can indicate that the bias terms are strongly
negative, killing those neurons.
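A minimal numpy sketch of this check: given the post-ReLU activations of a layer over a batch, count the units that never activate (the shapes and the bias shift are illustrative assumptions):

```python
import numpy as np

def dead_relu_fraction(activations):
    """activations: array of shape (num_examples, num_units) after ReLU.
    A unit is 'dead' on this batch if it outputs 0 for every example."""
    dead = np.all(activations == 0, axis=0)
    return dead.mean()

rng = np.random.default_rng(0)
pre_act = rng.normal(size=(128, 64)) - 3.0  # strongly negative bias pushes pre-activations below 0
post_act = np.maximum(pre_act, 0.0)         # ReLU
print(dead_relu_fraction(post_act))         # a large fraction of the units are dead on this batch
```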
10. In Batch Normalization, which parameter is learned during training?
A) Mean
B) Variance
C) γ
D) ϵ
Answer: C) γ
Explanation: In Batch Normalization, the scaling and shifting parameters gamma and beta
are learned during training, while the mean and variance of the inputs are estimated over the
current batch. The small constant epsilon is typically set to a small value, such as 1e-5, to
avoid numerical instability.
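A minimal numpy sketch of the batch-normalization forward pass described above; γ and β are the learned parameters, and the eps default of 1e-5 follows the value mentioned in the explanation:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch_size, num_features). gamma, beta: (num_features,) learned parameters."""
    mean = x.mean(axis=0)                    # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta              # scale and shift

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=(32, 4))
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0), out.std(axis=0))  # ~0 and ~1 per feature
```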
DEEP LEARNING WEEK 9
d) To adjust the weights of the neural network during training
Answer: b) To transform the dot product into a probability distribution
Solution: The softmax function is used in the skip-gram method to transform the dot
product between the target word and the context words into a probability distribution. This
distribution represents the likelihood of seeing each context word given the target word, and
is used to train the model by minimizing the cross-entropy loss between the predicted and
actual distributions.
6. Suppose we are learning the representations of words using GloVe representations. If we
observe that the cosine similarity between two representations vi and vj for words 'i' and 'j'
is very high, which of the following statements is true? (Parameters bi = 0.02 and bj = 0.05.)
a)Xij = 0.03.
b)Xij = 0.8.
c)Xij = 0.35.
d)Xij = 0.
Answer: b)
Solution: Since the word representations are similar, we know v_i^T v_j is high, and
v_i^T v_j = X_ij − b_i − b_j. Hence X_ij is high, and the only high value among the options is 0.8.
7. We add incorrect pairs into our corpus to maximize the probability of words that occur in
the same context and minimize the probability of words that occur in different contexts.
This technique is called-
a)Hierarchical softmax
b)Contrastive estimation
c)Negative sampling
d)Glove representations
Answer: c)
Solution: The process of adding incorrect pair to the training set is called negative sampling.
8. What is the computational complexity of computing the softmax function in the output layer
of a neural network?
a) O(n)
b) O(n^2)
c) O(n log n)
d) O(log n)
Answer: a)
Explanation: The computational complexity of computing the softmax function in the
output layer of a neural network is O(n), where n is the number of output classes.
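A minimal numpy sketch of the softmax over n output scores; the exponentiation, summation, and division each touch every entry once, which is where the O(n) cost comes from (the max subtraction is a standard numerical-stability trick):

```python
import numpy as np

def softmax(scores):
    """Softmax over a vector of n scores: O(n) work overall."""
    shifted = scores - np.max(scores)  # numerical stability, one pass
    exps = np.exp(shifted)             # one pass
    return exps / exps.sum()           # one pass

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores), softmax(scores).sum())  # probabilities summing to 1
```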
9. How does Hierarchical Softmax reduce the computational complexity of computing the
softmax function?
a) It replaces the softmax function with a linear function
b) It uses a binary tree to approximate the softmax function
c) It uses a heuristic to compute the softmax function faster
d) It does not reduce the computational complexity of computing the softmax function
Answer: b)
Explanation: Hierarchical Softmax uses a binary tree to approximate the softmax function.
This reduces the computational complexity of computing the softmax function from O(n) to
O(log n).
10. What is the disadvantage of using Hierarchical Softmax?
a) It requires more memory to store the binary tree
b) It is slower than computing the softmax function directly
c) It is less accurate than computing the softmax function directly
d) It is more prone to overfitting than computing the softmax function directly
Answer: a)
Explanation: The disadvantage of using Hierarchical Softmax is that it requires more
memory to store the binary tree. This can be a problem when dealing with large datasets or
models with a large number of output classes.
DEEP LEARNING WEEK 10
a)AlexNet
b)GoogleNet
c)VGG
d)ResNet
Answer: d)
Solution: ResNet has the largest number of layers among these architectures.
2. Consider a convolution operation with an input image of size 100x100x3 and a filter of size
8x8x3, using a stride of 1 and a padding of 1. What is the output size?
A. 100x100x3
B. 98x98x1
C. 102x102x3
D. 95x95x1
Answer: d)
Solution: Output size = (Input size − Filter size + 2·Padding)/Stride + 1. Here, input size = 100,
filter size = 8, padding = 1, stride = 1, so the output size is (100 − 8 + 2)/1 + 1 = 95. A single
filter produces one output channel, so the output is 95x95x1. Hence, the correct answer is option D.
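A small helper, as a sketch, that evaluates this formula; it reproduces the spatial sizes used in questions 2 and 3:

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Spatial output size of a convolution: floor((I - F + 2P) / S) + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(100, 8, padding=1, stride=1))   # 95 (question 2)
print(conv_output_size(256, 11, padding=2, stride=4))  # 63 (spatial size in question 3)
```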
3. Consider a convolution operation with an input image of size 256x256x3 and 40 filters of size
11x11x3, using a stride of 4 and a padding of 2. What is the height of the output size?
A. 63
B. 64
C. 40
D. 3
Answer: C
Solution: The number of filters determines the depth (number of output channels) of the output volume, which is 40 here.
4. Which statement is true about the number of filters in CNNs?
a) More filters lead to better accuracy.
b) Fewer filters lead to better accuracy.
c) The number of filters has no effect on accuracy.
d) The number of filters only affects the computation time.
Answer: a) More filters lead to better accuracy.
Solution: More filters can lead to better accuracy because they allow the network to learn
more complex and diverse features. However, increasing the number of filters also increases
the number of parameters in the network.
5. Which of the following statements is true regarding the occlusion experiment in a CNN?
A. It is used to determine the importance of each feature map in the output of the network.
B. It involves masking a portion of the input image with a patch of zeroes.
C. It is a technique used to prevent overfitting in deep learning models.
D. It is used to increase the number of filters in a convolutional layer.
Answer: A B
Solution: In the occlusion experiment, a patch of zeroes is placed over a portion of the
input image to observe the effect on the output of the network. This helps to determine the
importance of each region of the image in the network’s prediction.
6. Which of the following is an innovation introduced in GoogleNet architecture?
a) 1x1 convolutions to reduce the dimension
b) ReLU activation function
c) Dropout regularization
d) use of different-sized filters for the same input
Correct Answer: a),d)
10. We have a trained CNN. We have the picture on the left which when fed into the network as
input is given the label ’HUMAN’ with high probability. The picture on the right is the same
image with some added noise. If we feed the right image as input to the CNN then which of
DEEP LEARNING WEEK 11
2. Which of the following is a common architecture used for sequence learning in deep learning?
a) Convolutional Neural Networks (CNNs)
b) Autoencoders
c) Recurrent Neural Networks (RNNs)
d) Generative Adversarial Networks (GANs)
Answer: c) Recurrent Neural Networks (RNNs)
Solution: Recurrent Neural Networks (RNNs) are a common architecture used for sequence
learning in deep learning. RNNs are designed to handle sequential data by maintaining a
hidden state that captures the context of the previous inputs in the sequence. This allows
RNNs to model the temporal dependencies between sequential data.
In BPTT, what is the role of the error gradient?
a) To update the weights of the connections between the neurons.
b) To propagate information backward through time.
c) To determine the output of the network.
d) To adjust the learning rate of the network.
Answer: b) To propagate information backward through time.
Solution: In BPTT, the error gradient is used to propagate information backward through
time by computing the derivative of the error with respect to each weight in the network.
This allows the network to learn from past inputs and to use that information to make
predictions about future inputs.
5. Arrange the following sequence in the order they are performed by LSTM at time step t.
[Selectively read, Selectively write, Selectively forget]
Answer: c)
Solution: At time step t we first selectively read from the state st−1 , then selectively forget
to create the state st . Then we selectively write to create the state ht from st which will be
used in the t+1 time step.
6. What are the problems in the RNN architecture? (MSQ)
Answer: d)
Solution: Information stored in the network gets morphed at every time step due to new
input. Exploding and vanishing gradient problems are caused by the long dependency chains
in RNN.
7. What is the purpose of the forget gate in an LSTM network?
A) To decide how much of the cell state to keep from the previous time step
B) To decide how much of the current input to add to the cell state
C) To decide how much of the current cell state to output
D) To decide how much of the current input to output
Answer: A) To decide how much of the cell state to keep from the previous time step
Explanation: The forget gate in an LSTM network determines how much of the previous
cell state to forget and how much to keep for the current time step.
8. Which of the following is the formula for calculating the output gate in a GRU network?
A) zt = σ(Wz ∗ [ht−1 , xt ])
B) zt = σ(Wz ∗ ht−1 + Uz ∗ xt )
C) zt = σ(Wz ∗ ht−1 + Uz ∗ xt + bz )
D) zt = tanh(Wz ∗ ht−1 + Uz ∗ xt )
Answer: c) zt = σ(Wz ∗ ht−1 + Uz ∗ xt + bz )
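A minimal numpy sketch of this gate computation for toy dimensions; the weight values are random placeholders, not taken from the question:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
hidden, inp = 3, 2
W_z = rng.normal(size=(hidden, hidden))  # acts on the previous hidden state
U_z = rng.normal(size=(hidden, inp))     # acts on the current input
b_z = np.zeros(hidden)

h_prev = rng.normal(size=hidden)
x_t = rng.normal(size=inp)

z_t = sigmoid(W_z @ h_prev + U_z @ x_t + b_z)  # gate values in (0, 1)
print(z_t)
```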
Common data for questions 9–10
We are given the following RNN. We are also given the architecture for this RNN (the architecture
shown doesn't include the weights W connecting the states of the network).
9. How many neurons are in the hidden layer at state s2 of the RNN?
a)6
b)2
c)9
d)4
Answer: d)
Solution: There is only one architecture in RNN. The different blocks in the picture
represent the state of the network at different times.
10. We have trained the above given RNN and it has learned weights and biases accordingly. If
the weight of x1 to h1 (1) at s5 is 3, what will be the value of the same weight at s6 ?
a)3
b)6
c)4
d)1
Answer: a)
Solution: The weights are shared across all time steps in an RNN, so the weight is the same at s6 as at s5.
DEEP LEARNING WEEK 12
1. We are performing the task of ”Image Question Answering” using the encoder-decoder
model. Choose the equation representing the Decoder model for this task. (MSQ)
a)CNN(xi )
b)RNN(st−1 , e(ŷt−1 ))
c)P (y|q, I) = Sof tmax(V s + b)
d)RNN(xit )
Answer: c)
Solution: In this task the output comes from a fixed vocabulary, so we just need to select the
word with the highest output probability based on the representations of the inputs learned by
our encoder model.
2. Which of the following is a disadvantage of using an encoder-decoder model for
sequence-to-sequence tasks?
a) The model requires a large amount of training data
b) The model is slow to train and requires a lot of computational resources
c) The generated output sequences may be limited by the capacity of the model
d) The model is prone to overfitting on the training data
Answer: b) The model is slow to train and requires a lot of computational resources
Solution: Encoder-decoder models are powerful but computationally expensive models that
require a lot of training data and computational resources to train. The training process can
be slow and may require the use of specialized hardware such as GPUs. Additionally, the
capacity of the model may limit the quality of the generated output sequences.
3. Which of the following is NOT a component of the attention mechanism?
A. Decoder
B. Key
C. Value
D. Encoder
Answer: A, D
Solution: The attention mechanism consists of three components: query, key, and value.
The query comes from the current state of the decoder, while the keys and values are derived
from the encoder's hidden states. The encoder and decoder themselves are not components of the
attention mechanism.
4. What is the purpose of the softmax function in the attention mechanism?
A. To normalize the attention weights
B. To compute the dot product between the query and key vectors
C. To compute the element-wise product between the query and key vectors
D. To apply a non-linear activation function to the attention weights
Answer: A
Solution: The softmax function is used to normalize the attention weights so that they sum
to 1. This allows the weights to be interpreted as a probability distribution over the input
sequence. The dot product between the query and key vectors is used to compute the raw
attention scores, and the element-wise product is used in some variations of the attention
mechanism.
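A minimal numpy sketch of single-query dot-product attention, showing where the softmax normalization fits; the dimensions and values are illustrative assumptions:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
d, T = 4, 5                      # key/query dimension, number of encoder positions
q = rng.normal(size=d)           # query: current decoder state
K = rng.normal(size=(T, d))      # keys, one per encoder position
V = rng.normal(size=(T, d))      # values, one per encoder position

scores = K @ q                   # raw attention scores (dot products)
weights = softmax(scores)        # normalized to a probability distribution
context = weights @ V            # weighted sum of the values
print(weights.sum(), context.shape)  # 1.0, (4,)
```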
5. Which of the following is a common variant of the attention mechanism?
A. Self-attention
B. Multi-task attention
C. Adversarial attention
D. Transfer learning attention
Answer: A
Solution: Self-attention, also known as intra-attention, is a common variant of the attention
mechanism. It allows the model to attend to different parts of the input sequence while
generating the output sequence. Multi-task attention refers to using attention across multiple
tasks, while adversarial attention and transfer learning attention are not common variants of
the attention mechanism.
6. Which of the following is a major advantage of using an attention mechanism in an
encoder-decoder model?
A. Reduced computational complexity
B. Improved generalization to new data
C. Reduced risk of overfitting
D. None of These
Answer: B
Solution: One advantage of using an attention mechanism in an encoder-decoder model is
improved generalization to new data. The attention mechanism allows the model to
selectively focus on different parts of the input sequence, which can be particularly useful
when the input and output sequences are of different lengths. This can help the model
generalize better to new data.
7. Which of the following is a commonly used attention mechanism in the encoder-decoder
model?
a) Dot product attention
b) Additive attention
c) Multiplicative attention
d) All of the above
Answer: a) Dot product attention
Solution: Several attention mechanisms can be used in the encoder-decoder model, including
dot-product attention, additive attention, and multiplicative attention, each with its own
strengths and weaknesses. Dot-product attention is the most commonly used because it is simple
and efficient to compute.
8. Which of the following output functions is most commonly used in the decoder of an
encoder-decoder model for translation tasks?
a) Sigmoid
b) ReLU
c) Softmax
d) Tanh
Answer: c) Softmax
Solution: The softmax activation function is commonly used in the output layer of the
decoder in an encoder-decoder model. It is used to convert the outputs of the decoder into a
probability distribution over the vocabulary of the output sequence. This allows the model to
generate a coherent and meaningful output sequence.
9. In the encoder-decoder model, what is the role of the decoder?
a) To generate output based on the input representations.
b) To encode the input
c) To learn the attention mechanism
d) None of the above
Answer: a) To generate output based on the input
Solution: The decoder in the encoder-decoder model takes the output of the attention
mechanism as input and generates the final output based on the task at hand. This could be
an image caption, a translation, or any other type of output.
10. We are performing a task where we generate the summary for an image using the
encoder-decoder model. Choose the correct statements. (MSQ)
Answer: a)
Solution: We use a CNN to learn a representation of the image, which is fed as the initial state
(state 0) to the LSTM decoder.