Week 1 Sol Merged

RL nptel solution
Copyright
© © All Rights Reserved
DEEP LEARNING WEEK 1

1. The table below shows the temperature and humidity data for two cities. Is the data linearly
separable?

City   Temperature (°C)   Humidity (%)
A      25                 50
A      20                 60
A      30                 40
D      -28                45

(a) Yes
(b) No
(c) Cannot be determined from the given information

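Answer: (a) Yes

Solution: Yes, the data is linearly separable, as there is a straight line that can be drawn to
completely separate the data points of the two cities. For example, the line Temperature = 0
works: every point for city A has a positive temperature and the point for city D has a negative one.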
2. What is the perceptron algorithm used for?

(a) Clustering data points


(b) Finding the shortest path in a graph
(c) Classifying data
(d) Solving optimization problems

Answer: (c) Classifying data


Solution: The perceptron is a binary classifier. It can only classify linearly separable data correctly.
3. What is the most common activation function used in perceptrons?
(a) Sigmoid
(b) ReLU
(c) Tanh
(d) Step
Answer: (d) Step
Solution: The step function is commonly used as the activation function in perceptrons as it
outputs either 0 or 1 based on whether the weighted sum of inputs is greater than or less
than a threshold value.
4. Which of the following Boolean functions cannot be implemented by a perceptron?
(a) AND
(b) OR
(c) XOR

(d) NOT
Answer: (c) XOR
Solution: Perceptrons can only implement linearly separable functions. XOR is not linearly
separable, hence cannot be implemented by a single-layer Perceptron.
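The claim above can be illustrated with a small brute-force search. This is only a sketch: the helper `find_perceptron`, the integer weight grid from −3 to 3, and the convention that the unit fires when the weighted sum is non-negative are all assumptions made for the example, and an empty search on a finite grid illustrates the result rather than proving it.

```python
from itertools import product

def step(z):
    # Threshold activation: fire when the weighted sum is non-negative (assumed convention).
    return 1 if z >= 0 else 0

def find_perceptron(truth_table, grid=range(-3, 4)):
    """Return (w1, w2, b) reproducing truth_table, or None if nothing on the grid works."""
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + b) == y
               for (x1, x2), y in truth_table.items()):
            return (w1, w2, b)
    return None

AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_weights = find_perceptron(AND)
or_weights = find_perceptron(OR)
xor_weights = find_perceptron(XOR)
print("AND:", and_weights, "OR:", or_weights, "XOR:", xor_weights)
```

The search finds weights for AND and OR but none for XOR, which matches the linear-separability argument.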
5. We are given 4 points in R^2, say x1 = (0, 1), x2 = (−1, −1), x3 = (2, 3), x4 = (4, −5). The labels
of x1, x2, x3, x4 are given to be −1, 1, −1, 1. We initiate the perceptron algorithm with an
initial weight w0 = (0, 0) on this data. What will be the value of w0 after the algorithm
converges? (Take the points in sequential order from x1 to x4; an update happens when the value of
the weight changes.)

a)(0, 0)
b)(−2, −2)
c)(−2, −3)
d)(1, 1)

Answer: c)
Solution: The first misclassified point is x3, hence w0 changes to
w0 − x3 = (0, 0) − (2, 3) = (−2, −3). All the points are correctly classified after this update.
6. We are given the following data:

x1    x2    y
2     4     1
3     -1    -1
5     6     -1
2     0     1
-1    0     1
-2    -2    1

Can you classify every label correctly by training a perceptron algorithm? (assume bias to be
0 while training)

a)Yes
b)No

Answer: b) No
Solution: By plotting x1 and x2 on graph paper we can observe that the points labelled 1 and −1 can’t be
separated using a line passing through the origin. Hence the perceptron will fail to classify all the
points correctly.
7. Suppose we have a Boolean function that takes 5 inputs x1, x2, x3, x4, x5. We have an MP
neuron with parameter θ = 1. For how many inputs will this MP neuron give output y = 1?

a)21
b)31
c)30
d)32

Answer: b)
Solution: The total number of possible Boolean inputs is 2^5 = 32. The only input that will give output
y = 0 is (0, 0, 0, 0, 0). Hence the required answer is 32 − 1 = 31.
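The count can be checked by enumerating every input. This is a minimal sketch that assumes the MP neuron has no inhibitory inputs and fires when the sum of its inputs is at least θ; the function name `mp_neuron` is chosen for the example.

```python
from itertools import product

THETA = 1  # threshold given in the question

def mp_neuron(inputs, theta=THETA):
    # McCulloch-Pitts neuron (no inhibitory inputs assumed):
    # fires when the number of active inputs reaches the threshold.
    return 1 if sum(inputs) >= theta else 0

all_inputs = list(product([0, 1], repeat=5))
firing = [x for x in all_inputs if mp_neuron(x) == 1]
print(len(all_inputs), len(firing))
```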
8. Which of the following best represents the meaning of the term "Artificial Intelligence"?

a) The ability of a machine to perform tasks that normally require human intelligence
b) The ability of a machine to perform simple, repetitive tasks
c) The ability of a machine to follow a set of pre-defined rules
d) The ability of a machine to communicate with other machines

Answer: a) The ability of a machine to perform tasks that normally require human
intelligence.
9. Which of the following statements is true about error surfaces in deep learning?
(a) They are always convex functions.
(b) They can have multiple local minima.
(c) They are never continuous.
(d) They are always linear functions.
Answer: (b) They can have multiple local minima
Solution: Error surfaces in deep learning can have multiple local minima due to the non-convex
nature of the optimization problem.

10. What is the output of the following MP neuron for the AND Boolean function?
$$y = \begin{cases} 1, & \text{if } x_1 + x_2 + x_3 \geq 1 \\ 0, & \text{otherwise} \end{cases}$$

(a) y = 1 for (x1 , x2 , x3 ) = (0, 1, 1)


(b) y = 0 for (x1 , x2 , x3 ) = (0, 0, 1)
(c) y = 1 for (x1 , x2 , x3 ) = (1, 1, 1)
(d) y = 0 for (x1 , x2 , x3 ) = (1, 0, 0)

Answer: (a),(c)
Solution: The MP neuron in the question computes the sum of its inputs, and returns 1 if
the sum is greater than or equal to 1, and 0 otherwise.

DEEP LEARNING WEEK 2
1. What is the range of the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$?

(a) (−1, 1)
(b) (0, 1)
(c) (−∞, ∞)
(d) (0, ∞)
Answer: (b)
Solution: The sigmoid function is commonly used to map any input value to a value between
0 and 1, which makes it a popular choice for modeling probability in neural networks.
2. What happens to the output of the sigmoid function as |x| becomes very small?

(a) The output approaches 0.5


(b) The output approaches 1.
(c) The output oscillates between 0 and 1.
(d) The output becomes undefined.

Answer: (a)
Solution: As |x| becomes very small (that is, as x approaches 0), the sigmoid function
approaches 0.5.
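Both sigmoid facts used so far (outputs stay strictly between 0 and 1, and the value near x = 0 is close to 0.5) can be seen numerically. The sample points below are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

samples = [-20.0, -1.0, -0.001, 0.0, 0.001, 1.0, 20.0]  # arbitrary test inputs
values = [sigmoid(x) for x in samples]
for x, s in zip(samples, values):
    print(f"sigmoid({x:+.3f}) = {s:.6f}")
```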
3. Which of the following theorems states that a neural network with a single hidden layer
containing a finite number of neurons can approximate any continuous function?

(a) Bayes’ theorem


(b) Central limit theorem
(c) Fourier’s theorem
(d) Universal approximation theorem

Answer: d)
Solution: The universal approximation theorem states that a neural network
with a single hidden layer containing a finite number of neurons can approximate any
continuous function on a compact input domain.
4. We have a function that we want to approximate using 150 rectangles (towers). How many
neurons are required to construct the required network?

a)301
b)451
c)150
d)500

Answer: a)
Solution: To approximate one rectangle we need 2 neurons. Hence to create 150 towers we
will need 300 neurons. One extra neuron is required for aggregation

5. A neural network has two hidden layers with 5 neurons in each layer, and an output layer
with 3 neurons, and an input layer with 2 neurons. How many weights are there in total?
(Don't assume any bias terms in the network.)
Answer: 50
Solution: The number of weights is given by 5 × 2 + 5 × 5 + 5 × 3 = 50.
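The same count follows from multiplying the sizes of adjacent layers. The sketch below assumes fully connected layers with no bias terms, as the question states; `count_weights` is a name chosen for the example.

```python
def count_weights(layer_sizes):
    # Fully connected, no biases: one weight per pair of units in adjacent layers.
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

layers = [2, 5, 5, 3]  # input, hidden 1, hidden 2, output
total = count_weights(layers)
print(total)
```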
6. What is the derivative of the ReLU activation function with respect to its input at 0?
a) 0
b) 1
c) −1
d) Not differentiable
Answer: d) Not differentiable
Solution: The derivative of ReLU is 0 when its input is negative and 1 when its input is
positive. However, at the point where its input is 0, the derivative is undefined, as the
function is not differentiable at that point. Therefore, the correct answer is d) Not
differentiable.
7. Consider the function $f(x) = x^3 - 3x^2 + 2$. What is the updated value of x after the 3rd iteration
of the gradient descent update, if the learning rate is 0.1 and the initial value of x is 4?
Answer: 1.9
Solution: The gradient of the function is $f'(x) = 3x^2 - 6x = 3x(x - 2)$, so each update is x ← x − 0.1 · 3x(x − 2).
The value of x after the first update is 4 − 2.4 = 1.6.
The value of x after the second update is 1.6 + 0.192 = 1.79.
The value of x after the third update is 1.79 + 0.112 = 1.90.
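The three updates can be reproduced with a few lines of code. This is a sketch of plain gradient descent on the function from the question; the helper names are chosen for the example.

```python
def grad(x):
    # Derivative of f(x) = x**3 - 3*x**2 + 2
    return 3 * x * (x - 2)

def gradient_descent(x0, lr, steps):
    # Returns the full trajectory [x0, x1, ..., x_steps].
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - lr * grad(xs[-1]))
    return xs

trajectory = gradient_descent(x0=4.0, lr=0.1, steps=3)
print([round(x, 4) for x in trajectory])
```

Note that the iterates move toward x = 2, where the gradient 3x(x − 2) vanishes.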
8. Which of the following statements is true about the representation power of a multilayer
network of sigmoid neurons?
(a) A multilayer network of sigmoid neurons can represent any Boolean function.
(b) A multilayer network of sigmoid neurons can represent any continuous function.
(c) A multilayer network of sigmoid neurons can represent any function.
(d) A multilayer network of sigmoid neurons can represent any linear function.
Answer: (b)
Solution: A multilayer network of sigmoid neurons with a sufficient number of hidden units
can approximate any continuous function arbitrarily well. However, it may not be able to
represent certain functions, such as those that require discontinuities, or those that require a
high degree of precision with a limited number of hidden units.
9. How many boolean functions can be designed for 3 inputs?

a)65,536
b)8
c)256
d)64

Answer: c)
Solution: The number of Boolean functions of 3 inputs is $2^{2^3} = 2^8 = 256$.
10. How many neurons do you need in the hidden layer of a perceptron to learn any boolean
function with 6 inputs? (Only one hidden layer is allowed)

a)16
b)64
c)16
d)32

Answer: b)
Solution: The number of hidden neurons needed to represent all Boolean functions of n inputs in the
perceptron is $2^n$. For n = 6 this gives $2^6 = 64$.

DEEP LEARNING WEEK 3

1. Which of the following statements about backpropagation is true?


(a) It is used to optimize the weights in a neural network.
(b) It is used to compute the output of a neural network.
(c) It is used to initialize the weights in a neural network.
(d) It is used to regularize the weights in a neural network.
Answer: (a)
Solution: Backpropagation is a commonly used algorithm for optimizing the weights in a
neural network. It works by computing the gradient of the loss function with respect to each
weight in the network, and then using that gradient to update the weight in a way that
minimizes the loss function.
2. Let y be the true class label and p be the predicted probability of the true class label in a
binary classification problem. Which of the following is the correct formula for binary cross
entropy?
a) $y \log p + (1 - y)\log(1 - p)$
b) $-(y \log p + (1 - y)\log(1 - p))$
c) $p$
d) $y \log(p)$
Answer: b
Solution:
3. Let yi be the true class label of the i-th instance and pi be the predicted probability of the
true class label in a multi-class classification problem. Write down the formula for multi-class
cross entropy loss.
a) $\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$
b) $-(y \log p + (1 - y)\log(1 - p))$
c) $-\sum_{c=1}^{M} y_{o,c} \log(p_{o,c})$
d) $y \log(p)$
Answer: c
Solution:
4. Can cross-entropy loss be negative between two probability distributions?
a) Yes
b) No
Answer: b) No
Solution: Since the probabilities are between 0 and 1, log(p) is never positive, and the
external negative sign makes each term non-negative. A sum of non-negative numbers is
non-negative.

5. Let p and q be two probability distributions. Under what conditions will the cross entropy
between p and q be minimized?
a) p=q
b) All the values in p are lower than corresponding values in q
c) All the values in p are lower than corresponding values in q
d) p = 0 [0 is a vector]
Answer: a
Solution: Cross entropy is lowest when both distributions are the same.
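A small numerical check makes the last two answers concrete. The sketch below fixes one distribution p and compares its cross entropy against a few candidate distributions q; the specific numbers and the helper `cross_entropy` are assumptions chosen for illustration, and the natural logarithm is used.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # fixed distribution (illustrative values)
candidates = [
    [0.7, 0.2, 0.1],        # q equal to p
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [1 / 3, 1 / 3, 1 / 3],
]
losses = [cross_entropy(p, q) for q in candidates]
for q, loss in zip(candidates, losses):
    print([round(v, 2) for v in q], round(loss, 4))
```

Among these candidates every value is non-negative and the smallest one occurs at q = p.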

6. Which of the following is false about cross-entropy loss between two probability distributions?
a) It is always in the range (0,1)
b) It can be negative.
c) It is always positive.
d) It can be 1.
Answer: a,b
Solution: Cross entropy loss can not be negative and its range is (0,∞)
7. The probability of all the events x1, x2, x3, ..., xn in a system is equal (n > 1). What can you
say about the entropy H(X) of that system? (The base of the log is 2.)

a)H(X) ≤ 1
b)H(X) = 1
c)H(X) ≥ 1
d)We can’t say anything conclusive with the provided information.

Answer: c)
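Solution: Since all events are equally likely, $p_i = 1/n$ and the entropy is of the form
$$H(X) = \sum_{i=1}^{n} -p_i \log_2(p_i) = \sum_{i=1}^{n} -\frac{1}{n}\log_2\frac{1}{n} = \log_2 n \geq 1,$$
where the inequality holds because n ≥ 2.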
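The bound can be checked for a few values of n. This sketch computes the base-2 entropy of a uniform distribution; the range of n values is an arbitrary choice.

```python
import math

def entropy_bits(probs):
    # Shannon entropy with log base 2; zero-probability events contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform_entropies = {n: entropy_bits([1.0 / n] * n) for n in range(2, 9)}
for n, h in uniform_entropies.items():
    print(n, round(h, 4))
```

The entropy equals 1 bit for n = 2 and grows as log2(n) for larger n.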
8. Suppose we have a problem where data x and label y are related by $y = x^4 + 1$. Which of the
following is not a good choice for the activation function in the hidden layer if the activation
function at the output layer is linear?

a)Linear
b)Relu
c)Sigmoid
d) $\tan^{-1}(x)$

Answer: a)
Solution: If we chose the first activation function then the output of the neural network will
be a linear function of the data since the network is just doing a combination of weight and
biases at every layer, hence we won’t be able to learn the non-linear relationship.

9. We are given that the probability of Event A happening is 0.95 and the probability of Event
B happening is 0.05. Which of the following statements is True? (MSQ)

a)Event A has a high information content


b)Event B has a low information content
c)Event A has a low information content
d)Event B has a high information content

Answer: c),d)
Solution: Events with high probability have low information content while events with low
probability have high information content.
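The relationship can be quantified with the self-information −log2(p), measured in bits. The sketch below applies it to the two probabilities from the question; the function name is chosen for the example.

```python
import math

def information_content(p):
    # Self-information in bits: rarer events carry more information.
    return -math.log2(p)

info_a = information_content(0.95)  # Event A
info_b = information_content(0.05)  # Event B
print(round(info_a, 3), round(info_b, 3))
```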
10. Which of the following activation functions can only give positive outputs greater than 0?
(a) Sigmoid
(b) ReLU
(c) Tanh
(d) Linear
Answer: (a)
Solution: The range of the sigmoid is (0, 1). It can never output 0 for any input.

DEEP LEARNING WEEK 4

1. Which step does Nesterov accelerated gradient descent perform before finding the update
size?
a) Increase the momentum
b) Estimate the next position of the parameters
c) Adjust the learning rate
d) Decrease the step size
Answer: b) Estimate the next position of the parameters
Solution: Nesterov accelerated gradient descent first estimates the next position of the parameters
(a look-ahead step based on the accumulated momentum) and calculates the gradient at that position.
The new position is then determined using this look-ahead gradient together with the momentum term.
2. Select the parameter of vanilla gradient descent that controls the step size in the direction of the
gradient.
a) Learning rate
b) Momentum
c) Gamma
d) None of the above
Answer: a) Learning rate
Solution: Learning rate determines the step size in vanilla gradient descent. Momentum is
not used in the normal gradient descent.
3. What does the distance between two contour lines on a contour map represent?
a) The change in the output of the function
b) The direction of the function
c) The rate of change of the function
d) None of the above
Answer: c) The rate of change of the function
Solution: The distance between two contour lines indicates the rate of change (steepness) of
the function: closely spaced lines mean a steep slope, widely spaced lines a gentle one.
4. Which of the following represents the contour plot of the function $f(x, y) = x^2 - y$?
a) [Contour plot, not recoverable from the extracted text; axis ticks from −4 to 4, contour labels from −20 to 20]
b) [Contour plot, not recoverable from the extracted text; axis ticks from −4 to 4, contour labels from 0 to 25]
c) [Contour plot, not recoverable from the extracted text; axis ticks from −4 to 4, contour labels from −8 to 8]
d) [Contour plot, not recoverable from the extracted text; axis ticks from −4 to 4, contour labels from 0 to 45]
Answer: b)

5. What is the main advantage of using Adagrad over other optimization algorithms?
a)It converges faster than other optimization algorithms.
b)It is less sensitive to the choice of hyperparameters(learning rate).
c)It is more memory-efficient than other optimization algorithms.
d)It is less likely to get stuck in local optima than other optimization algorithms.
Answer: b) The main advantage of using Adagrad over other optimization algorithms is
that it is less sensitive to the choice of hyperparameters.
Solution: Adagrad automatically adapts the learning rate for each weight based on the
gradient history, which makes it less sensitive to the choice of hyperparameters than other
optimization algorithms. This can be especially useful when dealing with high-dimensional
datasets or complex models where the manual tuning of hyperparameters can be
time-consuming and error-prone.
6. We are training a neural network using the vanilla gradient descent algorithm. We observe
that the change in weights is small in successive iterations. What are the possible causes for
the following phenomenon? (MSQ)

a)η is large
b)∇w is small
c)∇w is large
d)η is small

Answer: (b), (d)
Solution: Small weight updates signify that the quantity η∇w is small. This can happen if
∇w or η is small.
7. You are given labeled data, which we call X, where rows are data points and columns are features.
One column has most of its values as 0. What algorithm should we use here for faster
convergence and to achieve the optimal value of the loss function?

a)NAG
b)Adam
c)Stochastic gradient descent

d)Momentum-based gradient descent

Answer: b)
Solution: One of the feature columns is sparse, hence Adam, which adapts the learning rate for
each parameter, would work best here.
The moving averages used in Adam are initialized to zero, which can result in
biased estimates of the first and second moments of the gradient. To address this, Adam
applies bias correction terms to the moving averages, which corrects for the initial bias and
leads to more accurate estimates of the moments.
8. What is the update rule for the ADAM optimizer?

a) $w_t = w_{t-1} - lr \cdot \frac{m_t}{\sqrt{v_t} + \epsilon}$
b) $w_t = w_{t-1} - lr \cdot m$
c) $w_t = w_{t-1} - lr \cdot \frac{m_t}{v_t + \epsilon}$
d) $w_t = w_{t-1} - lr \cdot \frac{v_t}{m_t + \epsilon}$
Answer: a)
Solution: The update rule for the ADAM optimizer is w = w - lr * (m / (sqrt(v) + eps)),
where w is the weight, lr is the learning rate, m is the first-moment estimate, v is the
second-moment estimate, and eps is a small constant to prevent division by zero.
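The update rule can be sketched for a single scalar weight as follows. This is an illustrative sketch, not a library implementation: the function `adam_step`, the toy loss w², and the default values for `lr`, `beta1`, `beta2`, and `eps` are assumptions for the example (the defaults are the commonly quoted ones). It also applies the bias-corrected moment estimates, which the compact formula above leaves implicit.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized moving averages (t starts at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # w = w - lr * m_hat / (sqrt(v_hat) + eps)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
history = [w]
for t in range(1, 4):
    g = 2 * w  # gradient of the toy loss w**2 (assumed for illustration)
    w, m, v = adam_step(w, g, m, v, t)
    history.append(w)
print(history)
```

With bias correction, the very first step has magnitude close to `lr` regardless of the gradient's scale, since m_hat / sqrt(v_hat) is then approximately ±1.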
9. What is the advantage of using mini-batch gradient descent over batch gradient descent?
a) Mini-batch gradient descent is more computationally efficient than batch gradient descent.
b) Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch
gradient descent.
c) Mini batch gradient descent gives us a better solution.
d) Mini-batch gradient descent can converge faster than batch gradient descent.
Answer: a) and d).
Solution: The advantage of using mini-batch gradient descent over batch gradient descent is
that it is more computationally efficient, allows for parallel processing of the training
examples, and can converge faster than batch gradient descent.
10. Which of the following is a variant of gradient descent that uses an estimate of the next
gradient to update the current position of the parameters?
a) Momentum optimization
b) Stochastic gradient descent
c) Nesterov accelerated gradient descent
d) Adagrad
Answer: c) Nesterov accelerated gradient descent
Solution: Nesterov accelerated gradient descent first estimates the next position of the parameters
(a look-ahead step based on the accumulated momentum) and calculates the gradient at that position.
The new position is then determined using this look-ahead gradient together with the momentum term.

DEEP LEARNING WEEK 5

1. Which of the following is a property of eigenvalues of a symmetric matrix?


a) Eigenvalues are always positive
b) Eigenvalues are always real
c) Eigenvalues are always negative
d) Eigenvalues can be complex numbers with a non-zero imaginary part
Answer: b) Eigenvalues are always real
Solution: Eigenvalues are the scalars λ that satisfy the equation Av = λv, where A is a square
matrix and v is an eigenvector. In general, eigenvalues can be complex numbers, but they are
always real for real symmetric matrices, which are commonly used in many applications.
Therefore, option b is correct.
2. What is the determinant of a matrix with eigenvalues λ1 and λ2?
a) λ1 + λ2
b) λ1 - λ2
c) λ1 * λ2
d) λ1 / λ2
Answer: c)λ1*λ2
Solution: The determinant of a matrix equals the product of its eigenvalues.
Therefore, if a matrix has eigenvalues λ1 and λ2, its determinant is given by det(A) = λ1 * λ2. This
implies that option c is correct.
3. Which of the following is a measure of the amount of variance explained by a principal
component in PCA?
a) Eigenvalue
b) Covariance
c) Correlation
d) Mean absolute deviation
Answer: a)
Solution: The eigenvalue of a principal component is a measure of the amount of variance
explained by that component. The larger the eigenvalue, the more variance is explained by
the component. Therefore, option (a) is correct. The covariance and correlation are measures
of the linear relationship between two variables, so options (b) and (c) are incorrect. The
mean absolute deviation is a measure of the average distance between data points and the
mean, so option (d) is also incorrect.
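Questions 4-8 are based on common data.
Consider the following data points x1, x2, x3 to answer the following questions:
$x_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $x_2 = \begin{bmatrix} 0 \\ 2 \end{bmatrix}$, $x_3 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}$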
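4. What is the mean of the given data points x1, x2, x3?
a) $\begin{bmatrix} 3 \\ 3 \end{bmatrix}$
b) $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$
c) $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$
d) $\begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$
Answer: c)
Solution: Mean of x1, x2, x3 $= \frac{x_1 + x_2 + x_3}{3} = \frac{1}{3}\begin{bmatrix} 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$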
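5. The covariance matrix $C = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^T$ is given by: ($\bar{x}$ is the mean of the data points)
a) $\begin{bmatrix} 0.33 & -0.33 \\ -0.33 & 0.33 \end{bmatrix}$
b) $\begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}$
c) $\begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$
d) $\begin{bmatrix} 0.67 & -0.67 \\ -0.67 & 0.67 \end{bmatrix}$
Answer: d)
Solution: With $x'_i = x_i - \bar{x}$, $C = \frac{1}{3}\left(x'_1 {x'_1}^T + x'_2 {x'_2}^T + x'_3 {x'_3}^T\right) = \frac{1}{3}\begin{bmatrix} 2 & -2 \\ -2 & 2 \end{bmatrix}$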
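6. The maximum eigenvalue of the covariance matrix C is:
a) $\frac{1}{3}$
b) $\frac{4}{3}$
c) $\frac{1}{6}$
d) $\frac{1}{2}$
Answer: b)
Solution: Let $A = \frac{1}{3}\begin{bmatrix} 2 & -2 \\ -2 & 2 \end{bmatrix}$. If v is an eigenvector of A we get
$$Av = \lambda v \implies (A - \lambda I)v = 0 \implies |A - \lambda I| = 0 \implies \left(\tfrac{2}{3} - \lambda\right)^2 - \tfrac{4}{9} = 0 \implies \lambda\left(\lambda - \tfrac{4}{3}\right) = 0$$

Hence, λ = 4/3, 0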

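7. The eigenvector corresponding to the maximum eigenvalue of the given matrix C is:
a) $\begin{bmatrix} 0.7 \\ 0.7 \end{bmatrix}$
b) $\begin{bmatrix} -0.7 \\ 0.7 \end{bmatrix}$
c) $\begin{bmatrix} -1 \\ 0 \end{bmatrix}$
d) $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$
Answer: b)
Solution: Using the λ value found above we get the equation $(A - \frac{4}{3}I)v = 0$. The unit
vector in the null space, i.e. v, is the solution to this equation, given by $\begin{bmatrix} -0.7 \\ 0.7 \end{bmatrix}$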
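8. The data points x1, x2, x3 are projected on the eigenvector calculated above. After
projection, what will be the new coordinate of x2? (Hint: $\begin{bmatrix} -0.7 \\ 0.7 \end{bmatrix} \approx \frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$)
a) $\begin{bmatrix} -1 \\ 1 \end{bmatrix}$
b) $\begin{bmatrix} 0 \\ 2 \end{bmatrix}$
c) $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$
d) $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$
Answer: a)
Solution: The required projection is given by $(x^T w)w = \left(\begin{bmatrix} 0 & 2 \end{bmatrix}\frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 1 \end{bmatrix}\right)\frac{1}{\sqrt{2}}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$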
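The chain of results in questions 4 to 8 (mean, covariance, largest eigenvalue, eigenvector, projection) can be reproduced in a few lines. This is a sketch in plain Python that relies on the closed-form eigenvalues of a symmetric 2×2 matrix; the variable names are chosen for the example, and the sign of the eigenvector is a convention (the negated vector is equally valid).

```python
import math

points = [(1.0, 1.0), (0.0, 2.0), (2.0, 0.0)]  # x1, x2, x3
n = len(points)

mean = tuple(sum(p[k] for p in points) / n for k in range(2))
centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]

# Covariance matrix C = (1/n) * sum of outer products of the centered points.
C = [[sum(c[i] * c[j] for c in centered) / n for j in range(2)] for i in range(2)]

# Eigenvalues of a symmetric 2x2 matrix from its trace and determinant.
trace = C[0][0] + C[1][1]
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
disc = math.sqrt(trace * trace - 4 * det)
lam_max, lam_min = (trace + disc) / 2, (trace - disc) / 2

# Unit eigenvector for lam_max from the first row of (C - lam_max * I) v = 0.
vx, vy = C[0][1], lam_max - C[0][0]
norm = math.hypot(vx, vy)
w = (vx / norm, vy / norm)

# Projection of the point x2 onto w, i.e. (x2 . w) w.
x2 = points[1]
dot = x2[0] * w[0] + x2[1] * w[1]
projection = (dot * w[0], dot * w[1])

print(mean, C, lam_max, w, projection)
```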

Height (in)   Weight (lb)
70            155
65            130
72            180
68            160
74            190

9. What is the covariance between height and weight in the given dataset? Use the formula
$$\text{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

a) 121.2
b) 89.6
c) 62.6
d) 74
Answer: c)
Solution: The formula for covariance between two variables X and Y with n observations is:

$$\text{cov}(X, Y) = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$

Using this formula with the given dataset, we get:

cov(Height, Weight) = (1/5)[(70 − 69)(155 − 163) + (65 − 69)(130 − 163) + (72 − 69)(180 − 163) + (68 − 69)(160 − 1…

10. What is the correlation between height and weight in the given dataset
a)0.7
b)1
c)0.96
d)0.59
Answer: c)
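Both answers for this dataset can be checked numerically. The sketch below uses the population formulas (dividing by n), as in the formula given in question 9; the helper `population_cov` is a name chosen for the example. The exact mean height is 69.8 rather than 69, but because the weight deviations sum to zero the covariance is the same either way.

```python
import math

heights = [70, 65, 72, 68, 74]       # inches
weights = [155, 130, 180, 160, 190]  # pounds
n = len(heights)

mean_h = sum(heights) / n
mean_w = sum(weights) / n

def population_cov(xs, ys, mx, my):
    # Covariance with a 1/n normalization, matching the formula in the question.
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

cov_hw = population_cov(heights, weights, mean_h, mean_w)
std_h = math.sqrt(population_cov(heights, heights, mean_h, mean_h))
std_w = math.sqrt(population_cov(weights, weights, mean_w, mean_w))
corr_hw = cov_hw / (std_h * std_w)
print(round(cov_hw, 1), round(corr_hw, 2))
```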

DEEP LEARNING WEEK 6

1. What is the main purpose of a hidden layer in an under-complete autoencoder?


a) To increase the number of neurons in the network
b) To reduce the number of neurons in the network
c) To limit the capacity of the network
d) None of These
Answer: c)
Solution: The hidden layer in an under-complete autoencoder is used to limit the network’s
capacity and force it to learn a compressed representation of the input data.
2. Which of the following problems prevents us from using autoencoders for the task of Image
compression?
a) Images are not allowed as input to autoencoders
b) Difficulty in training deep neural networks
c) Loss of image quality due to compression
d) Auto encoders are not capable of producing image output
Answer: c)
Solution: Autoencoders can suffer from loss of image quality when used for compression,
especially if the bottleneck layer is too small or if the network is not trained properly.
3. Which of the following is a potential advantage of using an overcomplete autoencoder?
a) Reduction of the risk of overfitting
b) Ability to learn more complex and nonlinear representations
c) Faster training time
d) To compress the input data
Answer: b)
Solution: Overcomplete autoencoders have more units in the hidden layer than in the
input layer, which can increase the capacity of the network and allow it to learn more complex
and nonlinear representations of the input data.

4. What is/are the primary advantages of Autoencoders over PCA?


a) Autoencoders are less prone to overfitting than PCA.
b) Autoencoders are faster and more efficient than PCA.
c) Autoencoders require fewer input data than PCA.
d) Autoencoders can capture nonlinear relationships in the input data.
Answer: d)
Solution: Autoencoders can capture nonlinear relationships in the input data, which allows
them to learn more complex representations than PCA. This can be particularly useful in
applications where the input data contains nonlinear relationships that cannot be captured
by a linear method like PCA.

5. Which of the following is a potential disadvantage of using autoencoders for dimensionality
reduction over PCA?
a) Autoencoders are computationally expensive and may require more training data than
PCA.
b) Autoencoders are bad at capturing complex relationships in data

c) Autoencoders may overfit the training data and generalize poorly to new data.
d) Autoencoders are unable to handle linear relationships between data.
Answer: a), c)
Solution: Autoencoders can be computationally expensive and may require more training
data than PCA. They are also more prone to overfitting the training data if not properly
regularized.
6. What is the primary objective of sparse autoencoders that distinguishes it from vanilla
autoencoder?
a) They learn a low-dimensional representation of the input data
b) They minimize the reconstruction error between the input and the output
c) They capture only the important variations/features in the data
d) They maximize the mutual information between the input and the output
Answer: c)
Solution: Sparse autoencoders are designed to promote sparsity in the hidden layer by
adding a sparsity constraint to the objective function. The goal is to encourage the model to
learn a compact representation of the input data that captures only the most salient features.
7. Which of the following networks represents an autoencoder?

a) [Network diagram: input layer x1, x2, x3; hidden layer 1 with h1, h2, h3, h4; output layer ŷ1, ŷ2]
b) [Network diagram: input layer x1, x2, x3; hidden layer 1 with h1, h2; output layer ŷ1, ŷ2]
c) [Network diagram: input layer x1, x2, x3, x4; hidden layer 1 with h1, h2; output layer ŷ1, ŷ2, ŷ3, ŷ4]
d) [Network diagram: input layer x1, x2; hidden layer 1 with h1, h2; output layer ŷ1, ŷ2, ŷ3]

Answer: c)
Solution: An autoencoder is used for learning a representation of the input data. Hence the
output layer’s size should be the same as the input layer’s size, so that the reconstruction
error can be computed.
8. If the dimension of the hidden layer representation is more than the dimension of the input
layer, then what kind of autoencoder do we have?

a)Complete autoencoder
b)Under-complete autoencoder
c)Overcomplete autoencoder
d)Sparse autoencoder
Answer:c)
Solution: If dim(h_i) > dim(x_i), then the given autoencoder is an overcomplete autoencoder.
9. Suppose for one data point we have features x1 , x2 , x3 , x4 , x5 as −2, 12, 4.2, 7.6, 0 then, which
of the following function should we use on the output layer(decoder)?

a)Logistic
b)Relu
c)Tanh
d)Linear
Answer: d)
Solution: Since our data comes from R and not (−1, 1) or (0, 1), Linear function would
work best.
10. If the dimension of the input layer in an under-complete autoencoder is 6, what is the
possible dimension of the hidden layer?

a)6
b)2
c)8
d)0
Answer: b)
Solution: The dimension of the hidden layer is less than the input layer in the
under-complete autoencoder.

DEEP LEARNING WEEK 7

1. Which of the following statements is true about the bias-variance tradeoff in deep learning?
A) Increasing the learning rate reduces bias
B) Increasing the learning rate reduces variance
C) None of These
Answer: C
Solution: The learning rate doesn’t determine the capacity of the model, which is the main cause of
bias or variance.
2. Which of the following statements is true about the bias-variance tradeoff in deep learning?
A) Increasing the size of the training dataset reduces bias
B) Increasing the size of the training dataset reduces variance
C) Decreasing the size of the training dataset reduces bias
D) Decreasing the size of the training dataset reduces variance
Answer: B)
Solution: Increasing the size of the training dataset can help reduce variance in deep learning
models by providing more examples for the model to learn from, which can help reduce the
impact of noise in the training data. Decreasing the size of the training dataset can lead to
overfitting, which increases variance. Therefore, increasing the size of the training dataset
can help reduce variance in deep learning models.
3. What is the effect of high bias on a model’s performance?
a. The model will overfit the training data.
b. The model will underfit the training data.
c. The model will be unable to learn anything from the training data.
d. The model’s performance will be unaffected by bias.
Answer: b
Solution: High bias occurs when a model is too simple and is unable to capture the
complexity of the underlying problem. In this case, the model will underfit the training data,
meaning that it will not be able to generalize well to new data.
4. What is the usual relationship between train error and test error?
a) Train error is usually higher than test error
b) Train error is usually lower than test error
c) Train error and test error are usually the same
d) Train error and test error are unrelated
Answer: b)
Solution: In deep learning, the model is trained on a set of data and then tested on a
separate set of data to measure its performance. The training error is calculated using the
same data that was used to train the model, while the test error is calculated using new,
unseen data. Since the model is optimized to fit the training data, it is expected to have a
lower error on the training data than on new, unseen data. Therefore, the training error is
usually lower than the test error.
5. What is overfitting in deep learning? (MSQ)
a) When the model performs well on the training data but poorly on new, unseen data
b) When the model performs poorly on the training data and on new, unseen data
c) When the model has a high test error and a low train error
d) When the model has a low test error and a high train error
Answer: a),c) When the model performs well on the training data but poorly on new,
unseen data
Solution: Overfitting occurs when the model is too complex and is able to fit the training
data too closely. This results in the model performing well on the training data but poorly
on new, unseen data. This is because the model has essentially memorized the training data
and is not able to generalize to new data.
6. How can overfitting be prevented in deep learning?
a) By increasing the complexity of the model
b) By decreasing the size of the training data
c) By adding more layers to the model
d) By using regularization techniques such as dropout
Answer: d) By using regularization techniques such as dropout
Solution: Regularization techniques such as dropout can be used to prevent overfitting in
deep learning.
7. Which of the following statements is true about L2 regularization?
A. It adds a penalty term to the loss function that is proportional to the absolute value of
the weights.
B. It adds a penalty term to the loss function that is proportional to the square of the
weights.
C. It gives us sparse solutions for w.
D. It is equivalent to adding gaussian noise to the weights.
Answer: B,D
8. Which of the following regularization techniques is likely to produce a sparse weight vector?
A. L1 regularization
B. L2 regularization
C. Dropout
D. Data augmentation
Answer: A
Solution: L1 regularization is likely to produce a sparse weight vector because the penalty
term it adds to the loss function encourages some weights to be exactly zero. In contrast, L2
regularization encourages the weight vector to be small overall, but it does not necessarily
lead to sparsity.
9. We trained different models on data and then we used the bagging technique. We observe
that our test error reduces drastically after using bagging. Choose the correct options.
(MSQ)

a) All models had the same hyperparameters and were trained on the same features
b) All the models were correlated.
c) All the models were uncorrelated(independent).
d) All of these.

Answer: c)
Solution: If the models were correlated, then the covariance of their test errors would not be 0, hence
the test error wouldn’t reduce drastically. If all models have the same hyperparameters and
train on the same set of data, then they are correlated.

10. Which of the following is an example of how to add Gaussian noise to input data x in Deep
Learning? (MSQ)
a) y = x + ϵ, where ϵ ∼ N (0, σ)
b) y = x ∗ ϵ, where ϵ ∼ N (0, σ)
c) y = x − ϵ, where ϵ ∼ N (0, σ)
d) y = x/ϵ, where ϵ ∼ N (0, σ)
Answer: a) ,c) y = x + ϵ, where ϵ ∼ N (0, σ)
Solution: To add Gaussian noise to input data x in Deep Learning, we can use the
equation y = x + ϵ or y = x − ϵ, where ϵ ∼ N (0, σ) represents a small amount of noise sampled
from a Gaussian distribution with zero mean and a predefined standard deviation σ. The
resulting output y will be a noisy version of the input data x, and this can be used during
training to improve the model’s generalization performance.
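A minimal sketch of additive Gaussian input noise, using only the Python standard library. The helper name `add_gaussian_noise` and the example values are assumptions made for illustration; `sigma` is treated as the standard deviation, as in the solution.

```python
import random


def add_gaussian_noise(x, sigma, rng):
    """Return y = x + eps, with eps drawn independently from N(0, sigma) per element."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = [1.0, 2.0, 3.0]
y = add_gaussian_noise(x, sigma=0.1, rng=rng)
```

During training, a fresh noise sample would typically be drawn each time an input is presented.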

DEEP LEARNING WEEK 8

1. Which of the following best describes the concept of saturation in deep learning?
a) When the activation function output approaches either 0 or 1 and the gradient is close to
zero.
b) When the activation function output is very small and the gradient is close to zero.
c) When the activation function output is very large and the gradient is close to zero.
d) None of the above.
Answer: a), b), c)
Solution: Saturation happens when the gradient of the activation function is close to 0,
which is the case in each of the first three options.
2. Which of the following methods can help to avoid saturation in deep learning?
a) Using a different activation function.
b) Increasing the learning rate.
c) Increasing the model complexity
d) All of the above.
Answer: a)
Solution: Using a different activation function such as ReLU can avoid saturation.
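The two solutions above can be checked numerically with a small sketch. It assumes the sigmoid as the saturating activation and compares its gradient with that of ReLU; the function names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


sigmoid_grads = {x: sigmoid_grad(x) for x in (-10.0, 0.0, 10.0)}
relu_grads = {x: relu_grad(x) for x in (0.5, 10.0, 1000.0)}
```

The sigmoid gradient peaks at 0.25 for x = 0 and is nearly zero at x = ±10, whereas the ReLU gradient stays at 1 for every positive input, however large.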

3. Which of the following is true about the role of unsupervised pre-training in deep learning?
a. It is used to replace the need for labeled data
b. It is used to initialize the weights of a deep neural network
c. It is used to fine-tune a pre-trained model
d. It is only useful for small datasets
Answer: b
Solution: Unsupervised pre-training is used to initialize the weights of a deep neural network
before being fine-tuned on labeled data. It is not used to replace the need for labeled data.
This technique is not limited to small datasets and can be used for large datasets as well.
4. Which of the following is an advantage of unsupervised pre-training in deep learning?
a. It helps in reducing overfitting
b. Pre-trained models converge faster
c. It improves the accuracy of the model
d. It requires fewer computational resources
Answer: b,c
Solution: Unsupervised pre-training helps in reducing overfitting in deep neural networks by
initializing the weights in a better way. This technique requires more computational
resources than supervised learning, but it can improve the accuracy of the model.
Additionally, the pre-trained model is shown to converge faster than non-pre-trained models.
5. What is the main cause of the Dead ReLU problem in deep learning?
a) High variance
b) High negative bias
c) Overfitting
d) Underfitting
Answer: b) High negative bias
Solution: The Dead ReLU problem arises when the bias term of a neuron is a large negative
value. This causes the neuron's pre-activation to be negative for most inputs, which in turn
leads to a large number of dead neurons (i.e., neurons that output zero regardless of the
input). Since the gradient through a dead neuron is also zero, its weights stop updating. This
can significantly reduce the expressive power of the network and make it difficult to learn
from the data.

6. How can you tell if your network is suffering from the Dead ReLU problem?
a) The loss function is not decreasing during training
b) The accuracy of the network is not improving
c) A large number of neurons have zero output
d) The network is overfitting to the training data
Answer: c) A large number of neurons have zero output
Solution: The Dead ReLU problem can be detected by checking the output of each neuron
in the network. If a large number of neurons have zero output, then the network may be
suffering from the Dead ReLU problem. This can indicate that the bias term is too negative,
causing a large number of dead neurons.
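The check described above can be sketched as follows. This is a simplified illustration, assuming each neuron has a single scalar weight and a shared bias; `dead_fraction` and the sampled values are hypothetical.

```python
import random


def relu(x):
    return max(0.0, x)


def dead_fraction(weights, bias, inputs):
    """Fraction of neurons that output zero for every input in `inputs`."""
    dead = 0
    for w in weights:
        if all(relu(w * x + bias) == 0.0 for x in inputs):
            dead += 1
    return dead / len(weights)


rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(100)]
weights = [rng.uniform(-1.0, 1.0) for _ in range(50)]

healthy = dead_fraction(weights, bias=0.0, inputs=inputs)
collapsed = dead_fraction(weights, bias=-5.0, inputs=inputs)
```

With a bias of -5 and inputs and weights bounded by 1 in magnitude, every pre-activation is negative, so all neurons in the sketch are dead.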

7. What is the mathematical expression for the ReLU activation function?


a) f(x) = x if x < 0, 0 otherwise
b) f(x) = 0 if x > 0, x otherwise
c) f(x) = max(0,x)
d) f(x) = min(0,x)
Answer: c) f(x) = max(0,x)
Solution: The ReLU activation function is defined as f(x) = max(0,x). This means
that the output of the function is equal to the input x if x is greater than zero, and zero
otherwise.
8. What is the main cause of the symmetry-breaking problem in deep learning?
a) High variance
b) High bias
c) Overfitting
d) Equal initialization of weights
Answer: d) Equal initialization of weights
Solution: The symmetry-breaking problem arises when all the weights in a neural network
are initialized to the same value. This can lead to a situation where all neurons in a layer
compute the same function, making it difficult for the network to learn complex features and
patterns in the data.
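A minimal sketch of why equal initialization is a problem, assuming a tiny network y = Σ_j v_j · tanh(w_j · x) with squared-error loss (the network, `hidden_grads`, and all values are assumptions chosen for illustration). When both hidden units start with the same weights, they receive identical gradients and therefore remain identical after every update.

```python
import math


def hidden_grads(W, v, x, target):
    """Gradients of 0.5 * (y - target)**2 w.r.t. each hidden unit's weights,
    for y = sum_j v[j] * tanh(W[j] . x)."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(w, x))) for w in W]
    y = sum(vj * hj for vj, hj in zip(v, h))
    err = y - target
    return [[err * vj * (1.0 - hj ** 2) * xi for xi in x] for vj, hj in zip(v, h)]


x, target = [1.0, -2.0], 1.0
same = hidden_grads([[0.5, 0.5], [0.5, 0.5]], [0.5, 0.5], x, target)
different = hidden_grads([[0.5, 0.5], [-0.3, 0.8]], [0.5, 0.5], x, target)
```

Random initialization, as in the `different` case, gives each unit its own gradient and so breaks the symmetry.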
9. What is the purpose of Batch Normalization in Deep Learning?
A) To improve the generalization of the model
B) To reduce overfitting in the model
C) To reduce bias in the model
D) To ensure that the distribution of the inputs at different layers doesn’t change.
Answer: D) To ensure that the distribution of the inputs at different layers doesn’t change.
Explanation: Batch Normalization normalizes the inputs to a layer by re-centering and
re-scaling the activations using the mean and variance of the current batch.

10. In Batch Normalization, which parameter is learned during training?
A) Mean
B) Variance
C) γ
D) ϵ
Answer: C) γ
Explanation: In Batch Normalization, the scaling and shifting parameters gamma and beta
are learned during training, while the mean and variance of the inputs are estimated over the
current batch. The small constant epsilon is typically set to a small value, such as 1e-5, to
avoid numerical instability.
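The roles of the four quantities can be seen in a short sketch of the training-time forward pass for a single feature. This is an illustrative implementation, not a library API: `gamma` and `beta` stand for the learned parameters, the mean and variance are computed from the batch, and `eps` is a fixed constant.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


out = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0)
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
```

With gamma = 1 and beta = 0 the output has approximately zero mean and unit variance; during training, gradient descent would adjust gamma and beta like any other weights.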

DEEP LEARNING WEEK 9

1. Which of the following is a disadvantage of one hot encoding?


a) It requires a large amount of memory to store the vectors
b) It can result in a high-dimensional sparse representation
c) It cannot capture the semantic similarity between words
d) All of the above
Answer: d) All of the above
Explanation: One hot encoding has several disadvantages. It requires a large amount of
memory to store the vectors, it can result in a high-dimensional sparse representation, and it
cannot capture the semantic similarity between words.
2. Which of the following is true about the input representation in the CBOW model?
a. Each word is represented as a one-hot vector
b. Each word is represented as a continuous vector
c. Each word is represented as a sequence of one-hot vectors
d. Each word is represented as a sequence of continuous vectors
Answer: a. Each word is represented as a one-hot vector
Solution: In the CBOW model, each word in the context is represented as a one-hot vector,
which is then multiplied by a weight matrix to obtain a continuous vector representation.
These vector representations are then averaged to obtain a single vector representation of the
context.
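The computation described in this solution can be sketched in plain Python. The 4-word vocabulary, the 2-dimensional weight matrix `W`, and the helper names are hypothetical and chosen only to keep the example small.

```python
def one_hot(index, vocab_size):
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def embed(one_hot_vec, W):
    """Multiply a one-hot row vector by W (vocab_size x dim); this selects one row of W."""
    dim = len(W[0])
    return [sum(one_hot_vec[i] * W[i][d] for i in range(len(W))) for d in range(dim)]


def cbow_context_vector(context_indices, W):
    """Average the continuous vectors of all context words."""
    vectors = [embed(one_hot(i, len(W)), W) for i in context_indices]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


W = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]  # hypothetical weights
context = cbow_context_vector([0, 2], W)  # average of rows 0 and 2
```

In practice the multiplication by a one-hot vector is implemented as a row lookup, which gives the same result without the full product.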
3. Which of the following is an advantage of the CBOW model compared to the Skip-gram
model?
a. It is faster to train
b. It requires less memory
c. It performs better on rare words
d. All of the above
Answer: a) It is faster to train
Solution: The CBOW model is faster to train than the Skip-gram model because it involves
predicting a single target word given its context, whereas the Skip-gram model involves
predicting multiple context words given a single target word.
4. Which of the following is an advantage of using the skip-gram method over the bag-of-words
approach?
a) The skip-gram method is faster to train
b) The skip-gram method performs better on rare words
c) The bag-of-words approach is more accurate
d) The bag-of-words approach is better for short texts
Answer: b)
Solution: The skip-gram method performs better on rare words.
5. What is the role of the softmax function in the skip-gram method?
a) To calculate the dot product between the target word and the context words
b) To transform the dot product into a probability distribution
c) To calculate the distance between the target word and the context words
d) To adjust the weights of the neural network during training
Answer: b) To transform the dot product into a probability distribution
Solution: The softmax function is used in the skip-gram method to transform the dot
product between the target word and the context words into a probability distribution. This
distribution represents the likelihood of seeing each context word given the target word, and
is used to train the model by minimizing the cross-entropy loss between the predicted and
actual distributions.
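A minimal sketch of this step, assuming small made-up vectors for the target word and three candidate context words. The softmax makes a single pass over the scores, so its cost grows linearly with the number of candidates.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Turn raw scores into a probability distribution (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


target = [0.5, -0.2]                                    # hypothetical target-word vector
context_vectors = [[0.4, 0.1], [-0.3, 0.9], [0.8, -0.5]]  # hypothetical context-word vectors
scores = [dot(target, c) for c in context_vectors]
probs = softmax(scores)
loss = -math.log(probs[2])  # cross-entropy if word 2 is the observed context word
```

Training would adjust the vectors so that the observed context word receives a higher probability, which lowers this loss.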
6. Suppose we are learning the representations of words using GloVe. We observe that the
cosine similarity between the two representations v_i and v_j for words ‘i’ and ‘j’ is very
high. Which of the following statements is true? (Parameters: b_i = 0.02 and b_j = 0.05.)

a) X_ij = 0.03.
b) X_ij = 0.8.
c) X_ij = 0.35.
d) X_ij = 0.

Answer: b)
Solution: Since the word representations are similar, we know v_i^T v_j is high. GloVe fits
v_i^T v_j ≈ log X_ij − b_i − b_j, so a high dot product implies a high X_ij. Among the options,
the only high value for X_ij is 0.8.
7. We add incorrect pairs into our corpus to maximize the probability of words that occur in
the same context and minimize the probability of words that occur in different contexts.
This technique is called-

a) Hierarchical softmax
b) Contrastive estimation
c) Negative sampling
d) GloVe representations

Answer: c)
Solution: The process of adding incorrect pairs to the training set is called negative sampling.
8. What is the computational complexity of computing the softmax function in the output layer
of a neural network?
a) O(n)
b) O(n^2)
c) O(n log n)
d) O(log n)
Answer: a)
Explanation: The computational complexity of computing the softmax function in the
output layer of a neural network is O(n), where n is the number of output classes.
9. How does Hierarchical Softmax reduce the computational complexity of computing the
softmax function?
a) It replaces the softmax function with a linear function
b) It uses a binary tree to approximate the softmax function
c) It uses a heuristic to compute the softmax function faster
d) It does not reduce the computational complexity of computing the softmax function
Answer: b)
Explanation: Hierarchical Softmax uses a binary tree to approximate the softmax function.
This reduces the computational complexity of computing the softmax function from O(n) to
O(log n).
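The reduction can be illustrated with a simplified sketch. It assumes a complete binary tree over 8 words with one scalar score per internal node (in a real model each score would come from a dot product with a node vector); the function name and the scores are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def word_probability(word, depth, node_scores):
    """P(word) as a product of `depth` binary decisions in a complete binary tree.

    node_scores[k] is the score of internal node k in heap order;
    sigmoid(score) is the probability of branching right at that node.
    """
    node, prob = 0, 1.0
    for level in reversed(range(depth)):
        go_right = (word >> level) & 1          # next bit of the word's path
        p_right = sigmoid(node_scores[node])
        prob *= p_right if go_right else 1.0 - p_right
        node = 2 * node + 1 + go_right          # move to the chosen child
    return prob


depth = 3                                        # 2**3 = 8 words in the vocabulary
node_scores = [0.3, -1.2, 0.8, 0.1, -0.4, 2.0, -0.7]  # 2**depth - 1 internal nodes
probs = [word_probability(w, depth, node_scores) for w in range(2 ** depth)]
```

Each word's probability needs only `depth` = log2(8) = 3 sigmoid evaluations instead of a normalization over all 8 words, and the probabilities of all words still sum to 1.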
10. What is the disadvantage of using Hierarchical Softmax?
a) It requires more memory to store the binary tree
b) It is slower than computing the softmax function directly
c) It is less accurate than computing the softmax function directly
d) It is more prone to overfitting than computing the softmax function directly
Answer: a)
Explanation: The disadvantage of using Hierarchical Softmax is that it requires more
memory to store the binary tree. This can be a problem when dealing with large datasets or
models with a large number of output classes.

DEEP LEARNING WEEK 10

1. Which of the following architectures has the highest number of layers?

a) AlexNet
b) GoogleNet
c) VGG
d) ResNet

Answer: d)
Solution: ResNet has the highest number of layers among the listed architectures.
2. Consider a convolution operation with an input image of size 100x100x3 and a filter of size
8x8x3, using a stride of 1 and a padding of 1. What is the output size?
A. 100x100x3
B. 98x98x1
C. 102x102x3
D. 95x95x1
Answer: D
Solution: Output size = (Input size − Filter size + 2 × Padding)/Stride + 1. Here, Input size =
100x100x3, Filter size = 8x8x3, Padding = 1, Stride = 1, so Output size = (100 − 8 + 2)/1 + 1
= 95. A single filter produces a single output channel, so the output size is 95x95x1. Hence,
the correct answer is option D.
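The formula in this solution can be written as a small helper. The function name is hypothetical, and integer (floor) division is assumed for cases where the stride does not divide evenly.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1


q2_size = conv_output_size(100, 8, padding=1, stride=1)    # 95, as in this question
other_size = conv_output_size(32, 5, padding=0, stride=1)  # 28, a second illustrative case
```

The same helper applies to width and height separately; the output depth is set by the number of filters, not by this formula.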
3. Consider a convolution operation with an input image of size 256x256x3 and 40 filters of size
11x11x3, using a stride of 4 and a padding of 2. What is the height of the output size?
A. 63
B. 64
C. 40
D. 3
Answer: A
Solution: Output height = ⌊(256 − 11 + 2 × 2)/4⌋ + 1 = 62 + 1 = 63. The number of filters (40)
determines the depth of the output, not its height.
4. Which statement is true about the number of filters in CNNs?
a) More filters lead to better accuracy.
b) Fewer filters lead to better accuracy.
c) The number of filters has no effect on accuracy.
d) The number of filters only affects the computation time.
Answer: a) More filters lead to better accuracy.
Solution: More filters can lead to better accuracy because they allow the network to learn
more complex and diverse features. However, increasing the number of filters also increases
the number of parameters in the network.
5. Which of the following statements is true regarding the occlusion experiment in a CNN?
A. It is used to determine the importance of each feature map in the output of the network.
B. It involves masking a portion of the input image with a patch of zeroes.
C. It is a technique used to prevent overfitting in deep learning models.
D. It is used to increase the number of filters in a convolutional layer.
Answer: A, B
Solution: In the occlusion experiment, a patch of zeroes is placed over a portion of the
input image to observe the effect on the output of the network. This helps to determine the
importance of each region of the image in the network’s prediction.
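The procedure can be sketched as follows. `toy_score` is a hypothetical stand-in for a trained CNN's class score (it only looks at the top-left 2x2 block), and `occlusion_map` is an illustrative helper, not a library function.

```python
def occlusion_map(image, score_fn, patch):
    """Drop in score_fn when each patch x patch block of `image` is replaced by zeroes."""
    base = score_fn(image)
    rows, cols = len(image), len(image[0])
    heat = []
    for r in range(0, rows, patch):
        heat_row = []
        for c in range(0, cols, patch):
            occluded = [row[:] for row in image]  # copy so the original is untouched
            for i in range(r, min(r + patch, rows)):
                for j in range(c, min(c + patch, cols)):
                    occluded[i][j] = 0.0
            heat_row.append(base - score_fn(occluded))
        heat.append(heat_row)
    return heat


def toy_score(image):
    # Hypothetical scoring function standing in for the network's output.
    return sum(image[i][j] for i in range(2) for j in range(2))


image = [[1.0] * 4 for _ in range(4)]
heat = occlusion_map(image, toy_score, patch=2)
```

Regions whose occlusion causes a large drop in the score are the ones the model relies on; here only the top-left block matters.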
6. Which of the following is an innovation introduced in GoogleNet architecture?
a) 1x1 convolutions to reduce the dimension
b) ReLU activation function
c) Dropout regularization
d) use of different-sized filters for the same input
Answer: a), d)
Solution: GoogleNet introduced the inception module, which uses 1x1 convolutions to reduce
the dimension of the input and then applies different-sized filters to the same reduced input
to get different feature maps, before concatenating them and sending them to the further
layers.

7. What is the purpose of guided backpropagation in CNNs?


a) To visualize which pixels in an image are most important for a particular class prediction.
b) To train the CNN to improve its accuracy on a given task.
c) To reduce the size of the input images in order to speed up computation.
d) None of the above.
Answer: a)
Explanation: Guided backpropagation is a technique used to visualize the parts of an input
image that are most important for a particular class prediction. It achieves this by
backpropagating the gradients of the output class with respect to the input image, but only
allowing positive gradients to flow through the network.
8. Which layer in a CNN is used for guided backpropagation?
a) Input layer
b) Convolutional layer
c) Activation layer
d) Pooling layer
Answer: c)
Explanation: Guided backpropagation is typically applied to the activation layers in a
CNN since these layers contain the most relevant information about which parts of the input
image are contributing to the output.
9. Which of the following is a technique used to fool CNNs in Deep Learning?
a) Adversarial examples
b) Transfer learning
c) Dropout
d) Batch normalization
Answer: a) Adversarial examples
Solution: Adversarial examples are images that have been specifically designed to trick a
CNN into misclassifying them. They are created by making small, imperceptible changes to
an image that cause the CNN to output the wrong classification.

10. We have a trained CNN. The picture on the left, when fed into the network as input, is given
the label ‘HUMAN’ with high probability. The picture on the right is the same image with some
added noise. If we feed the right image as input to the CNN, which of the following statements
is true?

[Figure: left image (original) and right image (the same image with added noise)]

a) CNN will detect the image as ‘HUMAN’
b) CNN will not detect the image as ‘HUMAN’ since noise is added to the image.
c) CNN will detect the image as ‘HUMAN’ but with a lower probability than the left image.
d) Insufficient information to say anything
Answer: d)
Solution: The CNN may detect this image as ‘HUMAN’ or ‘NOT HUMAN’ depending upon the
decision boundary it has learned. We can’t say what will happen, since the added noise may
push the image across the decision boundary, or push it further inside, which would
increase the probability score given by the CNN to the image.

DEEP LEARNING WEEK 11

1. Which of the following is a limitation of traditional feedforward neural networks in handling
sequential data? (MSQ)
a) They can only process fixed-length input sequences
b) They can handle variable-length input sequences
c) They can’t model temporal dependencies between sequential data
d) They are not affected by the order of input sequences
Answer: a), c), d)
Solution: Traditional feedforward neural networks are limited in their ability to handle
sequential data because they can only process fixed-length input sequences. In contrast,
recurrent neural networks (RNNs) can handle variable-length input sequences and model the
temporal dependencies between sequential data.

2. Which of the following is a common architecture used for sequence learning in deep learning?
a) Convolutional Neural Networks (CNNs)
b) Autoencoders
c) Recurrent Neural Networks (RNNs)
d) Generative Adversarial Networks (GANs)
Answer: c) Recurrent Neural Networks (RNNs)
Solution: Recurrent Neural Networks (RNNs) are a common architecture used for sequence
learning in deep learning. RNNs are designed to handle sequential data by maintaining a
hidden state that captures the context of the previous inputs in the sequence. This allows
RNNs to model the temporal dependencies between sequential data.

3. What is the vanishing gradient problem in training RNNs?


a) The weights of the network converge to zero during training
b) The gradients used for weight updates become too large
c) The gradients used for weight updates become too small
d) The network becomes overfit to the training data
Answer: c) The gradients used for weight updates become too small
Solution: The vanishing gradient problem is a common issue in training RNNs where the
gradients used for weight updates become too small, making it difficult to learn long-term
dependencies in the input sequence. This can lead to poor performance and slow convergence
during training.
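A simplified numerical sketch of the effect: it assumes the gradient is multiplied by the same scalar factor at every time step during backpropagation, which is a strong simplification of the real Jacobian products. The function name and the factors are illustrative.

```python
def gradient_magnitude(factor, steps):
    """Magnitude of a unit gradient after being multiplied by `factor` at each time step."""
    g = 1.0
    for _ in range(steps):
        g *= factor
    return g


vanishing = gradient_magnitude(0.5, 50)  # factor below 1: shrinks toward zero
exploding = gradient_magnitude(1.5, 50)  # factor above 1: grows without bound
```

After 50 steps the first gradient is far too small to produce a meaningful weight update, which is why dependencies that span many time steps are hard to learn.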

4. Which of the following is the main disadvantage of using BPTT?


a) It is computationally expensive.
b) It is difficult to implement.
c) It requires a large amount of data.
d) It is prone to overfitting.
Answer: a) It is computationally expensive.
Solution: The main disadvantage of using BPTT is that it can be computationally
expensive, especially for long sequences. This is because the network needs to be unrolled for
each timestep in the sequence, which can result in a large number of stored activations and
calculations. Additionally, the use of gradient descent for weight updates can result in slow
convergence and potentially unstable learning.

In BPTT, what is the role of the error gradient?
a) To update the weights of the connections between the neurons.
b) To propagate information backward through time.
c) To determine the output of the network.
d) To adjust the learning rate of the network.
Answer: b) To propagate information backward through time.
Solution: In BPTT, the error gradient is used to propagate information backward through
time by computing the derivative of the error with respect to each weight in the network.
This allows the network to learn from past inputs and to use that information to make
predictions about future inputs.
5. Arrange the following operations in the order in which an LSTM performs them at time step t.
[Selectively read, Selectively write, Selectively forget]

a) Selectively read, Selectively write, Selectively forget
b) Selectively write, Selectively read, Selectively forget
c) Selectively read, Selectively forget, Selectively write
d) Selectively forget, Selectively write, Selectively read

Answer: c)
Solution: At time step t we first selectively read from the state s_{t−1}, then selectively forget
to create the state s_t. Then we selectively write to create the state h_t from s_t, which will be
used in time step t+1.
6. What are the problems in the RNN architecture? (MSQ)

a) Morphing of information stored at each time step.
b) Exploding and vanishing gradient problem.
c) Errors caused at time step t_n can’t be related to previous time steps far away.
d) All of the above

Answer: d)
Solution: Information stored in the network gets morphed at every time step due to new
input. Exploding and vanishing gradient problems are caused by the long dependency chains
in RNNs, which also make it hard to relate an error at time step t_n to time steps far in the
past.
7. What is the purpose of the forget gate in an LSTM network?
A) To decide how much of the cell state to keep from the previous time step
B) To decide how much of the current input to add to the cell state
C) To decide how much of the current cell state to output
D) To decide how much of the current input to output
Answer: A) To decide how much of the cell state to keep from the previous time step
Explanation: The forget gate in an LSTM network determines how much of the previous
cell state to forget and how much to keep for the current time step.
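A minimal sketch of the cell-state update, assuming the standard element-wise form s_t = f_t ⊙ s_{t−1} + i_t ⊙ s̃_t. For simplicity the gate values are supplied directly instead of being computed by sigmoid layers, and the function name is hypothetical.

```python
def cell_state_update(prev_state, candidate, forget_gate, input_gate):
    """Element-wise s_t = f_t * s_{t-1} + i_t * candidate_t."""
    return [f * s + i * c
            for f, s, i, c in zip(forget_gate, prev_state, input_gate, candidate)]


prev_state = [2.0, -1.0, 0.5]
candidate = [0.3, 0.3, 0.3]

# Forget gate fully open, input gate closed: the previous state is kept unchanged.
kept = cell_state_update(prev_state, candidate, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0])
# Forget gate closed, input gate fully open: the previous state is erased.
erased = cell_state_update(prev_state, candidate, [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
```

In a trained LSTM the gate values lie between 0 and 1, so each component of the previous state is partially kept and partially overwritten.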
8. Which of the following is the formula for calculating the update gate in a GRU network?
A) z_t = σ(W_z ∗ [h_{t−1}, x_t])
B) z_t = σ(W_z ∗ h_{t−1} + U_z ∗ x_t)
C) z_t = σ(W_z ∗ h_{t−1} + U_z ∗ x_t + b_z)
D) z_t = tanh(W_z ∗ h_{t−1} + U_z ∗ x_t)
Answer: C) z_t = σ(W_z ∗ h_{t−1} + U_z ∗ x_t + b_z)
Common data for questions 9-10
We are given the following RNN. We are also given the architecture for this RNN (doesn’t
include W connecting the states of the network).

[Figure: one state of the RNN, with an input layer (x1, x2, x3), one hidden layer
(h1(1), h2(1), h3(1), h4(1)), and an output layer (ŷ1, ŷ2, ŷ3)]

9. How many neurons are in the hidden layer at state s2 of the RNN?
a) 6
b) 2
c) 9
d) 4
Answer: d)
Solution: An RNN has a single architecture that is reused at every time step. The different
blocks in the picture represent the state of the network at different times, so the hidden layer
has 4 neurons at every state.
10. We have trained the above given RNN and it has learned weights and biases accordingly. If
the weight from x1 to h1(1) at s5 is 3, what will be the value of the same weight at s6?
a) 3
b) 6
c) 4
d) 1
Answer: a)
Solution: Weights for all the states are the same in RNN.
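Weight sharing across states can be seen in a minimal scalar RNN sketch. The function `run_rnn`, the tanh activation, and the parameter values are assumptions for illustration, with `w_in` set to 3 to echo the question.

```python
import math


def run_rnn(inputs, w_in, w_rec, bias):
    """Run a scalar RNN over a sequence and return the state at every time step."""
    state = 0.0
    states = []
    for x in inputs:
        # The same w_in, w_rec, and bias are reused at every time step.
        state = math.tanh(w_in * x + w_rec * state + bias)
        states.append(state)
    return states


states = run_rnn([1.0, 0.5, -0.2, 0.8], w_in=3.0, w_rec=0.5, bias=0.0)
```

The loop never creates per-step copies of the parameters, which is why the weight has the same value at s5 and s6.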

DEEP LEARNING WEEK 12

1. We are performing the task of "Image Question Answering" using the encoder-decoder
model. Choose the equation representing the decoder model for this task. (MSQ)

a) CNN(x_i)
b) RNN(s_{t−1}, e(ŷ_{t−1}))
c) P(y|q, I) = softmax(V s + b)
d) RNN(x_{it})

Answer: c)
Solution: In the following task our output is coming from a fixed vocabulary. Hence we just
need to select the word with the highest output probability based on the representations of
the inputs learned by our encoder model.
2. Which of the following is a disadvantage of using an encoder-decoder model for
sequence-to-sequence tasks?
a) The model requires a large amount of training data
b) The model is slow to train and requires a lot of computational resources
c) The generated output sequences may be limited by the capacity of the model
d) The model is prone to overfitting on the training data
Answer: b) The model is slow to train and requires a lot of computational resources
Solution: Encoder-decoder models are powerful but computationally expensive models that
require a lot of training data and computational resources to train. The training process can
be slow and may require the use of specialized hardware such as GPUs. Additionally, the
capacity of the model may limit the quality of the generated output sequences.
3. Which of the following is NOT a component of the attention mechanism?
A. Decoder
B. Key
C. Value
D. Encoder
Answer: A, D
Solution: The attention mechanism consists of three components: query, key, and value.
The query is derived from the current state of the decoder, while the keys and values are
derived from the states of the encoder. The encoder and decoder themselves are not
components of the attention mechanism.
4. What is the purpose of the softmax function in the attention mechanism?
A. To normalize the attention weights
B. To compute the dot product between the query and key vectors
C. To compute the element-wise product between the query and key vectors
D. To apply a non-linear activation function to the attention weights
Answer: A
Solution: The softmax function is used to normalize the attention weights so that they sum
to 1. This allows the weights to be interpreted as a probability distribution over the input
sequence. The dot product between the query and key vectors is used to compute the raw
attention scores, and the element-wise product is used in some variations of the attention
mechanism.
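A minimal sketch of dot-product attention with softmax normalization, written with plain lists. The function name and the example query, keys, and values are hypothetical.

```python
import math


def attention(query, keys, values):
    """Dot-product attention: softmax over query-key scores, then a weighted sum of values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights, context = attention([1.0, 0.0], keys, values)
uniform_weights, _ = attention([0.0, 0.0], keys, values)
```

The weights always sum to 1, the key most aligned with the query gets the largest weight, and a query with no preference (all-zero scores) yields uniform weights.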

5. Which of the following is a common variant of the attention mechanism?
A. Self-attention
B. Multi-task attention
C. Adversarial attention
D. Transfer learning attention
Answer: A
Solution: Self-attention, also known as intra-attention, is a common variant of the attention
mechanism. It allows the model to attend to different parts of the input sequence while
generating the output sequence. Multi-task attention refers to using attention across multiple
tasks, while adversarial attention and transfer learning attention are not common variants of
the attention mechanism.
6. Which of the following is a major advantage of using an attention mechanism in an
encoder-decoder model?
A. Reduced computational complexity
B. Improved generalization to new data
C. Reduced risk of overfitting
D. None of These
Answer: B
Solution: One advantage of using an attention mechanism in an encoder-decoder model is
improved generalization to new data. The attention mechanism allows the model to
selectively focus on different parts of the input sequence, which can be particularly useful
when the input and output sequences are of different lengths. This can help the model
generalize better to new data.
7. Which of the following is a commonly used attention mechanism in the encoder-decoder
model?
a) Dot product attention
b) Additive attention
c) Multiplicative attention
d) All of the above
Answer: a) Dot product attention
Solution: There are several types of attention mechanisms that can be used in the
encoder-decoder model, including dot product attention, additive attention, and
multiplicative attention. Each of these mechanisms has its own strengths and weaknesses,
and the choice of which one to use will depend on the specific task and dataset.
8. Which of the following output functions is most commonly used in the decoder of an
encoder-decoder model for translation tasks?
a) Sigmoid
b) ReLU
c) Softmax
d) Tanh
Answer: c) Softmax
Solution: The softmax activation function is commonly used in the output layer of the
decoder in an encoder-decoder model. It is used to convert the outputs of the decoder into a
probability distribution over the vocabulary of the output sequence. This allows the model to
generate a coherent and meaningful output sequence.
9. In the encoder-decoder model, what is the role of the decoder?
a) To generate output based on the input representations.
b) To encode the input
c) To learn the attention mechanism
d) None of the above
Answer: a) To generate output based on the input representations.
Solution: The decoder in the encoder-decoder model takes the output of the attention
mechanism as input and generates the final output based on the task at hand. This could be
an image caption, a translation, or any other type of output.
10. We are performing a task where we generate the summary for an image using the
encoder-decoder model. Choose the correct statements. (MSQ)

a) LSTM is used as the decoder.
b) CNN is used as the decoder.
c) LSTM is used as the encoder.
d) None of These

Answer: a)
Solution: We use a CNN as the encoder to learn a representation of the image, which is fed as
state 0 to the LSTM decoder.
