DL Assignment Solutions

The document covers the opening weeks (Weeks 1–5) of a deep learning course, focusing on perceptron models, decision boundaries, and the perceptron algorithm's capabilities. It includes multiple-choice questions with correct answers and explanations related to classification, neural networks, gradient descent, and sigmoid functions. Key concepts include linear separability, weight updates, and the behavior of activation functions in neural networks.


Deep Learning - Week 1

Common data for questions 1,2 and 3


In the figure shown below, the blue points belong to class 1 (positive class) and the
red points belong to class 0 (negative class). Suppose that we use a perceptron model,
with the weight vector w as shown in the figure, to separate these data points. We
define that a point belongs to class 1 if wᵀx ≥ 0; otherwise it belongs to class 0.

[Figure: the region of the x–y plane with axis ticks at −1 and 1, point C at (−0.6, 0.2), point G at (0.5, −0.5), and the weight vector w pointing along the positive y-axis]

1. The points G and C will be classified as?


Note: the notation (G, 0) denotes the point G will be classified as class-0 and (C, 1)
denotes the point C will be classified as class-1

(a) (C, 0), (G, 0)


(b) (C, 0), (G, 1)
(c) (C, 1), (G, 1)
(d) (C, 1), (G, 0)

Correct Answer: (d)


Solution:

w = (0, 1.25)ᵀ, and the classification rule is: y = 1 if wᵀx > 0, else y = 0.

For C(−0.6, 0.2):

wᵀx = (0)(−0.6) + (1.25)(0.2) = 0.25 > 0, ∴ (C, 1)

For G(0.5, −0.5):

wᵀx = (0)(0.5) + (1.25)(−0.5) = −0.625 ≤ 0, ∴ (G, 0)
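The two dot products above can be checked in a few lines; a minimal sketch (the coordinates and w are taken from the solution, and the rule y = 1 iff wᵀx > 0 follows the solution's convention):

```python
# Perceptron classification: class 1 if w^T x > 0, class 0 otherwise.
w = [0.0, 1.25]

def classify(x, w=w):
    """Return 1 if w^T x > 0, else 0."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > 0 else 0

C = (-0.6, 0.2)   # w^T C = 0.25   -> class 1
G = (0.5, -0.5)   # w^T G = -0.625 -> class 0
```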
2. The statement that “there exists more than one decision lines that could separate
these data points with zero error” is,

(a) True
(b) False

Correct Answer: (a)


Solution: The given statement is True.
In the perceptron algorithm, when the data points are linearly separable, there can
exist multiple hyperplanes (decision lines) that perfectly classify the data points with
zero error. This is because a decision boundary depends on the orientation of the
separating hyperplane and the margin around it, which can vary as long as it satisfies
the linear separability condition.
For example, in the graph provided, multiple lines can separate the red and blue data
points such that all points are correctly classified. These decision boundaries can
differ in slope and position while still achieving zero classification error. Hence, the
solution is True.

3. Suppose that we multiply the weight vector w by −1. Then the same points G and
C will be classified as?

(a) (C, 0), (G, 0)


(b) (C, 0), (G, 1)
(c) (C, 1), (G, 1)
(d) (C, 1), (G, 0)

Correct Answer: (b)


Solution: Simply multiply w by −1 and repeat the calculations from question 1.

4. Which of the following can be achieved using the perceptron algorithm in machine
learning?

(a) Grouping similar data points into clusters, such as organizing customers based
on purchasing behavior.
(b) Solving optimization problems, such as finding the maximum profit in a business
scenario.
(c) Classifying data, such as determining whether an email is spam or not.
(d) Finding the shortest path in a graph, such as determining the quickest route
between two cities.

Correct Answer: (c)


Solution: The perceptron is a classifier, and it can only handle linearly separable data.
5. Consider the following table, where x1 and x2 are features and y is a label
x1 x2 y
0 0 1
0 1 1
1 0 1
1 1 0
Assume that the elements in w are initialized to zero and the perceptron learning
algorithm is used to update the weights w. If the learning algorithm runs for long
enough iterations, then

(a) The algorithm never converges


(b) The algorithm converges (i.e., no further weight updates) after some iterations
(c) The classification error remains greater than zero
(d) The classification error becomes zero eventually

Correct Answer: (b),(d)


Solution: Since the data points are linearly separable (the table is the NAND function), the algorithm converges; you can visualize this with a graphing tool.
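The table above is linearly separable, so the perceptron convergence theorem applies; a small sketch of the perceptron learning rule (folding the bias in as a constant input x0 = 1, an assumption not stated in the question) shows the updates stopping after a few passes:

```python
# Perceptron learning on the table (it is NAND); bias folded in as x0 = 1.
data = [((1, 0, 0), 1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]
w = [0.0, 0.0, 0.0]   # weights initialized to zero, as in the question

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

converged = False
for epoch in range(100):
    errors = 0
    for x, y in data:
        y_hat = predict(w, x)
        if y_hat != y:                 # standard perceptron update
            errors += 1
            for i in range(3):
                w[i] += (y - y_hat) * x[i]
    if errors == 0:                    # a full pass with no updates
        converged = True
        break
```

Convergence here means no further weight updates occur, and the classification error is zero, matching options (b) and (d).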

6. We know from the lecture that the decision boundary learned by the perceptron is a
line in R2 . We also observed that it divides the entire space of R2 into two regions,
suppose that the input vector x ∈ R4 , then the perceptron decision boundary will
divide the whole R4 space into how many regions?

(a) It depends on whether the data points are linearly separable or not.
(b) 3
(c) 4
(d) 2
(e) 5

Correct Answer: (d)


Solution: A line will become a hyperplane in R4 but still it will divide the region in
2 halves.

7. Choose the correct input-output pair for the given MP Neuron.


f (x) = 1 if x1 + x2 + x3 < 2, and f (x) = 0 otherwise

(a) y = 1 for (x1 , x2 , x3 ) = (0, 0, 0)


(b) y = 0 for (x1 , x2 , x3 ) = (0, 0, 1)
(c) y = 1 for (x1 , x2 , x3 ) = (1, 0, 0)
(d) y = 1 for (x1 , x2 , x3 ) = (1, 1, 1)
(e) y = 0 for (x1 , x2 , x3 ) = (1, 0, 1)

Correct Answer: (a),(c),(e)


Solution: Substituting values into the above expression and evaluating them yields
the result.

8. Consider the following table, where x1 and x2 are features (packed into a single column
vector x = (x1 , x2 )ᵀ ) and y is a label:

x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 1

Suppose that the perceptron model is used to classify the data points. Suppose
further that the weights are initialized to w = (1, 1)ᵀ . The following rule is used for
classification:

y = 1 if wᵀx > 0, and y = 0 if wᵀx ≤ 0

The perceptron learning algorithm is used to update the weight vector w. Then, how
many times the weight vector w will get updated during the entire training process?

(a) 0
(b) 1
(c) 2
(d) Not possible to determine

Correct Answer: (a)


Solution: Upon computing wT x for all data points with the initial weight values, all
the points are correctly classified. Hence, update is not required.

9. Which of the following threshold values of MP neuron implements AND Boolean


function? Assume that the number of inputs to the neuron is 3 and the neuron does
not have any inhibitory inputs.

(a) 1
(b) 2
(c) 3
(d) 4
(e) 5
Correct Answer: (c)
Solution: Suppose we set θ = 4; the sum of the three inputs never exceeds 3, so the
neuron would never fire. And if we set θ < 3, the neuron would fire on inputs other
than (1, 1, 1), which does not satisfy the AND operator. Hence θ = 3.
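A quick sketch of an MP neuron (excitatory inputs only, as the question states) confirms that θ = 3 reproduces 3-input AND over all eight input triples:

```python
# McCulloch-Pitts neuron: fires (outputs 1) iff the input sum reaches theta.
def mp_neuron(inputs, theta):
    return 1 if sum(inputs) >= theta else 0

# With 3 excitatory inputs, theta = 3 matches 3-input AND on every triple.
triples = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
and_ok = all(mp_neuron(t, 3) == (1 if all(t) else 0) for t in triples)
```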
 
10. Consider the points shown in the picture. The weight vector is w = (−1, 1)ᵀ . As per this weight
vector, the Perceptron algorithm will predict which classes for the data points x1 and
x2 ?
NOTE: y = 1 if wᵀx > 0, and y = −1 if wᵀx ≤ 0

[Figure: points x1 (1.5, 2) and x2 (−2.5, −2) plotted in the plane, with the weight vector w drawn from the origin]

(a) x1 = −1
(b) x1 = 1
(c) x2 = −1
(d) x2 = 1

Correct Answer: (b),(d)

Solution: The decision boundary is wᵀx = 0. For w = (−1, 1)ᵀ , anything on the same side
as w has wᵀx > 0 and gets labelled 1. Both points qualify: wᵀx1 = (−1)(1.5) + (1)(2) = 0.5 > 0
and wᵀx2 = (−1)(−2.5) + (1)(−2) = 0.5 > 0.
Deep Learning - Week 2

1. Which of the following statements is(are) true about the following function?
σ(z) = 1/(1 + e^(−z))

(a) The function is monotonic


(b) The function is continuously differentiable
(c) The function is bounded between 0 and 1
(d) The function attains its maximum when z → ∞

Correct Answer: (a),(b),(c),(d)


Solution: Plot the function with a graphing tool and verify each of the options.
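Instead of plotting, the monotonicity and boundedness claims can be spot-checked numerically on a grid; a minimal sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

zs = [z / 10.0 for z in range(-100, 101)]   # grid on [-10, 10]
vals = [sigmoid(z) for z in zs]

monotonic = all(a < b for a, b in zip(vals, vals[1:]))  # strictly increasing
bounded = all(0.0 < v < 1.0 for v in vals)              # stays inside (0, 1)
```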

2. How many weights does a neural network have if it consists of an input layer with 2
neurons, two hidden layers each with 5 neurons, and an output layer with 2 neurons?
Assume there are no bias terms in the network.
Correct Answer: 45
Solution: Number of weights = (2 ∗ 5) + (5 ∗ 5) + (5 ∗ 2) = 45.

3. A function f (x) is approximated using 100 tower functions. What is the minimum
number of neurons required to construct the network that approximates the function?

(a) 99
(b) 100
(c) 101
(d) 200
(e) 201
(f) 251

Correct Answer: (e)


Solution: To approximate one rectangle, we need 2 neurons. Therefore, to create 100
towers, we will require 200 neurons. An additional neuron is required for aggregation.

4. Suppose we have a Multi-layer Perceptron with an input layer, one hidden layer and
an output layer. The hidden layer contains 32 perceptrons. The output layer contains
one perceptron. Choose the statement(s) that are true about the network.

(a) Each perceptron in the hidden layer can take in only 32 Boolean inputs
(b) Each perceptron in the hidden layer can take in only 5 Boolean inputs
(c) The network is capable of implementing 2^5 Boolean functions
(d) The network is capable of implementing 2^32 Boolean functions
Correct Answer: (d)
Solution: In the lecture, we have seen that if the hidden layer contains 2^n neurons,
where n is the number of inputs, then the network can implement every Boolean
function of n inputs; there are 2^(2^n) such functions. Here 32 = 2^5, so n = 5, and
the network can implement all 2^(2^5) = 2^32 Boolean functions of 5 inputs.

5. Consider a function f (x) = x3 − 5x2 + 5. What is the updated value of x after 2nd
iteration of the gradient descent update, if the learning rate is 0.1 and the initial value
of x is 5?
Correct Answer: range(3.1,3.2)
Solution: We are tasked to find the updated value of x after the second iteration of
gradient descent for the function:

f (x) = x3 − 5x2 + 5

Step 1: Compute the Gradient The gradient of f (x) is given by:

f ′ (x) = 3x2 − 10x

Step 2: Gradient Descent Update Rule The update rule for gradient descent is:

xnew = xold − η · f ′ (xold )

where η is the learning rate.


Step 3: Initial Parameters

• Initial x = 5
• Learning rate η = 0.1

Step 4: Iteration 1 At x = 5:

f ′ (5) = 3(5)2 − 10(5) = 75 − 50 = 25

Update x:
xnew = 5 − 0.1 · 25 = 5 − 2.5 = 2.5

Step 5: Iteration 2 At x = 2.5:

f ′ (2.5) = 3(2.5)2 − 10(2.5) = 3(6.25) − 25 = 18.75 − 25 = −6.25

Update x:
xnew = 2.5 − 0.1 · (−6.25) = 2.5 + 0.625 = 3.125

Final Answer The updated value of x after the second iteration is:

x = 3.125
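The two updates above can be reproduced in a few lines:

```python
# Two gradient descent steps on f(x) = x^3 - 5x^2 + 5, as in the solution.
def grad(x):
    return 3 * x**2 - 10 * x   # f'(x) = 3x^2 - 10x

x, eta = 5.0, 0.1
for _ in range(2):
    x = x - eta * grad(x)      # x: 5.0 -> 2.5 -> 3.125
```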

6. Consider the sigmoid function σ(x) = 1/(1 + e^(−(wx+b)) ), where w is a positive value. Select all the
correct statements regarding this function.
(a) Increasing the value of b shifts the sigmoid function to the right (i.e., towards
positive infinity)
(b) Increasing the value of b shifts the sigmoid function to the left (i.e., towards
negative infinity)
(c) Increasing the value of w decreases the slope of the sigmoid function
(d) Increasing the value of w increases the slope of the sigmoid function

Correct Answer: (b),(d)


Solution: Plot the sigmoid function using graphing tools, keeping w and b as vari-
ables. Observe how the slope and y-intercept of the sigmoid function change.

7. You are training a model using the gradient descent algorithm and notice that the
loss decreases and then increases after each successive epoch (pass through the data).
Which of the following techniques would you employ to enhance the likelihood of the
gradient descent algorithm converging? (Here, η refers to the step size.)

(a) Set η = 1
(b) Set η = 0
(c) Decrease the value of η
(d) Increase the value of η

Correct Answer: (c)


Solution: The loss is oscillating around the minimum, indicating that our η (step
size) is too high. Hence, lowering η will increase the likelihood of converging to the
minimum.

8. The diagram below shows three functions f , g and h. The function h is obtained by
combining the functions f and g. Choose the right combination that generated h.
[Figure: f (x) is a sigmoid rising from 0 to 1, centred near x = 0.25; g(x) is a sigmoid rising from 0 to 1, centred near x = −0.25; h(x) plateaus at 0.5 around x = 0 and falls to 0 elsewhere; all three are plotted on x ∈ [−1, 1]]

(a) h = f − g
(b) h = 0.5 ∗ (f + g)
(c) h = 0.5 ∗ (f − g)
(d) h = 0.5 ∗ (g − f )

Correct Answer: (d)


Solution:
To verify the solution h = 0.5 · (g − f ), we analyze the given graphs:
Observing f : The function f is a sigmoid function centered at x = 0.25, transitioning
smoothly from 0 to 1 as x increases.
Observing g: The function g is another sigmoid function but shifted to the left,
centered approximately at x = −0.25. It also transitions smoothly from 0 to 1.
Observing h: The function h exhibits the following characteristics:
• A plateau around x = 0 with a constant value of 0.5.
• A transition to 0 outside the overlapping regions of f and g.

Derivation of h = 0.5 · (g − f ):

(a) Both f and g are sigmoid functions with different shifts.


(b) The difference g −f is positive where g > f , creating the observed plateau effect.
(c) By scaling g − f by 0.5, the difference is normalized to a maximum of 0.5, as
seen in the graph for h(x).
(d) Outside the overlapping regions, f ≈ 0 or g ≈ 0, so h(x) trends toward 0, as
expected.

Thus, the equation h = 0.5 · (g − f ) perfectly describes the observed behavior of h(x).
The function h(x) is correctly given by:

h(x) = 0.5 · (g(x) − f (x)).

9. Consider the data points as shown in the figure below,

[Figure: two data points, (0.06, 0.4) and (0.4, 0.95), plotted in the x–y plane]

Suppose that the sigmoid function given below is used to fit these data points:
f (x) = 1/(1 + e^(−(20x+1)) )
Compute the Mean Square Error (MSE) loss L(w, b)

(a) 0
(b) 0.126
(c) 1.23
(d) 1
Correct Answer: (b)
Solution: The given sigmoid function is:

f (x) = 1/(1 + e^(−(20x+1)) )

and the Mean Square Error (MSE) loss is defined as:

L(w, b) = (1/n) Σᵢ₌₁ⁿ (f (xᵢ) − yᵢ)²

Step 1: Data Points. The given data points are:

(x1 , y1 ) = (0.06, 0.4), (x2 , y2 ) = (0.4, 0.95)

Step 2: Predicted Values. Using the sigmoid function:

f (x1 ) = 1/(1 + e^(−(20·0.06+1)) ) = 1/(1 + e^(−2.2) ) ≈ 0.9002
f (x2 ) = 1/(1 + e^(−(20·0.4+1)) ) = 1/(1 + e^(−9) ) ≈ 0.9999

Step 3: Squared Errors.

Error₁ = (f (x1 ) − y1 )² = (0.9002 − 0.4)² = (0.5002)² ≈ 0.2502
Error₂ = (f (x2 ) − y2 )² = (0.9999 − 0.95)² = (0.0499)² ≈ 0.0025

Step 4: Mean Square Error (MSE).

L(w, b) = (1/2)(0.2502 + 0.0025) = (1/2)(0.2527) ≈ 0.12635

L(w, b) ≈ 0.12635
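The same computation in code:

```python
import math

def f(x):
    """The fitted sigmoid from the question: 1 / (1 + e^-(20x + 1))."""
    return 1.0 / (1.0 + math.exp(-(20 * x + 1)))

points = [(0.06, 0.4), (0.4, 0.95)]
mse = sum((f(x) - y) ** 2 for x, y in points) / len(points)   # ~0.126
```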

10. Suppose that we implement the XOR Boolean function using the network shown
below. Consider the statement that “a hidden layer with two neurons suffices to
implement XOR”. The statement is

[Figure: a network with inputs x1 , x2 , four hidden neurons h1 –h4 labelled with input-weight pairs (1,1), (−1,1), (1,−1), (1,1), a bias of −2, red edges with w = −1, blue edges with w = +1, and an output y combining the hidden neurons through weights w1 , w2 , w3 , w4 ]
(a) True
(b) False

Correct Answer: (a)


Solution:

(a) First, recall the XOR truth table:


x1 x2 XOR Output
0 0 0
0 1 1
1 0 1
1 1 0
(b) Looking at the given network structure:
• It has 4 hidden neurons, each receiving inputs with weights -1 (red) or +1
(blue)
• Each hidden neuron shows two values (e.g., 1,1 or -1,1)
• Network has a bias of -2
(c) Key insight for sufficiency of two neurons:
• XOR function requires two line separators in the input space
• Each neuron can act as one line separator
• Two neurons together can create the necessary separation pattern
(d) Mathematical justification:
• First neuron: Can create a line separating (0,0) from (1,1)
• Second neuron: Can create a line separating (0,1) from (1,0)
• The output layer combines these separations to implement XOR
(e) Implementation with minimum neurons:
• Neuron 1: Detects when both inputs are 1
• Neuron 2: Detects when both inputs are 0
• Output layer: Combines these signals to produce correct XOR output

Conclusion: While the given network uses 4 neurons, it’s overparameterized for the
XOR problem. Two neurons are mathematically sufficient because:

• XOR is not linearly separable (impossible with single neuron)


• Two neurons provide the minimum geometric complexity needed
• Additional neurons (as shown in the network) may aid training but aren’t nec-
essary

Therefore, the statement is True.
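The sufficiency claim can be made concrete with one hand-picked two-neuron construction (the weights below are illustrative, not the ones in the figure): h1 computes OR, h2 computes AND, and the output fires exactly when h1 is on and h2 is off.

```python
def step(z):
    """Threshold activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def xor_two_hidden(x1, x2):
    h1 = step(x1 + x2 - 1)        # OR:  fires when at least one input is 1
    h2 = step(x1 + x2 - 2)        # AND: fires only when both inputs are 1
    return step(h1 - 2 * h2 - 1)  # fires when h1 = 1 and h2 = 0 -> XOR
```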


Deep Learning - Week 3

Use the following data to answer the questions 1 to 2


A neural network contains an input layer h0 = x, three hidden layers (h1 , h2 , h3 ), and
an output layer O. All the hidden layers use the Sigmoid activation function, and the
output layer uses the Softmax activation function.
Suppose the input x ∈ R200 , and all the hidden layers contain 10 neurons each. The
output layer contains 4 neurons.

1. How many parameters (including biases) are there in the entire network?
Correct Answer: 2274
Solution:
Number of Parameters
Input Layer to h1 : 200 × 10 + 10 = 2010
h1 to h2 : 10 × 10 + 10 = 110
h2 to h3 : 10 × 10 + 10 = 110
h3 to Output Layer: 10 × 4 + 4 = 44
Total Parameters: 2010 + 110 + 110 + 44 = 2274
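The count generalizes to any dense stack of layers; a one-line sketch:

```python
# Dense-layer parameter count: weights (fan_in x fan_out) plus fan_out biases.
layers = [200, 10, 10, 10, 4]   # input, three hidden layers, output
params = sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))
# 2010 + 110 + 110 + 44 = 2274
```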

2. Suppose all elements in the input vector are zero, and the corresponding true label is
also 0. Further, suppose that all the parameters (weights and biases) are initialized
to zero. What is the loss value if the cross-entropy loss function is used? Use the
natural logarithm (ln).
Correct Answer: Range(1.317,1.455)
Solution:
Loss with Zero Inputs and Parameters Input: x = 0, weights and biases = 0.
Hidden Layers: σ(0) = 0.5.
Output Layer Logits: [0, 0, 0, 0].
Softmax: Softmax(zᵢ) = 1/4, ∀i.
Cross-Entropy Loss: − ln(1/4) = ln(4) ≈ 1.386.
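The same computation in code:

```python
import math

# With x = 0 and all parameters 0, every logit is 0, so softmax is uniform.
logits = [0.0, 0.0, 0.0, 0.0]
exps = [math.exp(z) for z in logits]
softmax = [e / sum(exps) for e in exps]   # each entry is 0.25
loss = -math.log(softmax[0])              # cross-entropy for true class 0 = ln 4
```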


Use the following data to answer the questions 3 to 4


The diagram below shows a neural network. The network contains two hidden layers
and one output layer. The input to the network is a column vector x ∈ R3 . The first
hidden layer contains 9 neurons, the second hidden layer contains 5 neurons and the
output layer contains 2 neurons. Each neuron in the lth layer is connected to all the
neurons in the (l + 1)th layer. Each neuron has a bias connected to it (not explicitly
shown in the figure).
[Figure: input layer x1 , x2 , x3 ; hidden layer 1 with nine neurons h1(1) , . . . , h9(1) (pre-activations a1 , weight matrix W1 ); hidden layer 2 with five neurons h1(2) , . . . , h5(2) (weight matrix W2 ); output layer ŷ1 , ŷ2 (weight matrix W3 )]

In the diagram, W1 is a matrix and x, a1 , h1 , and O are all column vectors. The
notation Wi [j, :] denotes the j-th row of the matrix Wi , Wi [:, j] denotes the j-th column
of the matrix Wi , and Wk [i, j] denotes the element at the i-th row and j-th column of the
matrix Wk .

3. Choose the correct dimensions of W1 and a1

(a) W1 ∈ R3×9
(b) a1 ∈ R9×5
(c) W1 ∈ R9×3
(d) a1 ∈ R1×9
(e) W1 ∈ R1×9
(f) a1 ∈ R9×1

Correct Answer: (c),(f)


Solution:

4. How many learnable parameters(including bias) are there in the network?


Correct Answer: 98
Solution:
Number of parameters in W1 (with biases): (9 × 3) + 9 = 36
Number of parameters in W2 (with biases): (5 × 9) + 5 = 50
Number of parameters in W3 (with biases): (2 × 5) + 2 = 12
Total: 36 + 50 + 12 = 98.
5. We have a multi-classification problem that we decide to solve by training a feedfor-
ward neural network. What activation function should we use in the output layer to
get the best results?

(a) Logistic
(b) Step function
(c) Softmax
(d) linear

Correct Answer: (c)


Solution: Softmax works best for multi-class classification problems since it outputs a
valid probability distribution over the classes.

6. Which of the following statements about backpropagation is true?

(a) It is used to compute the output of a neural network.


(b) It is used to optimize the weights in a neural network.
(c) It is used to initialize the weights in a neural network.
(d) It is used to regularize the weights in a neural network.

Correct Answer: (b)


Solution: Backpropagation is a commonly used algorithm for optimizing the weights
in a neural network. It works by computing the gradient of the loss function with
respect to each weight in the network, and then using that gradient to update the
weight in a way that minimizes the loss function.

7. Given two probability distributions p and q, under what conditions is the cross entropy
between them minimized?

(a) All the values in p are lower than corresponding values in q


(b) All the values in p are higher than corresponding values in q
(c) p = 0(0 is a vector)
(d) p = q

Correct Answer: (d)


Solution:Cross entropy is lowest when both distributions are the same.

8. Given that the probability of Event A occurring is 0.18 and the probability of Event
B occurring is 0.92, which of the following statements is correct?

(a) Event A has a low information content


(b) Event A has a high information content
(c) Event B has a low information content
(d) Event B has a high information content

Correct Answer: (b),(c)


Solution: Events with high probability have low information content while events
with low probability have high information content.

Use the following data to answer the questions 9 and 10


The following diagram represents a neural network containing two hidden layers and
one output layer. The input to the network is a column vector x ∈ R3 . The activation
function used in hidden layers is sigmoid. The output layer doesn’t contain any
activation function and the loss used is the squared error loss (ŷ − y)² .
[Figure: input layer x1 , x2 , x3 ; hidden layer 1 with neurons h1(1) , h2(1) , h3(1) ; hidden layer 2 with neurons h1(2) , h2(2) ; a single output ŷ1 ]

The following network doesn’t contain any biases and the weights of the network are
given below:

W1 = [ 1  1  3
       2 −1  1
       1  2 −2 ]

W2 = [ 1 1 2
       3 1 1 ]

W3 = [ 1 2 ]

The input to the network is x = (1, 2, 1)ᵀ and the target value is y = 5.

9. What is the predicted output for the given input x after doing the forward pass?
Correct Answer: Range(2.9,3.0)
Solution: Doing the forward pass in the network we get:

h1 = W1 · x = (6, 1, 3)ᵀ
a1 = sigmoid(h1 ) = (0.997, 0.731, 0.952)ᵀ
h2 = W2 · a1 = (3.632, 4.674)ᵀ
a2 = sigmoid(h2 ) = (0.974, 0.990)ᵀ
ŷ = W3 · a2 = (1)(0.974) + (2)(0.990) = 2.954

10. Compute and enter the loss between the output generated by input x and the true
output y.
Correct Answer: Range(3.97,4.39)
Solution: Loss = (5 − 2.954)² ≈ 4.186
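Both the forward pass (Q9) and the loss (Q10) can be reproduced with NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[1, 1, 3], [2, -1, 1], [1, 2, -2]], dtype=float)
W2 = np.array([[1, 1, 2], [3, 1, 1]], dtype=float)
W3 = np.array([1, 2], dtype=float)
x = np.array([1, 2, 1], dtype=float)
y_true = 5.0

a1 = sigmoid(W1 @ x)           # hidden layer 1 activations
a2 = sigmoid(W2 @ a1)          # hidden layer 2 activations
y_hat = W3 @ a2                # linear output, ~2.95
loss = (y_true - y_hat) ** 2   # squared error, ~4.2
```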
Deep Learning - Week 4

1. Using the Adam optimizer with β1 = 0.9, β2 = 0.999, and ϵ = 10−8 , what would be
the bias-corrected first moment estimate after the first update if the initial gradient
is 4?

(a) 0.4
(b) 4.0
(c) 3.6
(d) 0.44

Correct Answer: (b)


Solution: In Adam, the first moment estimate is calculated as:

mₜ = β1 · mₜ₋₁ + (1 − β1 ) · gₜ

For the first update, m0 = 0, so:

m1 = 0.9 · 0 + 0.1 · 4 = 0.4

The bias-corrected first moment is:

m̂ₜ = mₜ /(1 − β1^t )

m̂1 = 0.4/(1 − 0.9¹) = 0.4/0.1 = 4

Therefore, the bias-corrected first moment estimate after the first update is 4.0.
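The same two steps in code:

```python
# Adam first-moment update and bias correction, first step only.
beta1 = 0.9
g1 = 4.0
m0 = 0.0

m1 = beta1 * m0 + (1 - beta1) * g1   # raw estimate: 0.4
m1_hat = m1 / (1 - beta1 ** 1)       # bias-corrected: 0.4 / 0.1 = 4.0
```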

2. In a mini-batch gradient descent algorithm, if the total number of training samples


is 50,000 and the batch size is 100, how many iterations are required to complete 10
epochs?

(a) 5,000
(b) 50,000
(c) 500
(d) 5

Correct Answer: (a)


Solution: Let’s break this down step by step: 1) Number of batches per epoch =
Total samples / Batch size = 50,000 / 100 = 500 batches 2) Number of iterations for
10 epochs = Number of batches per epoch * Number of epochs = 500 * 10 = 5,000
iterations
Therefore, 5,000 iterations are required to complete 10 epochs.

3. In a stochastic gradient descent algorithm, the learning rate starts at 0.1 and decays
exponentially with a decay rate of 0.1 per epoch. What will be the learning rate after
5 epochs?
(a) 0.09
(b) 0.059
(c) 0.05
(d) 0.061

Correct Answer: (b)


Solution: With a decay rate of k = 0.1 per epoch, the learning rate shrinks by a factor of (1 − k) each epoch:

ηₜ = η0 · (1 − k)ᵗ

where η0 is the initial learning rate and t is the number of epochs. Plugging in the values:

η5 = 0.1 · (0.9)⁵ ≈ 0.1 · 0.59049 ≈ 0.059
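A one-line check, assuming the per-epoch multiplicative form ηₜ = η0 (1 − k)ᵗ that yields the keyed value 0.059:

```python
eta0 = 0.1   # initial learning rate
k = 0.1      # per-epoch decay rate
eta5 = eta0 * (1 - k) ** 5   # 0.1 * 0.9^5
```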

4. In the context of Adam optimizer, what is the purpose of bias correction?

(a) To prevent overfitting


(b) To speed up convergence
(c) To correct for the bias in the estimates of first and second moments
(d) To adjust the learning rate

Correct Answer: (c)


Solution: In Adam optimizer, bias correction is used to correct for the bias in the
estimates of first and second moments. This is particularly important in the early
stages of training when the moving averages are biased towards zero due to their
initialization.

5. The figure below shows the contours of a surface.

[Figure: elliptical contour lines centred at the origin, more closely spaced along the x-axis than along the y-axis]

Suppose that a man walks from −1 to +1 along both the horizontal (x) axis and the
vertical (y) axis. The statement that the man would have seen the slope change
more rapidly along the x-axis than along the y-axis is,

(a) True
(b) False
(c) Cannot say

Correct Answer: (a)


Solution: The given contour plot represents an elliptical surface of the form f (x, y) = 2x² + y²
(note the larger coefficient on x², which is what makes the contours tighter along x). In a
contour plot, the closeness of contour lines indicates the rate of change of the function.
Since the contours are more closely spaced along the x-axis than along the y-axis, the
function changes more rapidly in the x-direction. This means that a person walking
from x = −1 to x = 1 would experience steeper slope changes compared to walking
along the y-axis. Therefore, the statement that the slope changes more rapidly along
the x-axis than the y-axis is True.

6. What is the primary benefit of using Adagrad compared to other optimization algo-
rithms?

(a) It converges faster than other optimization algorithms.


(b) It is more memory-efficient than other optimization algorithms.
(c) It is less sensitive to the choice of hyperparameters(learning rate).
(d) It is less likely to get stuck in local optima than other optimization algorithms.

Correct Answer: (c)


Solution: The main advantage of using Adagrad over other optimization algorithms
is that it is less sensitive to the choice of hyperparameters.

7. What are the benefits of using stochastic gradient descent compared to vanilla gra-
dient descent?

(a) SGD converges more quickly than vanilla gradient descent.


(b) SGD is computationally efficient for large datasets.
(c) SGD theoretically guarantees that the descent direction is optimal.
(d) SGD experiences less oscillation compared to vanilla gradient descent.

Correct Answer: (a),(b)


Solution: SGD updates the weights more frequently and hence converges faster. Since each
update is computationally cheaper than a full vanilla gradient descent pass, it works well for large datasets.

8. Select the true statements about the factor β used in the momentum based gradient
descent algorithm.

(a) Setting β = 0.1 allows the algorithm to move faster than the vanilla gradient
descent algorithm
(b) Setting β = 0 makes it equivalent to the vanilla gradient descent algorithm
(c) Setting β = 1 makes it equivalent to the vanilla gradient descent algorithm
(d) Oscillation around the minimum will be less if we set β = 0.1 than setting
β = 0.99

Correct Answer: (a),(b),(d)


Solution: Let’s analyze the statements about the factor β used in the momentum-
based gradient descent algorithm:
Momentum-based Gradient Descent: The momentum-based gradient descent algo-
rithm updates the weights using the following rule:

vt+1 = βvt + (1 − β)∇wt


wt+1 = wt − ηvt+1

where:
• vt is the velocity (momentum term),
• β is the momentum factor,
• ∇wt is the gradient of the loss with respect to the weights at time t,
• η is the learning rate.
Setting β = 0.1 allows the algorithm to move faster than the vanilla (plain)
gradient descent algorithm: - When β is set to a small positive value like 0.1, the
algorithm incorporates some momentum, which can help accelerate convergence by
navigating more effectively through shallow regions of the loss surface. This statement
is generally true.
Setting β = 1 makes it equivalent to vanilla gradient descent algorithm: -
If β = 1, the velocity term vt+1 would solely depend on the previous velocity vt and
would not incorporate the current gradient ∇wt . This effectively stalls the learning
process. However, the claim that it makes it equivalent to vanilla gradient descent
(which does not use momentum) is incorrect. Vanilla gradient descent updates weights
purely based on the gradient without momentum.
Setting β = 0 makes it equivalent to vanilla gradient descent algorithm:
- When β = 0, the velocity term vt+1 is directly proportional to the current gradi-
ent ∇wt . This reduces the momentum-based gradient descent to the plain gradient
descent update rule. Thus, this statement is true.
Oscillation around the minimum will be less if we set β = 0.1 than setting
β = 0.99: - Higher values of β (close to 1) result in more momentum, which can cause
larger oscillations around the minimum due to the higher inertia. A lower value of
β like 0.1 results in less momentum, leading to reduced oscillations. Therefore, this
statement is true.
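A minimal sketch of the update rule above on a simple quadratic f(w) = w² (the objective is an illustrative choice, not from the question), contrasting β = 0, which reduces to plain gradient descent, with β = 0.9:

```python
# Momentum GD using the rule from the solution:
# v_{t+1} = beta * v_t + (1 - beta) * grad,  w_{t+1} = w_t - eta * v_{t+1}.
def run(beta, eta=0.1, steps=200, w0=5.0):
    w, v = w0, 0.0
    for _ in range(steps):
        g = 2 * w                       # gradient of f(w) = w^2
        v = beta * v + (1 - beta) * g
        w = w - eta * v
    return w

w_vanilla = run(beta=0.0)    # beta = 0: identical to plain gradient descent
w_momentum = run(beta=0.9)   # oscillates more but still converges
```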

9. What is the advantage of using mini-batch gradient descent over batch gradient de-
scent?

(a) Mini-batch gradient descent is more computationally efficient than batch gradi-
ent descent.
(b) Mini-batch gradient descent leads to a more accurate estimate of the gradient
than batch gradient descent.
(c) Mini batch gradient descent gives us a better solution.
(d) Mini-batch gradient descent can converge faster than batch gradient descent.

Correct Answer: (a),(d)


Solution: The advantage of using mini-batch gradient descent over batch gradient
descent is that it is more computationally efficient, allows for parallel processing of
the training examples, and can converge faster than batch gradient descent.

10. In the Nesterov Accelerated Gradient (NAG) algorithm, the gradient is computed at:

(a) The current position


(b) A “look-ahead” position
(c) The previous position
(d) The average of current and previous positions

Correct Answer: (b)


Solution: In NAG, the gradient is computed at a “look-ahead” position. This look-
ahead position is determined by applying the momentum step to the current position.
This allows the algorithm to have a sort of “prescience” about where the parameters
are going, which can lead to improved convergence rates compared to standard mo-
mentum.
Deep Learning - Week 5

1. Which of the following is the most appropriate description of the method used in
PCA to achieve dimensionality reduction?

(a) PCA achieves this by discarding a random subset of features in the dataset
(b) PCA achieves this by selecting those features in the dataset along which the
variance of the dataset is maximised
(c) PCA achieves this by retaining those features in the dataset along which
the variance of the dataset is minimised
(d) PCA achieves this by looking for those directions in the feature space along
which the variance of the dataset is maximised

Correct Answer: (d)


Solution: PCA looks for a new set of directions in feature space such that the first
few directions capture the maximum variance in the data. It does this by re-orienting
the feature axes, which can be thought of as rotating the axes in the feature space.

2. What is/are the limitations of PCA?

(a) It can only identify linear relationships in the data.


(b) It can be sensitive to outliers in the data.
(c) It is computationally less efficient than autoencoders
(d) It can only reduce the dimensionality of a dataset by a fixed amount.

Correct Answer: (a),(b)


Solution: PCA can be sensitive to outliers in the data, since the principal components
are calculated based on the covariance matrix of the data. Outliers can have a large
impact on the covariance matrix and can skew the results of the PCA. Also, it can
only capture linear relationships in the data.

3. The following are possible numbers of linearly independent eigenvectors for a 7 × 7


matrix. Choose the incorrect option.

(a) 1
(b) 3
(c) 9
(d) 5
(e) 8

Correct Answer: (c),(e)


Solution: An n×n matrix can have between 1 and n linearly independent eigenvectors.
 
4. Find the singular values of the following matrix:

A = [ −4 −6 ; 3 −8 ]
(a) σ1 = 10, σ2 = 5
(b) σ1 = 1, σ2 = 0
(c) σ1 = 100, σ2 = 25
(d) σ1 = σ2 = 0

Correct Answer: (a)


 
Solution: Let A denote the given matrix. Then

AᵀA = [ 25 0 ; 0 100 ]

The singular values of A are the positive square roots of the eigenvalues of AᵀA.
Therefore,

σ1 = 10 and σ2 = 5.
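A quick NumPy check:

```python
import numpy as np

A = np.array([[-4.0, -6.0], [3.0, -8.0]])
# Singular values are the square roots of the eigenvalues of A^T A;
# numpy returns them in descending order.
svals = np.linalg.svd(A, compute_uv=False)
```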

5. PCA is performed on a mean-centred dataset in R3 . If the first principal component
is (1/√6)(1, −1, 2), which of the following could be the second principal component?

(a) (1, −1, 2)
(b) (0, 0, 0)
(c) (1/√5)(0, 1, 2)
(d) (1/√2)(−1, −1, 0)

Correct Answer: (d)


Solution: The principal components are orthogonal eigenvectors of the covariance
matrix. A zero vector cannot be a principal component because it has no direction, so
(0, 0, 0) is ruled out. Option (a) is parallel to the first component rather than
orthogonal to it, and option (c) is not orthogonal to it either; only option (d) is
orthogonal to the first principal component.
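A minimal NumPy check (not part of the original solution) confirms that option (d) is a unit vector orthogonal to the first principal component:

```python
import numpy as np

pc1 = np.array([1.0, -1.0, 2.0]) / np.sqrt(6)   # first principal component
pc2 = np.array([-1.0, -1.0, 0.0]) / np.sqrt(2)  # option (d)

# Principal components must be mutually orthogonal unit vectors.
print(np.dot(pc1, pc2))      # ≈ 0 (orthogonal)
print(np.linalg.norm(pc2))   # ≈ 1 (unit length)
```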
Questions 6-9 are based on common data.
Consider the following data points x1, x2, x3 to answer the following questions:

   x1 = (−2, 2),  x2 = (2, −2),  x3 = (1, 1)

6. What is the mean of the given data points x1, x2, x3?

(a) (1, 1)
(b) (1.67, 1.67)
(c) (2, 2)
(d) (0.33, 0.33)

Correct Answer: (d)

Solution: Mean of x1, x2, x3 = (x1 + x2 + x3)/3 = (1/3)(1, 1) ≈ (0.33, 0.33).
7. The covariance matrix C = (1/n) Σᵢ₌₁ⁿ (xᵢ − x̄)(xᵢ − x̄)^T is given by (x̄ is the mean
of the data points):

(a) [ 8.66  −7.33 ]
    [ −7.33  8.66 ]

(b) [ 2.88  −2.44 ]
    [ −2.44  2.88 ]

(c) [ 0.22  −0.22 ]
    [ −0.22  0.22 ]

(d) [ 5.33  −0.33 ]
    [ −5.33  0.33 ]

Correct Answer: (b)

Solution: x̄ = (0.3333, 0.3333), so

   x1 − x̄ = (−2.3333, 1.6667),  x2 − x̄ = (1.6667, −2.3333),  x3 − x̄ = (0.6667, 0.6667)

Now compute (xᵢ − x̄)(xᵢ − x̄)^T for each point:

   For x1: [ 5.4444  −3.8889 ]
           [ −3.8889  2.7778 ]

   For x2: [ 2.7778  −3.8889 ]
           [ −3.8889  5.4444 ]

   For x3: [ 0.4444   0.4444 ]
           [ 0.4444   0.4444 ]

Summing these matrices and multiplying by 1/n = 1/3 gives:

   C = [ 2.8889  −2.4444 ]
       [ −2.4444  2.8889 ]

Therefore, the correct covariance matrix is option (b).
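The whole computation can be reproduced in a few lines of NumPy (a sketch, using the 1/n convention from the question rather than the 1/(n−1) sample convention):

```python
import numpy as np

X = np.array([[-2.0, 2.0],
              [2.0, -2.0],
              [1.0, 1.0]])   # rows are x1, x2, x3

x_bar = X.mean(axis=0)       # ≈ (0.33, 0.33)
D = X - x_bar                # centred data
C = D.T @ D / len(X)         # covariance with the 1/n convention
print(C)                     # ≈ [[2.889, -2.444], [-2.444, 2.889]]
```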

8. The maximum eigenvalue of the covariance matrix C is:

(a) 1
(b) 5.33
(c) 0.44
(d) 0.5
Correct Answer: (b)

Solution: Solving det(C − λI) = 0 for the covariance matrix C gives two eigenvalues:
λ1 = 0.44445 and λ2 = 5.33333. The maximum eigenvalue is λ2 ≈ 5.33.

9. The eigenvector corresponding to the maximum eigenvalue of the given matrix C is:

(a) (1, 1)
(b) (−1, 1)
(c) (0.67, 0)
(d) (−1.48, 1)

Correct Answer: (b)

Solution: Using the maximum eigenvalue found earlier, we solve the equation
(C − λI)v = 0 to find the eigenvector v. The eigenvector corresponding to the
maximum eigenvalue is (−1, 1).
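The eigenvalues and the dominant eigenvector can be verified numerically (a sketch; note that NumPy may return the eigenvector with its overall sign flipped):

```python
import numpy as np

C = np.array([[2.88889, -2.44444],
              [-2.44444, 2.88889]])

vals, vecs = np.linalg.eigh(C)   # eigh: symmetric matrices, eigenvalues ascending
print(vals)                      # ≈ [0.44445, 5.33333]

top = vecs[:, -1]                # eigenvector of the largest eigenvalue
# Components are equal in magnitude and opposite in sign, i.e. proportional to (-1, 1).
print(top)
```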

10. Given that A is a 2 × 2 matrix, what is the determinant of A, if its eigenvalues are 6 and 7?
Correct Answer: 42

Solution: The determinant of a matrix equals the product of its eigenvalues. Therefore,
if a matrix has eigenvalues λ1 and λ2, its determinant is det(A) = λ1 · λ2 = 6 × 7 = 42.
Deep Learning - Week 6

1. What is/are the primary advantages of Autoencoders over PCA?

(a) Autoencoders are less prone to overfitting than PCA.


(b) Autoencoders are faster and more efficient than PCA.
(c) Autoencoders require fewer input data than PCA.
(d) Autoencoders can capture nonlinear relationships in the input data.

Correct Answer: (d)


Solution: Autoencoders can capture nonlinear relationships in the input data, which
allows them to learn more complex representations than PCA. This can be particu-
larly useful in applications where the input data contains nonlinear relationships that
cannot be captured by a linear method like PCA.

2. Which of the following is a potential advantage of using an overcomplete autoencoder?

(a) Reduction of the risk of overfitting


(b) Faster training time
(c) Ability to learn more complex and nonlinear representations
(d) To compress the input data

Correct Answer: (c)

Solution: Overcomplete autoencoders have more units in the hidden layer than in the
input layer, which can increase the capacity of the network and allow it to learn
more complex and nonlinear representations of the input data.

3. We are given an autoencoder A. The average activation value of neurons in this


network is 0.015. The given autoencoder is

(a) Contractive autoencoder


(b) Sparse autoencoder
(c) Overcomplete neural network
(d) Denoising autoencoder

Correct Answer: (b)


Solution: The neurons are mostly inactive for any given input. Hence the autoencoder
is a sparse autoencoder.

4. Suppose we build a neural network for a 5-class classification task. Suppose for a
single training example, the true label is [0 1 0 0 1] while the predictions by the
neural network are [0.4 0.25 0.2 0.1 0.6]. What would be the value of cross-entropy
loss for this example? (Answer up to two decimal places, Use base 2 for log-related
calculations)
Correct Answer: range(2.7, 2.8)
Solution: Cross-entropy loss is given by − Σᵢ₌₁⁵ yᵢ log₂(ŷᵢ)
= −0 · log2 (0.4) − 1 · log2 (0.25) − 0 · log2 (0.2) − 0 · log2 (0.1) − 1 · log2 (0.6)
= −1 · log2 (0.25) − 1 · log2 (0.6)
= −1 · −2 − 1 · −0.7369
= 2.7369
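The same arithmetic as a short Python sketch:

```python
import numpy as np

y_true = np.array([0, 1, 0, 0, 1])
y_pred = np.array([0.4, 0.25, 0.2, 0.1, 0.6])

# Cross-entropy with base-2 logarithms; only the nonzero labels contribute.
loss = -np.sum(y_true * np.log2(y_pred))
print(round(loss, 4))   # ≈ 2.737
```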

5. If an under-complete autoencoder has an input layer with a dimension of 5, what


could be the possible dimension of the hidden layer?

(a) 5
(b) 4
(c) 2
(d) 0
(e) 6

Correct Answer: (b),(c)


Solution: The dimension of the hidden layer is less than the input layer in the under-
complete autoencoder.

6. Which of the following networks represents an autoencoder?

[Each option shows a feed-forward network with an input layer, a single hidden layer,
and an output layer:]

(a) 2 input units, 3 hidden units, 4 output units
(b) 3 input units, 2 hidden units, 1 output unit
(c) 3 input units, 4 hidden units, 2 output units
(d) 4 input units, 2 hidden units, 4 output units

Correct Answer: (d)


Solution: Autoencoder is used to learn the representation of input data. Hence
the output layer’s size should be the same as the input layer’s size to compare the
reconstruction error.

7. What is the primary reason for adding corruption to the input data in a denoising
autoencoder?

(a) To increase the complexity of the model.


(b) To improve the model’s ability to generalize to unseen data.
(c) To reduce the size of the training dataset.
(d) To increase the training time.

Correct Answer: b)
Solution: Adding corruption to the input data in a denoising autoencoder serves the
purpose of forcing the model to learn robust features that can reconstruct the original
input even when parts of it are missing or noisy. This process prevents the model
from merely memorizing the training data, thereby enhancing its ability to generalize
to new, unseen data. This generalization is crucial for the model’s performance on
real-world tasks where the input may not always be clean or complete.

8. Suppose for one data point we have features x1 , x2 , x3 , x4 , x5 as −4, 6, 2.8, 0, 17.3 then,
which of the following function should we use on the output layer(decoder)?

(a) Linear
(b) Logistic
(c) Relu
(d) Tanh

Correct Answer: (a)


Solution: The linear activation function is commonly used in regression tasks where
the output can be any real number, which aligns with the nature of the given features.
It allows the model to predict values across the entire real number line, which is
suitable for the diverse range of input values we see in the features.

9. Which of the following statements about overfitting in overcomplete autoencoders is


true?

(a) Reconstruction error is very high while training


(b) Reconstruction error is very low while training
(c) Network fails to learn good representations of input
(d) Network learns good representations of input

Correct Answer: (b),(c)


Solution: (b) Reconstruction error is very low while training: An overcomplete au-
toencoder, with more neurons in the hidden layer than in the input layer, has enough
capacity to memorize the training data. This often results in a very low reconstruction
error during training because the network effectively ”copies” the input.
(c) Network fails to learn good representations of input: Although the autoencoder
can reconstruct the training data accurately (due to memorization), it may not cap-
ture meaningful or generalizable features from the data. This means it fails to learn
good representations that can be useful for tasks like dimensionality reduction or
feature extraction on unseen data.
(a) is incorrect because the reconstruction error during training is low, not high.
(d) is incorrect because while the network learns to reconstruct inputs, it doesn’t
necessarily learn useful or robust representations.

10. What is the purpose of a decoder in an autoencoder?

(a) To reconstruct the input data


(b) To generate new data
(c) To compress the input data
(d) To extract features from the input data

Correct Answer: (a)


Solution: The decoder in an autoencoder is responsible for reconstructing the input
data from the encoded representation generated by the encoder. It is used for data
reconstruction and is typically the reverse of the encoding process.
Deep Learning - Week 7

1. Which of the following statements about L2 regularization is true?

(a) It adds a penalty term to the loss function that is proportional to the absolute
value of the weights.
(b) It results in sparse solutions for w.
(c) It adds a penalty term to the loss function that is proportional to the square of
the weights.
(d) It is equivalent to adding Gaussian noise to the weights.

Correct Answer: (c)


Solution:
It adds a penalty term to the loss function that is proportional to the
square of the weights. L2 regularization, also known as Ridge Regularization,
adds a penalty term to the loss function that is proportional to the sum of the squares
of the weights. The modified loss function typically looks like:
   L_reg = L + λ Σ w²

where λ is a hyperparameter that controls the strength of regularization.


Now, let’s analyze the other options:
It adds a penalty term to the loss function that is proportional to the
absolute value of the weights. Incorrect. This describes L1 regularization
(Lasso), not L2.
It results in sparse solutions for w. Incorrect. L2 regularization does not lead
to sparse solutions (i.e., it does not force weights to be exactly zero). Instead, it
shrinks weights toward zero but usually keeps them nonzero. L1 regularization is
the one that encourages sparsity.
It is equivalent to adding Gaussian noise to the weights. Incorrect. While
L2 regularization can be interpreted as a prior in a Bayesian framework (i.e.,
assuming a Gaussian prior on weights), it does not mean that Gaussian noise is
explicitly added to the weights during training.

Common Data Q2-Q3


Consider two models:
f̂1(x) = w0 + w1x

f̂2(x) = w0 + w1x + w2x² + w4x⁴ + w5x⁵

2. Which of these models has higher complexity?

(a) fˆ1 (x)


(b) fˆ2 (x)
(c) It is not possible to decide without knowing the true distribution of data points
in the dataset.

Correct Answer: (b)


Solution: Model fˆ2 (x) has higher complexity compared to Model fˆ1 (x). The com-
plexity of a model generally increases with the degree of the polynomial terms. Model
fˆ1 (x) is a linear model, whereas Model fˆ2 (x) includes higher-degree polynomial terms
(specifically x2 and x5 ), making it capable of capturing more complex patterns.
Therefore, fˆ2 (x) is more complex.

3. We generate the data using the following model:

y = 7x³ + 12x² + x + 2.

We fit the two models fˆ1 (x) and fˆ2 (x) on this data and train them using a neural
network.

(a) fˆ1 (x) has a higher bias than fˆ2 (x).


(b) fˆ2 (x) has a higher bias than fˆ1 (x).
(c) fˆ2 (x) has a higher variance than fˆ1 (x).
(d) fˆ1 (x) has a higher variance than fˆ2 (x).

Correct Answer: (a),(c)


Solution: fˆ1 (x) has a higher bias than fˆ2 (x). (Because fˆ1 (x) is simpler and cannot
capture the true complexity of the data.) fˆ2 (x) has a higher variance than fˆ1 (x).
(Because fˆ2 (x) is more complex and may fit the training data too closely.)

4. Suppose that we apply Dropout regularization to a feed forward neural network.


Suppose further that mini-batch gradient descent algorithm is used for updating the
parameters of the network. Choose the correct statement(s) from the following state-
ments.

(a) The dropout probability p can be different for each hidden layer
(b) Batch gradient descent cannot be used to update the parameters of the network
(c) Dropout with p = 0.5 acts as an ensemble regularizer
(d) The weights of the neurons which were dropped during the forward propagation
at tth iteration will not get updated during t + 1th iteration

Correct Answer: (a),(c)


Solution:

(a) The dropout probability p can be different for each hidden layer:
• True. It is common practice to apply different dropout rates to different
hidden layers, which allows for more control over the regularization strength
applied to each layer.
(b) Batch gradient descent cannot be used to update the parameters of
the network:
• False. Batch gradient descent, as well as mini-batch gradient descent, can
be used to update the parameters of a network with dropout regularization.
Dropout affects the training phase by randomly dropping neurons but does
not prevent the use of gradient descent algorithms for parameter updates.
(c) Dropout with p = 0.5 acts as an ensemble regularizer:
• True. Dropout with p = 0.5 can be seen as an ensemble method in the sense
that, during training, different subsets of neurons are active, which can
be interpreted as training a large number of “thinned” networks. During
testing, the full network is used but with the weights scaled to account for
the dropout, effectively acting as an ensemble of these thinned networks.
(d) The weights of the neurons which were dropped during the forward
propagation at t-th iteration will not get updated during t + 1-th it-
eration:
• False. During training, dropout randomly drops neurons in each mini-batch
iteration, but this does not mean that the weights of dropped neurons are
not updated. The update process occurs based on the backpropagation of
the loss through the network, and weights are updated according to the
gradients computed from the dropped and non-dropped neurons.

5. We have trained four different models on the same dataset using various hyperparam-
eters. The training and validation errors for each model are provided below. Based
on this information, which model is likely to perform best on the test dataset?
Model Training error Validation error
1 0.8 1.4
2 2.5 0.5
3 1.7 1.7
4 0.2 0.6

(a) Model 1
(b) Model 2
(c) Model 3
(d) Model 4

Correct Answer: (d)


Solution: Model 4 has both low training error and low validation error. Hence Model 4
is likely to perform best on the test dataset.

Common Data Q6-Q9


Consider the function L(w, b) = 0.4w² + 7b² + 1 and its contour plot (figure omitted here).
6. What is the value of L(w∗ , b∗ ) where w∗ and b∗ are the values that minimize the
function.
Correct Answer: 1
Solution: To find the value of L(w∗ , b∗ ) where w∗ and b∗ are the values that minimize
the function

L(w, b) = 0.4w² + 7b² + 1,

We follow these steps:


1. Find the Minimum Values of w and b:
The partial derivatives of L with respect to w and b are:

   ∂L/∂w = 0.8w,  ∂L/∂b = 14b
Setting these partial derivatives to zero:

0.8w = 0 =⇒ w = 0
14b = 0 =⇒ b = 0

Therefore, the values that minimize the function are w∗ = 0 and b∗ = 0.


2. Evaluate L at w∗ and b∗ :
Substitute w∗ = 0 and b∗ = 0 into the function L(w, b):
L(w∗, b∗) = L(0, 0) = 0.4(0)² + 7(0)² + 1 = 1

Thus, the value of L(w∗ , b∗ ) is 1.

7. What is the sum of the elements of ∇L(w∗ , b∗ )?


Correct Answer: 0
Solution: The gradient ∇L(w, b) is:
 
   ∇L(w, b) = (∂L/∂w, ∂L/∂b) = (0.8w, 14b).

At w∗ = 0 and b∗ = 0, the gradient is:

∇L(w∗ , b∗ ) = (0, 0) .

The sum of the elements of ∇L(w∗ , b∗ ) is:

0 + 0 = 0.

8. What is the determinant of HL (w∗ , b∗ ), where H is the Hessian of the function?


Correct Answer: 11.2
Solution: The Hessian matrix H_L(w, b) collects the second-order partial derivatives:

   ∂²L/∂w² = 0.8,  ∂²L/∂b² = 14,  ∂²L/∂w∂b = ∂²L/∂b∂w = 0

Thus, the Hessian matrix is:

   H_L(w, b) = [ 0.8   0 ]
               [  0   14 ]

The determinant of this matrix is:

Determinant = (0.8 · 14) − (0 · 0) = 11.2.

9. Compute the Eigenvalues and Eigenvectors of the Hessian. According to the eigen-
values of the Hessian, which parameter is the loss more sensitive to?
(a) b
(b) w

Correct Answer: (a)


Solution: The Hessian matrix is:

   H_L(w, b) = [ 0.8   0 ]
               [  0   14 ]

The eigenvalues are λ1 = 0.8 and λ2 = 14, with corresponding eigenvectors (1, 0) and
(0, 1), respectively. The larger eigenvalue λ2 = 14 corresponds to the parameter b,
so the loss is more sensitive to b.

10. Consider the problem of recognizing an alphabet (in upper case or lower case) of
English language in an image. There are 26 alphabets in the language. Therefore,
a team decided to use CNN network to solve this problem. Suppose that data aug-
mentation technique is being used for regularization. Then which of the following
transformation(s) on all the training images is (are) appropriate to the problem

(a) Rotating the images by ±10◦


(b) Rotating the images by ±180◦
(c) Translating image by 1 pixel in all direction
(d) Cropping

Correct Answer: (a),(c),(d)


Solution:
Cropping:
Appropriate. Cropping is useful for augmenting data by varying the parts of the
image that are used for training. This can help the model learn to recognize letters
even if they are partially obscured or not centered perfectly. It ensures that the model
is robust to variations in the position of the letter within the image.

Rotating the images by ±10◦ :


Appropriate. Rotating images slightly (such as ±10◦ ) helps the model become in-
variant to small rotational changes. This is useful because in practical scenarios,
characters might be slightly tilted, and the model should be able to recognize them
regardless of minor rotations.

Rotating the images by 180◦ :


Not Appropriate. Rotating images by 180◦ is generally not useful for character
recognition because it might lead to images that are completely inverted. For example,
’b’ would become ’q’ and ’M’ would become ’W’. Such rotations do not usually represent
valid variations in the context of character recognition.

Translating the image by 1 pixel in all directions:


Appropriate. Translating images by small amounts (such as 1 pixel) helps the model
become robust to slight positional shifts. This can improve the model’s ability to
recognize characters that are not perfectly aligned or are slightly shifted.
Deep Learning - Week 8

1. What are the challenges associated with using the Tanh(x) activation function?

(a) It is not zero centered


(b) Computationally expensive
(c) Non-differentiable at 0
(d) Saturation

Correct Answer: (b),(d)


Solution: Tanh(x) is zero-centered but the problem of saturation still persists. It is
computationally expensive to do this operation.

2. Which of the following problems makes training a neural network harder while using
sigmoid as the activation function?

(a) Not-continuous at 0
(b) Not-differentiable at 0
(c) Saturation
(d) Computationally expensive

Correct Answer: (c),(d)


Solution: Sigmoid is computationally expensive due to the exponentiation process.
They saturate easily and since their range is [0,1], weight update directions are limited.

3. Consider the Exponential ReLU (ELU) activation function, defined as:


   f(x) = x            if x > 0
   f(x) = a(eˣ − 1)    if x ≤ 0

where a ̸= 0. Which of the following statements is true?

(a) The function is discontinuous at x = 0.


(b) The function is non-differentiable at x = 0.
(c) Exponential ReLU can produce negative values.
(d) Exponential ReLU is computationally less expensive than ReLU.

Correct Answer: (c)


Solution:
1. Continuity at x = 0:

(a) Right-hand limit: lim_{x→0⁺} f(x) = 0.
(b) Left-hand limit: lim_{x→0⁻} a(eˣ − 1) = a(1 − 1) = 0.
(c) Since both limits and f(0) are equal, the function is continuous at x = 0, so
statement (a) is false.

2. Differentiability at x = 0:

(a) Right derivative: lim_{x→0⁺} f′(x) = 1.
(b) Left derivative: lim_{x→0⁻} a·eˣ = a.
(c) The function is differentiable at x = 0 only if a = 1.
(d) Since a ̸= 0 but is not necessarily 1, differentiability depends on a, making
statement (b) inconclusive.

3. Computational expense compared to ReLU:

(a) ReLU uses max(0, x), which is a simple comparison.
(b) ELU involves an exponential operation, which is more computationally expensive,
so statement (d) is false.

4. Possibility of negative values:

(a) For x < 0, f(x) = a(eˣ − 1).
(b) Since eˣ − 1 < 0 for x < 0, f(x) is negative when a > 0, so ELU can produce
negative values.
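A minimal implementation of ELU (a sketch, with a = 1) makes the negative outputs for x < 0 easy to see:

```python
import numpy as np

def elu(x, a=1.0):
    # For x <= 0 the output is a(e^x - 1), which is negative whenever a > 0.
    return np.where(x > 0, x, a * (np.exp(x) - 1))

out = elu(np.array([-2.0, 0.0, 3.0]))
print(out)   # ≈ [-0.8647, 0.0, 3.0]
```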

4. We have observed that the sigmoid neuron has become saturated. What might be
the possible output values at this neuron?

(a) 0.0666
(b) 0.589
(c) 0.9734
(d) 0.498
(e) 1

Correct Answer: (a),(c),(e)


Solution: Since the neuron has saturated its output values are close to 0 or 1.

5. What is the gradient of the sigmoid function at saturation?


Correct Answer: 0
Solution: At saturation, the sigmoid function outputs 0 or 1, and its gradient becomes
zero, causing vanishing gradients.

6. Which of the following are common issues caused by saturating neurons in deep
networks?

(a) Vanishing gradients


(b) Slow convergence during training
(c) Overfitting
(d) Increased model complexity
Correct Answer: (a),(b)
Solution: Saturating neurons, especially in sigmoid activation functions, cause vanish-
ing gradients, making it hard to propagate error signals back and slow down learning.

7. Given a neuron initialized with weights w1 = 0.9, w2 = 1.7, and inputs x1 = 0.4,
x2 = −0.7, calculate the output of a ReLU neuron.

Correct Answer: 0
Solution: The weighted sum is 0.9 × 0.4 + 1.7 × (−0.7) = 0.36 − 1.19 = −0.83. ReLU
outputs the max of 0 and the input, so the result is max(0, −0.83) = 0.
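The same computation in plain Python:

```python
w = [0.9, 1.7]
x = [0.4, -0.7]

z = sum(wi * xi for wi, xi in zip(w, x))   # 0.36 - 1.19 = -0.83
out = max(0.0, z)                          # ReLU clips negative pre-activations to 0
print(out)   # 0.0
```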

8. Which of the following is incorrect with respect to the batch normalization process
in neural networks?

(a) We normalize the output produced at each layer before feeding it into the next
layer
(b) Batch normalization leads to a better initialization of weights.
(c) Backpropagation can be used after batch normalization
(d) Variance and mean are not learnable parameters.

Correct Answer: (d)


Solution:
1. ”We normalize the output produced at each layer before feeding it into the next
layer.”
Batch Normalization (BN) normalizes activations by adjusting them to have zero
mean and unit variance before passing them to the next layer.
The formula for batch normalization is:
   x̂ = (x − µ) / √(σ² + ϵ)

This helps stabilize learning and speeds up convergence.


2. ”Batch normalization leads to a better initialization of weights.”
BN helps mitigate issues like internal covariate shift, making the training less depen-
dent on careful weight initialization.
It allows training with higher learning rates and stabilizes deep networks.
3. ”Backpropagation can be used after batch normalization.”
BN is differentiable, and gradients can flow through it during backpropagation.
During training, gradients are computed normally, taking into account the transfor-
mation applied by BN.
4. ”Variance and mean are not learnable parameters.” (Incorrect)
BN initially normalizes using batch statistics (mean µ and variance σ²).
However, batch normalization introduces learnable parameters: - γ (scaling parame-
ter) - β (shifting parameter)
These parameters allow the model to learn an optimal representation instead of always
enforcing zero mean and unit variance.
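A minimal sketch of the batch-normalization transform for a single feature, including the learnable γ and β discussed above:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize with batch statistics, then apply the learnable scale and shift.
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    return gamma * x_hat + beta

a = np.array([1.0, 2.0, 3.0, 4.0])
y = batch_norm(a)
print(y.mean(), y.var())   # ≈ 0 and ≈ 1 with gamma=1, beta=0
```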
9. Which of the following is an advantage of unsupervised pre-training in deep learning?

(a) It helps in reducing overfitting


(b) Pre-trained models converge faster
(c) It requires fewer computational resources
(d) It improves the accuracy of the model

Correct Answer: (a),(b),(d)


Solution: Unsupervised pre-training helps in reducing overfitting in deep neural net-
works by initializing the weights in a better way. This technique requires more com-
putational resources than supervised learning, but it can improve the accuracy of
the model. Additionally, the pre-trained model is shown to converge faster than
non-pre-trained models

10. How can you tell if your network is suffering from the Dead ReLU problem?

(a) The loss function is not decreasing during training


(b) A large number of neurons have zero output
(c) The accuracy of the network is not improving
(d) The network is overfitting to the training data

Correct Answer: (b)


Solution: The Dead ReLU problem can be detected by checking the output of each
neuron in the network. If a large number of neurons have zero output, then the
network may be suffering from the Dead ReLU problem. This can indicate that the
bias term is too high, causing a large number of dead neurons.
Deep Learning - Week 9

1. What is the disadvantage of using Hierarchical Softmax?

(a) It requires more memory to store the binary tree


(b) It is slower than computing the softmax function directly
(c) It is less accurate than computing the softmax function directly
(d) It is more prone to overfitting than computing the softmax function directly

Correct Answer: (c)


Solution: The primary drawback is that the hierarchical softmax approximation can
lead to less accurate probability estimates compared to the full softmax. This is
because the binary tree structure imposes a fixed dependency on the vocabulary,
which can negatively impact the quality of the learned representations. Therefore,
the correct answer is (c).

2. Consider the following corpus: “AI driven user experience optimization. Perception
of AI decision making speed. Intelligent interface adaptation system. AI system
engineering for enhanced processing efficiency”. What is the size of the vocabulary
of the above corpus?

(a) 18
(b) 20
(c) 22
(d) 19

Correct Answer: (d)


Solution: There are 19 distinct words: [ai, driven, user, experience, optimization,
perception, of, decision, making, speed, intelligent, interface, adaptation, system,
engineering, for, enhanced, processing, efficiency] Therefore, the size of the vocabulary
is 19.
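The count can be checked with a few lines of Python (lower-casing and stripping the sentence-final periods):

```python
corpus = ("AI driven user experience optimization. "
          "Perception of AI decision making speed. "
          "Intelligent interface adaptation system. "
          "AI system engineering for enhanced processing efficiency")

vocab = {word.strip(".").lower() for word in corpus.split()}
print(len(vocab))   # 19
```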

3. We add incorrect pairs into our corpus to maximize the probability of words that occur
in the same context and minimize the probability of words that occur in different
contexts. This technique is called:

(a) Negative sampling


(b) Hierarchical softmax
(c) Contrastive estimation
(d) Glove representations

Correct Answer: (a)


Solution: Negative sampling is a technique used in word2vec and other word embed-
ding models where:
For each positive example (words that actually occur together in context) We generate
several negative examples (incorrect word pairs that don’t occur together) The model
is trained to:
Maximize probability for words that occur together in real contexts Minimize prob-
ability for the randomly sampled negative pairs
This helps the model learn meaningful word representations more efficiently

4. Let X be the co-occurrence matrix such that the (i, j)-th entry of X captures the
PMI between the i-th and j-th word in the corpus. Every row of X corresponds to the
representation of the i-th word in the corpus. Suppose each row of X is normalized
(i.e., the L2 norm of each row is 1) then the (i, j)-th entry of XX T captures the:

(a) PMI between word i and word j


(b) Euclidean distance between word i and word j
(c) Probability that word i
(d) Cosine similarity between word i and word j

Correct Answer: (d)


Solution:
Since each row of X is normalized (i.e., L2 -norm is 1), the dot product of two nor-
malized vectors corresponds to the cosine similarity between them.
The matrix XX T computes the dot products between the rows of X, which repre-
sents the similarity between the word vectors (in terms of PMI). Since the rows are
normalized, the dot product gives the cosine similarity between the word representa-
tions.
Thus, the (i, j)-th entry of XX T captures the cosine similarity between word i and
word j.
The correct answer is (d): the cosine similarity between word i and word j.

5. Suppose that we use the continuous bag of words (CBOW) model to find vector rep-
resentations of words. Suppose further that we use a context window of size 3 (that
is, given the 3 context words, predict the target word P (wt |(wi , wj , wk ))). The size
of word vectors (vector representation of words) is chosen to be 100 and the vocabu-
lary contains 20,000 words. The input to the network is the one-hot encoding (also
called 1-of-V encoding) of word(s). How many parameters (weights), excluding bias,
are there in Wword ? Enter the answer in thousands. For example, if your answer is
50,000, then just enter 50.

Correct Answer: 2000

Solution: In the CBOW model, we have:

• Vocabulary size = 20,000 words


• Embedding size (dimension of word vectors) = 100
The weight matrix Wword has dimensions 100 × 20, 000, where:

• The number of rows corresponds to the embedding size (100),


• The number of columns corresponds to the vocabulary size (20,000).

Thus, the total number of parameters (weights) is:

100 × 20, 000 = 2, 000, 000 parameters

Since the question asks for the answer in thousands, the answer is:

2000

6. You are given the one hot representation of two words below:
GEMINI= [1, 0, 0, 0, 1], CLAUDE= [0, 0, 0, 1, 0]
What is the Euclidean distance between GEMINI and CLAUDE?
Correct Answer: range(1.7,1.74)

Solution:
The Euclidean distance between two vectors A and B is given by the formula:
   d(A, B) = √( Σᵢ₌₁ⁿ (Aᵢ − Bᵢ)² )

Where:

• Ai and Bi are the components of vectors A and B, respectively.


• n is the number of dimensions (in this case, 5).

We are given:
A = [1, 0, 0, 0, 1] (for GEMINI)
B = [0, 0, 0, 1, 0] (for CLAUDE)

Now, applying the Euclidean distance formula:


   d(A, B) = √((1 − 0)² + (0 − 0)² + (0 − 0)² + (0 − 1)² + (1 − 0)²)
           = √(1² + 0² + 0² + (−1)² + 1²)
           = √(1 + 0 + 0 + 1 + 1)
           = √3
           ≈ 1.732

Therefore, the Euclidean distance between the one-hot representations of GEMINI


and CLAUDE is approximately 1.732.
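The same computation with NumPy:

```python
import numpy as np

gemini = np.array([1, 0, 0, 0, 1])
claude = np.array([0, 0, 0, 1, 0])

d = np.linalg.norm(gemini - claude)   # sqrt(1 + 1 + 1) = sqrt(3)
print(round(d, 3))   # 1.732
```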
7. Let count(w, c) be the number of times the words w and c appear together in the
corpus (i.e., occur within a window of few words around each other). Further, let
count(w) and count(c) be the total number of times the word w and c appear in the
corpus respectively and let N be the total number of words in the corpus. The PMI
between w and c is then given by:

(a) log [ count(w,c) · count(w) / (N · count(c)) ]
(b) log [ count(w,c) · count(c) / (N · count(w)) ]
(c) log [ count(w,c) · N / (count(w) · count(c)) ]

Correct Answer: (c)


Solution: Pointwise Mutual Information (PMI) is a measure of association between two
words, defined as the log of the ratio of their joint probability to the product of
their individual probabilities:

   PMI(w, c) = log [ P(w, c) / (P(w) · P(c)) ]

where P(w, c) is the probability of w and c occurring together, and P(w) and P(c) are
the probabilities of w and c occurring individually. We can estimate these
probabilities using counts:

   P(w, c) ≈ count(w, c)/N,  P(w) ≈ count(w)/N,  P(c) ≈ count(c)/N

Substituting into the PMI formula and simplifying:

   PMI(w, c) = log [ count(w, c) · N / (count(w) · count(c)) ]

This matches the expression in option (c), which is therefore the correct answer.
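The count-based formula from option (c) as a small Python helper. The counts below are hypothetical, purely for illustration, and base-2 logs are used (any base works, differing only by a constant factor):

```python
import math

def pmi(count_wc, count_w, count_c, N):
    # PMI(w, c) = log2[ count(w,c) * N / (count(w) * count(c)) ]
    return math.log2(count_wc * N / (count_w * count_c))

# Hypothetical counts: w and c co-occur 20 times in a corpus of 10,000 words.
print(round(pmi(count_wc=20, count_w=100, count_c=50, N=10_000), 3))   # log2(40) ≈ 5.322
```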

8. Consider a skip-gram model trained using hierarchical softmax for analyzing scientific
literature. We observe that the word embeddings for ‘Neuron’ and ‘Brain’ are highly
similar. Similarly, the embeddings for ‘Synapse’ and ‘Brain’ also show high similarity.
Which of the following statements can be inferred?

(a) ‘Neuron’ and ‘Brain’ frequently appear in similar contexts


(b) The model’s learned representations will indicate a high similarity between ’Neu-
ron’ and ‘Synapse’
(c) The model’s learned representations will not show a high similarity between
‘Neuron’ and ‘Synapse’
(d) According to the model’s learned representations, ‘Neuron’ and ‘Brain’ have a
low cosine similarity

Correct Answer: (a),(b)


Solution: In skip-gram models, words appearing in similar contexts are given similar
representations. Therefore, ‘Neuron’ and ‘Synapse’ will likely have high similarity in
the learned representations.
9. Suppose we are learning the representations of words using Glove representations. If
we observe that the cosine similarity between two representations vi and vj for words
‘i’ and ‘j’ is very high. which of the following statements is true?( parameter bi =
0.02 and bj = 0.07)

(a) Xij = 0.04


(b) Xij = 0.17
(c) Xij = 0
(d) Xij = 0.95

Correct Answer: (d)


Solution: GloVe representations mean:
GloVe learns word vectors such that their dot product relates to the logarithm of
words’ co-occurrence probabilities Xij represents the co-occurrence count between
words i and j bi and bj are bias terms for words i and j
Cosine similarity between vectors tells us:
If it’s very high, the words appear in very similar contexts This means they fre-
quently co-occur with the same words In GloVe, this suggests they have high direct
co-occurrence as well
0.04: Too small for high-similarity vectors
0.17: Still relatively small
0.95: High value consistent with high similarity
0: Zero co-occurrence would not result in high similarity
Therefore, option (d) (Xij = 0.95) must be correct. This high co-occurrence value would
explain the high cosine similarity between the word vectors, while being consistent
with the given bias terms. The other values (0.04, 0.17, 0) are too small to result
in vectors with very high cosine similarity, as they would indicate rare or no co-
occurrence between the words.

10. Which of the following is an advantage of using the skip-gram method over the bag-
of-words approach?

(a) The skip-gram method is faster to train


(b) The skip-gram method performs better on rare words
(c) The bag-of-words approach is more accurate
(d) The bag-of-words approach is better for short texts

Correct Answer: (b)


Solution: The skip-gram method performs better on rare words is the correct answer.
Here’s why:
Skip-gram predicts context words given a target word It learns better representations
for rare words because:
Each occurrence of a rare word gets multiple training examples (one for each context
word) Each occurrence creates several context pairs This means more learning oppor-
tunities from limited data The model updates weights more times for each rare word
appearance
a) The skip-gram method is faster to train
This is incorrect Skip-gram is actually typically slower to train than bag-of-words It
generates multiple context pairs per word, increasing training time
c) The bag-of-words approach is more accurate
This is incorrect Neither approach is universally more accurate Each has different
strengths and use cases Skip-gram often produces better quality word vectors
d) The bag-of-words approach is better for short texts
This is incorrect Bag-of-words can struggle with short texts due to sparsity Skip-gram
can capture more meaningful relationships even in short texts
Deep Learning - Week 10
1. Consider an input image of size 1000 × 1000 × 7 where 7 refers to the number of
channels (Such images do exist!). Suppose we want to apply a convolution operation
on the entire image by sliding a kernel of size 1 × 1 × d. What should be the depth d
of the kernel?

Correct Answer: 7

Solution: To apply a convolution operation on an image of size 1000 × 1000 × 7 with a


kernel of size 1 × 1 × d, the depth d of the kernel must match the number of channels
in the input image.
The number of channels in the input image is 7. For the convolution to be valid, the
kernel’s depth d should also be 7, as each kernel needs to process all channels of the
input image simultaneously.
Conclusion: The depth d of the kernel should be 7.

2. For the same input image in Q1, suppose that we apply the following kernels of
differing sizes.
K1 : 5 × 5
K2 : 7 × 7
K3 : 25 × 25
K4 : 41 × 41
K5 : 51 × 51

Assume that stride s = 1 and no zero padding. Among all these kernels which one
shrinks the output dimensions the most?

(a) K1
(b) K2
(c) K3
(d) K4
(e) K5

Correct Answer: (e)


Solution: To determine which kernel shrinks the output dimensions the most, we
can calculate the output dimensions after applying each kernel. The formula for the
output size of a convolution operation without padding and with a stride of 1 is:

Output Size = (Input Size − Kernel Size) + 1

Given the input image size is 1000 × 1000 and stride s = 1, we can calculate the
output dimensions for each kernel size.
Kernel K1 : 5 × 5
Output Size = (1000 − 5) + 1 = 996
So, the output size will be 996 × 996.
Kernel K2 : 7 × 7
Output Size = (1000 − 7) + 1 = 994
So, the output size will be 994 × 994.
Kernel K3 : 25 × 25

Output Size = (1000 − 25) + 1 = 976

So, the output size will be 976 × 976.


Kernel K4 : 41 × 41

Output Size = (1000 − 41) + 1 = 960

So, the output size will be 960 × 960.


Kernel K5 : 51 × 51

Output Size = (1000 − 51) + 1 = 950

So, the output size will be 950 × 950.


Summary:

• K1 : 996 × 996
• K2 : 994 × 994
• K3 : 976 × 976
• K4 : 960 × 960
• K5 : 950 × 950

Among all these kernels, Kernel K5 (51×51) shrinks the output dimensions the most,
resulting in an output size of 950 × 950.
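The arithmetic above can be verified with a short Python sketch of the standard output-size formula (here with the question's defaults of stride 1 and no padding):

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard convolution arithmetic: floor((n - k + 2p) / s) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

for k in (5, 7, 25, 41, 51):
    print(f"{k}x{k} kernel -> {conv_output_size(1000, k)}")
```

Running this prints 996, 994, 976, 960, and 950, confirming that the 51 × 51 kernel shrinks the output the most.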

3. Which of the following is a technique used to fool CNNs in Deep Learning?

(a) Transfer learning


(b) Dropout
(c) Batch normalization
(d) Adversarial examples

Correct Answer: (d)


Solution: Adversarial examples are images that have been specifically designed to
trick a CNN into misclassifying them. They are created by making small, impercep-
tible changes to an image that cause the CNN to output the wrong classification.
Transfer learning is a technique where a pre-trained model is fine-tuned on a new
dataset to improve performance. It is not used to fool CNNs.
Dropout is a regularization technique used to prevent overfitting in neural networks.
Batch normalization is a method used to stabilize training by normalizing activations
across mini-batches.
4. What is the motivation behind using multiple filters in one Convolution layer?

(a) Reduced complexity of the network


(b) Reduced size of the convolved image
(c) Insufficient information
(d) Each filter captures some feature of the image separately

Correct Answer: (d)


Solution: Increasing the number of filters at each layer creates more trainable param-
eters and increases the depth of the output volume. However, each filter learns to
capture some important aspect of the image. This is analogous to feature engineering
in classical machine learning.

5. Which of the following statements about CNN is (are) true?

(a) CNN is a feed-forward network


(b) Weight sharing helps CNN layers to reduce the number of parameters
(c) CNN is suitable only for natural images
(d) The shape of the input to the CNN network should be square

Correct Answer: (a),(b)


Solution: Let’s analyze each statement about Convolutional Neural Networks (CNNs):
1. CNN is suitable only for natural images False. CNNs are not limited to natural
images. They can be applied to a wide variety of data types such as medical im-
ages, time-series data (like EEG signals), audio signals (spectrograms), text (in word
embeddings or character-level), and more. CNNs are effective whenever local spatial
patterns or hierarchical features are important, regardless of the data type.
2. The shape of the input to the CNN network should be square False. The input to
a CNN does not have to be square. While many datasets like image data often use
square inputs (e.g., 28 × 28 or 224 × 224), CNNs can handle rectangular inputs as
well (e.g., 1280 × 720) as long as they maintain consistent height and width.
3. CNN is a feed-forward network True. CNN is a type of feed-forward neural
network. The information flows in one direction: from the input layer, through the
convolutional and fully connected layers, to the output layer. There is no feedback
or looping, as is common in recurrent neural networks (RNNs).
4. Weight sharing helps CNN layers to reduce the number of parameters True. Weight
sharing is a key feature of CNNs, particularly in convolutional layers. A single filter
(kernel) is applied across the entire input image, and this filter’s weights are shared
across different locations. This significantly reduces the number of parameters com-
pared to fully connected layers, where each weight is unique.

6. Consider an input image of size 100 × 100 × 1. Suppose that we used kernel of size
3×3, zero padding P = 1 and stride value S = 3. What will be the output dimension?
(a) 100 × 100 × 1
(b) 3 × 3 × 1
(c) 34 × 34 × 1
(d) 97 × 97 × 1

Correct Answer: (c)


Solution: To calculate the output dimensions after applying a convolution operation,
the formula is:
 
Output size = ⌊(Input size − Kernel size + 2P) / S⌋ + 1

Where:

• Input size is the spatial dimensions of the input image.


• Kernel size is the size of the kernel/filter.
• P is the amount of zero padding.
• S is the stride.

Given:

• Input size = 100 × 100 × 1 (we only care about the spatial dimensions, i.e.,
100 × 100).
• Kernel size = 3 × 3.
• Zero padding P = 1.
• Stride S = 3.

Let’s calculate the output dimensions for both the height and width:
 
Output size = ⌊(100 − 3 + 2(1)) / 3⌋ + 1

Simplifying:

Output size = ⌊99 / 3⌋ + 1 = 33 + 1 = 34

Therefore, the output dimensions are:

34 × 34 × 1

So, the output dimension is 34 × 34 × 1.
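The same formula, now with padding and stride, can be checked in a one-line Python helper:

```python
def conv_output_size(n, k, p=0, s=1):
    # Convolution output size: floor((n - k + 2p) / s) + 1
    return (n - k + 2 * p) // s + 1

print(conv_output_size(100, 3, p=1, s=3))  # → 34
```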

7. Consider an input image of size 100 × 100 × 3. Suppose that we use 8 kernels (filters)
each of size 1 × 1, zero padding P = 1 and stride value S = 2. How many parameters
are there? (assume no bias terms)
(a) 3
(b) 24
(c) 10
(d) 8
(e) 100

Correct Answer: (b)


Solution: To calculate the number of parameters in the convolutional layer, we need
to consider the size of each kernel (filter) and the number of kernels. Let’s break it
down step by step:
Given Information:

• Input image size: 100 × 100 × 3


• Number of kernels: 8
• Kernel size: 1 × 1
• Zero padding P = 1
• Stride S = 2

1. Number of Parameters per Kernel: Each kernel has a size of 1 × 1 and operates
on all the input channels (3 channels for the input image). Therefore, the number of
parameters in each kernel is:

Parameters per kernel = 1 × 1 × 3 = 3

2. Total Number of Parameters: Since there are 8 kernels, each with 3 parameters,
the total number of parameters is:

Total parameters = 8 × 3 = 24
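A minimal parameter-count helper (assuming, as the question states, no bias terms) confirms this:

```python
def conv_params(kernel_h, kernel_w, in_channels, num_filters, bias=False):
    # Each filter spans all input channels; an optional bias adds one per filter.
    per_filter = kernel_h * kernel_w * in_channels + (1 if bias else 0)
    return per_filter * num_filters

print(conv_params(1, 1, 3, 8))             # → 24
print(conv_params(1, 1, 3, 8, bias=True))  # → 32, if biases were included
```

Note that padding and stride affect only the output dimensions, not the parameter count.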

8. What is the purpose of guided backpropagation in CNNs?

(a) To train the CNN to improve its accuracy on a given task.


(b) To reduce the size of the input images in order to speed up computation.
(c) To visualize which pixels in an image are most important for a particular class
prediction.
(d) None of the above.

Correct Answer: (c)


Solution: Guided backpropagation is a technique used to visualize the parts of an
input image that are most important for a particular class prediction. It achieves
this by backpropagating the gradients of the output class with respect to the input
image, but only allowing positive gradients to flow through the network.

9. Which of the following statements is true regarding the occlusion experiment in a


CNN?
(a) It is a technique used to prevent overfitting in deep learning models.
(b) It is used to increase the number of filters in a convolutional layer.
(c) It is used to determine the importance of each feature map in the output of the
network.
(d) It involves masking a portion of the input image with a patch of zeroes.

Correct Answer: (c),(d)


Solution: In the occlusion experiment, a patch of zeroes is placed over a portion of
the input image to observe the effect on the output of the network. This helps to
determine the importance of each region of the image in the network’s prediction.

10. Which of the following architectures has the highest number of layers?

(a) AlexNet
(b) GoogleNet
(c) ResNet
(d) VGG

Correct Answer: (c)


Solution: Among the listed architectures, the one with the highest number of layers
is ResNet.
Here’s a brief comparison:
1. AlexNet: - Has 8 layers (5 convolutional layers followed by 3 fully connected
layers).
2. GoogleNet (Inception v1): - Has 22 layers (not counting pooling layers).
3. ResNet: - Comes in different versions, with the most common being ResNet-50,
ResNet-101, and ResNet-152. The number of layers for these are 50, 101, and 152
layers respectively, making it the deepest architecture on this list.
4. VGG: - VGG-16 has 16 layers, and VGG-19 has 19 layers.
Conclusion: ResNet, especially ResNet-152, has the highest number of layers among
the architectures listed. Therefore, the answer is:

ResNet
Deep Learning - Week 11

1. For which of the following problems are RNNs suitable?

(a) Generating a description from a given image.


(b) Forecasting the weather for the next N days based on historical weather data.
(c) Converting a speech waveform into text.
(d) Identifying all objects in a given image.

Correct Answer: (a),(b),(c)


Solution: Given an image, generate a description about it: RNNs can be used for
this task, but typically as part of a larger architecture. The image would first be pro-
cessed by a Convolutional Neural Network (CNN), and then an RNN would generate
the text description. This is called an encoder-decoder or sequence-to-sequence model.

Given the historical weather data, forecast the weather for the next N days: This is
very suitable for RNNs. Weather data is a time series, and RNNs are excellent at
processing sequential data and capturing temporal dependencies.

Given a speech waveform, convert it into text: This is also highly suitable for RNNs.
Speech recognition involves processing a sequence of audio features and outputting a
sequence of characters or words. RNNs (especially when combined with techniques
like CTC loss) are very effective for this task.

Given an image, find all objects in the image: This task is primarily suited for
Convolutional Neural Networks (CNNs), not RNNs. Object detection in images is
typically done using architectures like R-CNN, YOLO, or SSD, which are based on
CNNs.

2. Suppose that we need to develop an RNN model for sentiment classification. The
input to the model is a sentence composed of five words and the output is the sen-
timents (positive or negative). Assume that each word is represented as a vector of
length 100 × 1 and the output labels are one-hot encoded. Further, the state vector
st is initialized with all zeros of size 30 × 1. How many parameters (including bias)
are there in the network?

Correct Answer: 3992

Solution: Solution: To compute the number of parameters in the RNN for sentiment
classification, we need to consider the parameters for the following components:

(a) Input to Hidden State Weights Wxh : This weight matrix maps the input word
vector to the hidden state.
(b) Hidden to Hidden State Weights Whh : This weight matrix maps the previous
hidden state to the next hidden state.
(c) Hidden State to Output Weights Why : This weight matrix maps the hidden
state to the output.
(d) Biases: Bias vectors for both the hidden state and the output.

Given:

• Input word vector size: 100 × 1


• Hidden state size: 30 × 1
• Output size: 2 × 1 (since the sentiment classification is binary: positive or
negative)
• Sentence length: 5 words (though it does not affect the parameter count as the
same weights are used across all time steps)

Step 1: Compute Wxh (Input to Hidden State Weights)

• The input vector is of size 100 × 1, and the hidden state is of size 30 × 1.
• Therefore, the weight matrix Wxh has dimensions 30 × 100.
• Total parameters in Wxh : 30 × 100 = 3000

Step 2: Compute Whh (Hidden to Hidden State Weights)

• The hidden state at time t − 1 is of size 30 × 1, and the hidden state at time t
is also of size 30 × 1.
• Therefore, the weight matrix Whh has dimensions 30 × 30.
• Total parameters in Whh : 30 × 30 = 900

Step 3: Compute bh (Bias for Hidden State)

• The bias for the hidden state is a vector of size 30 × 1.


• Total parameters in bh : 30

Step 4: Compute Why (Hidden State to Output Weights)

• The hidden state is of size 30 × 1, and the output is of size 2 × 1 (binary


classification).
• Therefore, the weight matrix Why has dimensions 2 × 30.
• Total parameters in Why : 2 × 30 = 60

Step 5: Compute by (Bias for Output)

• The bias for the output is a vector of size 2 × 1.


• Total parameters in by : 2
Step 6: Total Parameters
Now, let’s sum up all the parameters:

Total parameters = Wxh + Whh + bh + Why + by

Total parameters = 3000 + 900 + 30 + 60 + 2 = 3992

Conclusion: The total number of parameters in the RNN is 3992 .
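The bookkeeping above can be reproduced with a small helper, sketched here for a vanilla single-layer RNN with the shapes used in this question:

```python
def rnn_param_count(input_dim, hidden_dim, output_dim):
    W_xh = hidden_dim * input_dim    # input -> hidden weights
    W_hh = hidden_dim * hidden_dim   # hidden -> hidden weights
    b_h  = hidden_dim                # hidden-state bias
    W_hy = output_dim * hidden_dim   # hidden -> output weights
    b_y  = output_dim                # output bias
    return W_xh + W_hh + b_h + W_hy + b_y

print(rnn_param_count(input_dim=100, hidden_dim=30, output_dim=2))  # → 3992
```

The sentence length (5 words) never enters the count, since the same weights are reused at every time step.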

3. Select the correct statements about GRUs

(a) GRUs have fewer parameters compared to LSTMs


(b) GRUs use a single gate to control both input and forget mechanisms
(c) GRUs are less effective than LSTMs in handling long-term dependencies
(d) GRUs are a type of feedforward neural network

Correct Answer: (a),(b)


Solution: a) GRUs have fewer parameters compared to LSTMs This is correct. GRUs
have a simpler structure with fewer gates than LSTMs, resulting in fewer parameters.
b) GRUs use a single gate to control both input and forget mechanisms This is correct.
GRUs have an update gate that combines the functions of the input and forget gates
found in LSTMs.
c) GRUs are less effective than LSTMs in handling long-term dependencies This is
generally not correct. GRUs can be as effective as LSTMs in many tasks, including
those involving long-term dependencies. Their performance is often comparable to
LSTMs.
d) GRUs are a type of feedforward neural network This is incorrect. GRUs are a type
of recurrent neural network (RNN), not a feedforward neural network.

4. What is the main advantage of using GRUs over traditional RNNs?

(a) They are simpler to implement


(b) They solve the vanishing gradient problem
(c) They require less computational power
(d) They can handle non-sequential data

Correct Answer: (b)


Solution: True. GRUs, like LSTMs, help mitigate the vanishing gradient problem by
using gating mechanisms (update and reset gates) that regulate the flow of informa-
tion and gradients over time, allowing them to learn long-term dependencies more
effectively than traditional RNNs.

5. The statement that LSTM and GRU solves both the problem of vanishing and ex-
ploding gradients in RNN is

(a) True
(b) False

Correct Answer: (b)


Solution: False. While LSTM (Long Short-Term Memory) and GRU (Gated Re-
current Unit) significantly mitigate the vanishing and exploding gradient problems
compared to vanilla RNNs, they do not completely solve these issues. They use gat-
ing mechanisms (like the forget, input, and output gates in LSTM, or the reset and
update gates in GRU) that help preserve and control the flow of information over long
sequences, reducing the impact of vanishing and exploding gradients. However, under
certain conditions, especially with very deep networks or extremely long sequences,
these problems can still occur.

6. What is the vanishing gradient problem in training RNNs?

(a) The weights of the network converge to zero during training


(b) The gradients used for weight updates become too large
(c) The network becomes overfit to the training data
(d) The gradients used for weight updates become too small

Correct Answer: (d)


Solution: The vanishing gradient problem is a common issue in training RNNs where
the gradients used for weight updates become too small, making it difficult to learn
long-term dependencies in the input sequence. This can lead to poor performance
and slow convergence during training.

7. What is the role of the forget gate in an LSTM network?

(a) To determine how much of the current input should be added to the cell state.
(b) To determine how much of the previous time step’s cell state should be retained.
(c) To determine how much of the current cell state should be output.
(d) To determine how much of the current input should be output.

Correct Answer: (b)


Solution: The forget gate in an LSTM network determines how much of the previous
cell state to forget and how much to keep for the current time step.

8. How does LSTM prevent the problem of vanishing gradients?

(a) Different activation functions, such as ReLU, are used instead of sigmoid in
LSTM.
(b) Gradients are normalized during backpropagation.
(c) The learning rate is increased in LSTM.
(d) Forget gates regulate the flow of gradients during backpropagation.
Correct Answer: (d)
Solution: Due to forget gates controlling the flow, the gradient will only vanish if the
previous states didn’t contribute during the forward pass. So if the information flows
during the forward pass gradient doesn’t vanish.

9. We are given an RNN with ||W || = 2.5. The activation function used in the RNN is
logistic. What can we say about ∇ = ∂s20/∂s1?

(a) Value of ∇ is very high.


(b) Value of ∇ is close to 0
(c) Value of ∇ is 2.5
(d) Insufficient information to say anything.

Correct Answer: (b)


Solution: The gradient ∇ = ∂s20/∂s1 is influenced by the repeated multiplication of
the weight matrix W over multiple time steps. Specifically, it can be approximated
as:

∇ ≈ ∏_{t=2}^{20} f′(st) · W

where f ′ (s) is the derivative of the logistic activation function, which is:

f ′ (s) = σ(s)(1 − σ(s))

where σ(s) is the sigmoid function. The maximum value of f ′ (s) occurs at s = 0,
giving f ′ (s) ≈ 0.25.
Now, approximating the gradient magnitude:

∇ ≈ (0.25 × 2.5)^19 = (0.625)^19

Since 0.625 < 1, exponentiating it to a large power (like 19) results in a very small
value, approaching 0. This suggests that the gradients will diminish significantly,
leading to the vanishing gradient problem.
Conclusion:
Since ∇ is very small, the correct answer is:

(b) Value of ∇ is close to 0.
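A one-line numeric check of this bound, using the maximum sigmoid derivative 0.25 and ||W|| = 2.5 over the 19 steps between t = 1 and t = 20:

```python
# Upper bound on |∂s20/∂s1|: each of the 19 steps contributes at most 0.25 * 2.5.
bound = (0.25 * 2.5) ** 19
print(bound)  # ≈ 1.3e-4, i.e. the gradient has effectively vanished
```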

10. Select the true statements about BPTT?


(a) The gradients of Loss with respect to parameters are added across time steps
(b) The gradients of Loss with respect to parameters are subtracted across time
steps
(c) The gradient may vanish or explode, in general,if timesteps are too large
(d) The gradient may vanish or explode if timesteps are too small

Correct Answer: (a),(c)


Solution:
Here are the correct statements about Backpropagation Through Time (BPTT):
1. The gradients of Loss with respect to parameters are added across time steps:
- True. In BPTT, gradients are accumulated (added) over all the time steps when
calculating the total gradient of the loss with respect to parameters. This is because
the same parameters are used at each time step in the recurrent network.
2. The gradients of Loss with respect to parameters are subtracted across time steps:
- False. Gradients are not subtracted; they are added across time steps, as described
above.
3. The gradient may vanish or explode, in general, if timesteps are too large: - True.
The vanishing or exploding gradient problem is a common issue in RNNs, especially
when the number of time steps is large. As the gradient is propagated back through
many time steps, it can either shrink exponentially (vanish) or grow exponentially
(explode).
4. The gradient may vanish or explode if timesteps are too small: - False. The
problem of vanishing or exploding gradients typically arises when the number of time
steps is large, not when they are small.
Deep Learning - Week 12

1. What is the primary purpose of the attention mechanism in neural networks?

(a) To reduce the size of the input data


(b) To increase the complexity of the model
(c) To eliminate the need for recurrent connections
(d) To focus on specific parts of the input sequence

Correct Answer: (d)


Solution: The attention mechanism allows the model to weigh and prioritize different
parts of the input sequence dynamically, giving more focus to the most relevant
portions when making predictions or generating outputs. This is particularly useful
in tasks like machine translation and image captioning, where different parts of the
input may have varying levels of importance.

2. Which of the following are the benefits of using attention mechanisms in neural net-
works?

(a) Improved handling of long-range dependencies


(b) Enhanced interpretability of model predictions
(c) Ability to handle variable-length input sequences
(d) Reduction in model complexity

Correct Answer: (a),(b),(c)


Solution: Attention mechanisms help maintain long-range dependencies in input se-
quences, unlike traditional RNNs.
Attention weights can sometimes be interpreted to understand which parts of the
input influenced the model’s decisions.
Attention is inherently flexible to variable-length input sequences, especially in tasks
like machine translation.

3. If we make the vocabulary for an encoder-decoder model using the given sentence.
What will be the size of our vocabulary?
Sentence: Attention mechanisms dynamically identify critical input components, en-
hancing contextual understanding and boosting performance

(a) 13
(b) 14
(c) 15
(d) 16

Correct Answer: (c)


Solution: There are 13 unique words in our vocabulary. We will add two additional
tokens, <GO> and <STOP>, to the vocabulary. Hence the size of the vocabulary
will be 15.
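This count can be verified with a short Python sketch (tokenizing naively on whitespace and stripping the trailing comma; a real preprocessing pipeline would be more careful):

```python
sentence = ("Attention mechanisms dynamically identify critical input components, "
            "enhancing contextual understanding and boosting performance")

# Lowercase and strip punctuation so "components," counts as one word.
tokens = [w.strip(",.").lower() for w in sentence.split()]
vocab = set(tokens) | {"<GO>", "<STOP>"}  # add the special decoder tokens
print(len(set(tokens)), len(vocab))  # 13 unique words, 15 vocabulary entries
```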
4. We are performing the task of Machine Translation using an encoder-decoder model.
Choose the equation representing the Encoder model.

(a) s0 = CN N (xi )
(b) s0 = RN N (st−1 , e(ŷt−1 ))
(c) s0 = RN N (xit )
(d) s0 = RN N (ht−1 , xit )

Correct Answer: (d)


Solution: The encoder in an encoder-decoder model processes the input sequence by
recursively updating its hidden state using the current input xit and the previous
hidden state ht−1 . This is expressed by the equation:

s0 = RN N (ht−1 , xit )

5. Which of the following attention mechanisms is most commonly used in the Trans-
former model architecture?

(a) Additive attention


(b) Dot product attention
(c) Multiplicative attention
(d) None of the above

Correct Answer: (b)


Solution: In the Transformer model architecture, the most commonly used attention
mechanism is dot-product attention (also referred to as scaled dot-product atten-
tion). This method calculates the attention scores by taking the dot product of the
query and key vectors, which allows for efficient computation and parallelization.
While additive attention and multiplicative attention are also valid attention mecha-
nisms, they are not the primary focus in the Transformer architecture.

6. Which of the following is NOT a component of the attention mechanism?

(a) Decoder
(b) Key
(c) Value
(d) Query
(e) Encoder

Correct Answer: (a),(e)


Solution: In the attention mechanism, the only components are the Query, Key, and
Value. The encoder and the decoder are parts of the overall architecture that use
attention but are not components of the attention mechanism itself.

7. In a hierarchical attention network, what are the two primary levels of attention?
(a) Character-level and word-level
(b) Word-level and sentence-level
(c) Sentence-level and document-level
(d) Paragraph-level and document-level

Correct Answer: (b)


Solution: Character-level and word-level: While character-level attention can be used
in some NLP tasks, it’s not typically one of the primary levels in a hierarchical
attention network. Word-level and sentence-level: This is the correct answer. In a
typical hierarchical attention network:
Word-level attention helps to identify important words within each sentence. Sentence-
level attention then works on these sentence representations to identify important
sentences within the document.
Sentence-level and document-level: While sentence-level attention is correct, document-
level typically refers to the overall output rather than a separate level of attention.
Paragraph-level and document-level: Paragraph-level attention is not commonly used
as one of the primary levels in standard hierarchical attention networks.
Therefore, the correct answer is:
Word-level and sentence-level
This structure allows the network to build a hierarchical representation of the docu-
ment, first by attending to important words to create sentence embeddings, and then
by attending to important sentences to create the document embedding. This mimics
the natural structure of many text documents and allows the model to capture both
local (word-level) and global (sentence-level) contextual information.

8. Which of the following are the advantages of using attention mechanisms in encoder-
decoder models?

(a) Reduced computational complexity


(b) Ability to handle variable-length input sequences
(c) Improved gradient flow during training
(d) Automatic feature selection
(e) Reduced memory requirements

Correct Answer: (b),(c),(d)


Solution: Explanation: B) Attention allows handling of variable-length inputs
C) Attention provides shorter paths for gradient flow
D) Attention automatically selects relevant features
A and E are incorrect as attention often increases computational and memory re-
quirements.

9. In the encoder-decoder architecture with attention, where is the context vector typi-
cally computed?

(a) In the encoder


(b) In the decoder
(c) Between the encoder and decoder
(d) After the decoder

Correct Answer: (c)


Solution: In the encoder-decoder architecture with attention, the context vector is
typically computed between the encoder and decoder. The attention mechanism uses
the encoder’s hidden states and calculates attention weights that highlight relevant
parts of the input. The weighted sum of the encoder’s hidden states is then used to
form the context vector, which is passed to the decoder to assist in generating the
output sequence.

10. Which of the following output functions is most commonly used in the decoder of an
encoder-decoder model for translation tasks?

(a) Softmax
(b) Sigmoid
(c) ReLU
(d) Tanh

Correct Answer: (a)


Solution: In an encoder-decoder model for translation, the decoder’s output layer
typically produces a probability distribution over the target vocabulary, and Softmax
is used to convert raw scores (logits) into probabilities that sum to 1.
The Softmax function is applied after the decoder’s output passes through a linear
layer. It converts the raw scores (logits) into a probability distribution over the target
vocabulary. Each value in the output represents the probability of a specific word
being the next token in the sequence. This is essential in translation tasks, as it allows
the model to select the most likely word at each step of the sequence generation.
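For illustration, here is a minimal, numerically stable softmax applied to a toy logit vector (the logits are made up and not tied to any particular model):

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)  # a valid probability distribution: non-negative, sums to 1
```

The largest logit receives the largest probability, which is why the decoder can pick the most likely next token by taking the argmax of this distribution.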
