DL Assignment Solutions
[Figure: points C and G shown in the plane together with the weight vector w.]
(a) True
(b) False
3. Suppose that we multiply the weight vector w by −1. Then the same points G and
C will be classified as?
4. Which of the following can be achieved using the perceptron algorithm in machine
learning?
(a) Grouping similar data points into clusters, such as organizing customers based
on purchasing behavior.
(b) Solving optimization problems, such as finding the maximum profit in a business
scenario.
(c) Classifying data, such as determining whether an email is spam or not.
(d) Finding the shortest path in a graph, such as determining the quickest route
between two cities.
6. We know from the lecture that the decision boundary learned by the perceptron is a line in R^2. We also observed that it divides the entire space of R^2 into two regions. Suppose that the input vector x ∈ R^4; then the perceptron decision boundary will divide the whole R^4 space into how many regions?
(a) It depends on whether the data points are linearly separable or not.
(b) 3
(c) 4
(d) 2
(e) 5
8. Consider the following table, where x1 and x2 are features (packed into a single vector x = [x1, x2]^T) and y is a label:
x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 1
Suppose that the perceptron model is used to classify the data points. Suppose further that the weights w are initialized to w = [1, 1]^T. The following rule is used for classification:
y = 1 if w^T x > 0
y = 0 if w^T x ≤ 0
The perceptron learning algorithm is used to update the weight vector w. Then, how many times will the weight vector w be updated during the entire training process?
(a) 0
(b) 1
(c) 2
(d) Not possible to determine
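For reference, a minimal Python sketch (an assumption-laden illustration, not part of the original solution) that runs the standard perceptron update w ← w + (y − ŷ)·x, with no bias term, on the table above and counts the updates:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])
w = np.array([1.0, 1.0])          # initial weights from the question

updates = 0
for _ in range(10):               # a few passes suffice for this tiny dataset
    changed = False
    for xi, yi in zip(X, y):
        pred = 1 if w @ xi > 0 else 0
        if pred != yi:            # perceptron update: w <- w + (y - pred) * x
            w = w + (yi - pred) * xi
            updates += 1
            changed = True
    if not changed:               # no mistakes in a full pass: converged
        break

print(updates)                    # 0: every point is already classified correctly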
(a) 1
(b) 2
(c) 3
(d) 4
(e) 5
Correct Answer: (c)
Solution: Suppose we set θ = 4; then the sum of the inputs never exceeds 3, so the neuron will never fire. And if we set θ < 3, the neuron will not implement the AND operator.
10. Consider the points shown in the picture. The weight vector is w = [−1, 1]^T. As per this weight vector, which classes will the Perceptron algorithm predict for the data points x1 and x2?
NOTE:
y = 1 if w^T x > 0
y = −1 if w^T x ≤ 0
[Figure: the weight vector w and two data points x1 = (1.5, 2) and x2 = (−2.5, −2).]
(a) x1 = −1
(b) x1 = 1
(c) x2 = −1
(d) x2 = 1
1. Which of the following statements is(are) true about the following function?
σ(z) = 1/(1 + e^(−z))
2. How many weights does a neural network have if it consists of an input layer with 2
neurons, two hidden layers each with 5 neurons, and an output layer with 2 neurons?
Assume there are no bias terms in the network.
Correct Answer: 45
Solution: Number of weights = (2 ∗ 5) + (5 ∗ 5) + (5 ∗ 2) = 45.
3. A function f (x) is approximated using 100 tower functions. What is the minimum
number of neurons required to construct the network that approximates the function?
(a) 99
(b) 100
(c) 101
(d) 200
(e) 201
(f) 251
4. Suppose we have a Multi-layer Perceptron with an input layer, one hidden layer and
an output layer. The hidden layer contains 32 perceptrons. The output layer contains
one perceptron. Choose the statement(s) that are true about the network.
(a) Each perceptron in the hidden layer can take in only 32 Boolean inputs
(b) Each perceptron in the hidden layer can take in only 5 Boolean inputs
(c) The network is capable of implementing 2^5 Boolean functions
(d) The network is capable of implementing 2^32 Boolean functions
Correct Answer: (d)
Solution: In the lecture, we have seen that if the hidden layer contains 2^n neurons, where n is the number of inputs, then the network can implement all Boolean functions of n inputs. There are 2^(2^n) such Boolean functions; with n = 5 inputs this is 2^(2^5) = 2^32.
5. Consider a function f (x) = x3 − 5x2 + 5. What is the updated value of x after 2nd
iteration of the gradient descent update, if the learning rate is 0.1 and the initial value
of x is 5?
Correct Answer: range(3.1,3.2)
Solution: We are tasked to find the updated value of x after the second iteration of gradient descent for the function
f(x) = x^3 − 5x^2 + 5, with derivative f′(x) = 3x^2 − 10x.
The gradient descent update rule is x_new = x − η · f′(x), with
• Initial x = 5
• Learning rate η = 0.1
Iteration 1: at x = 5, f′(5) = 3(25) − 10(5) = 25. Update x:
x_new = 5 − 0.1 · 25 = 5 − 2.5 = 2.5
Iteration 2: at x = 2.5, f′(2.5) = 3(6.25) − 10(2.5) = −6.25. Update x:
x_new = 2.5 − 0.1 · (−6.25) = 2.5 + 0.625 = 3.125
Final Answer: the updated value of x after the second iteration is
x = 3.125
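A minimal Python sketch verifying the two iterations above:

def grad(x):
    return 3 * x**2 - 10 * x      # f'(x) = 3x^2 - 10x

x, eta = 5.0, 0.1
for i in range(2):
    x = x - eta * grad(x)
    print(f"iteration {i + 1}: x = {x}")
# iteration 1: x = 2.5
# iteration 2: x = 3.125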
6. Consider the sigmoid function σ(x) = 1/(1 + e^(−(wx+b))), where w is a positive value. Select all the correct statements regarding this function.
(a) Increasing the value of b shifts the sigmoid function to the right (i.e., towards
positive infinity)
(b) Increasing the value of b shifts the sigmoid function to the left (i.e., towards
negative infinity)
(c) Increasing the value of w decreases the slope of the sigmoid function
(d) Increasing the value of w increases the slope of the sigmoid function
7. You are training a model using the gradient descent algorithm and notice that the
loss decreases and then increases after each successive epoch (pass through the data).
Which of the following techniques would you employ to enhance the likelihood of the
gradient descent algorithm converging? (Here, η refers to the step size.)
(a) Set η = 1
(b) Set η = 0
(c) Decrease the value of η
(d) Increase the value of η
8. The diagram below shows three functions f , g and h. The function h is obtained by
combining the functions f and g. Choose the right combination that generated h.
[Figure: plots of f(x) and g(x) on the interval [−1, 1] (both taking values between 0 and 1), and the combined function h(x) (taking values between 0 and 0.5).]
(a) h = f − g
(b) h = 0.5 ∗ (f + g)
(c) h = 0.5 ∗ (f − g)
(d) h = 0.5 ∗ (g − f )
Derivation of h = 0.5 · (g − f): subtracting f from g leaves a non-zero value only in the region between the two transitions, and scaling by 0.5 matches the observed peak height of h(x). Thus, the equation h = 0.5 · (g − f) describes the observed behavior of h(x), i.e., h(x) = 0.5 · (g(x) − f(x)).
[Figure: two data points (0.06, 0.4) and (0.4, 0.95) plotted in the plane.]
Suppose that the sigmoid function given below is used to fit these data points:
f(x) = 1/(1 + e^(−(20x+1)))
Compute the Mean Square Error (MSE) loss L(w, b)
(a) 0
(b) 0.126
(c) 1.23
(d) 1
Correct Answer: (b)
Solution: The given sigmoid function is
f(x) = 1/(1 + e^(−(20x+1)))
and the Mean Square Error (MSE) loss is defined as
L(w, b) = (1/n) Σ_{i=1}^{n} (f(x_i) − y_i)^2
For the two data points: f(0.06) = σ(2.2) ≈ 0.900, so (0.900 − 0.4)^2 ≈ 0.2502; f(0.4) = σ(9) ≈ 0.9999, so (0.9999 − 0.95)^2 ≈ 0.0025. Hence
L(w, b) = (1/2) · 0.2527 = 0.12635
L(w, b) ≈ 0.12635
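A minimal Python sketch recomputing this MSE for the two data points read off the figure:

import math

def f(x):
    return 1.0 / (1.0 + math.exp(-(20 * x + 1)))

data = [(0.06, 0.4), (0.4, 0.95)]
mse = sum((f(x) - y) ** 2 for x, y in data) / len(data)
print(round(mse, 3))   # ~0.126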
10. Suppose that we implement the XOR Boolean function using the network shown
below. Consider the statement that "A hidden layer with two neurons is sufficient to implement XOR". The statement is
[Figure: the XOR network with inputs x1 and x2, a hidden layer (red edges w = −1, blue edges w = +1, bias = −2), output weights w1, w2, w3, w4, and output y.]
(a) True
(b) False
Conclusion: While the given network uses 4 hidden neurons, it is overparameterized for the XOR problem. Two hidden neurons are mathematically sufficient, because XOR(x1, x2) can be written as (x1 OR x2) AND NOT (x1 AND x2): one hidden neuron computes OR, one computes AND, and the output neuron combines them. Hence the statement is True.
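As a sanity check, a small Python sketch (an illustration, not the network in the figure) showing that two hidden threshold neurons plus one output neuron implement XOR:

def step(z):
    return 1 if z > 0 else 0

def xor(x1, x2):
    h_or  = step(x1 + x2 - 0.5)          # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)          # fires only if both inputs are 1
    return step(h_or - h_and - 0.5)      # OR(x1, x2) AND NOT AND(x1, x2)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))           # outputs 0, 1, 1, 0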
1. How many parameters (including biases) are there in the entire network?
Correct Answer: 2274
Solution:
Number of Parameters
Input Layer to h1 : 200 × 10 + 10 = 2010
h1 to h2 : 10 × 10 + 10 = 110
h2 to h3 : 10 × 10 + 10 = 110
h3 to Output Layer: 10 × 4 + 4 = 44
Total Parameters: 2010 + 110 + 110 + 44 = 2274
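A minimal Python sketch (assuming the architecture implied by the solution: a 200-dimensional input, three hidden layers of 10 units each, and 4 outputs) that recomputes the parameter count including biases:

layers = [200, 10, 10, 10, 4]

total = 0
for fan_in, fan_out in zip(layers[:-1], layers[1:]):
    total += fan_in * fan_out + fan_out   # weights + biases of this layer
print(total)                              # 2274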
2. Suppose all elements in the input vector are zero, and the corresponding true label is
also 0. Further, suppose that all the parameters (weights and biases) are initialized
to zero. What is the loss value if the cross-entropy loss function is used? Use the
natural logarithm (ln).
Correct Answer: Range(1.317,1.455)
Solution:
Loss with Zero Inputs and Parameters Input: x = 0, weights and biases = 0.
Hidden Layers: σ(0) = 0.5.
Output Layer Logits: [0, 0, 0, 0].
Softmax: Softmax(z_i) = 1/4 for all i.
Cross-Entropy Loss: −ln(1/4) = ln(4) ≈ 1.386.
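A minimal Python sketch of this zero-initialization case: all logits are 0, the softmax is uniform over the 4 classes, and the cross-entropy for the true class is ln 4:

import numpy as np

logits = np.zeros(4)
probs = np.exp(logits) / np.exp(logits).sum()   # [0.25, 0.25, 0.25, 0.25]
true_class = 0                                  # the true label is class 0
loss = -np.log(probs[true_class])
print(loss)                                     # ~1.386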
[Figure: a network with an input layer x, weight matrices W1, W2, W3, pre-activation a1, a first hidden layer h^(1) with 9 units, pre-activation a2, a second hidden layer h^(2), and an output O.]
In the diagram, W1 is a matrix and x, a1, h1, and O are all column vectors. The notation Wi[j, :] denotes the j-th row of the matrix Wi, Wi[:, j] denotes the j-th column of the matrix Wi, and Wk[i, j] denotes the element at the i-th row and j-th column of the matrix Wk.
(a) W1 ∈ R^(3×9)
(b) a1 ∈ R^(9×5)
(c) W1 ∈ R^(9×3)
(d) a1 ∈ R^(1×9)
(e) W1 ∈ R^(1×9)
(f) a1 ∈ R^(9×1)
(a) Logistic
(b) Step function
(c) Softmax
(d) linear
7. Given two probability distributions p and q, under what conditions is the cross entropy
between them minimized?
8. Given that the probability of Event A occurring is 0.18 and the probability of Event
B occurring is 0.92, which of the following statements is correct?
The following network doesn’t contain any biases and the weights of the network are
given below:
W1 = [[1, 1, 3], [2, −1, 1], [1, 2, −2]],   W2 = [[1, 1, 2], [3, 1, 1]],   W3 = [1  2]
The input to the network is: x = [1, 2, 1]^T
The target value is: y = 5
9. What is the predicted output for the given input x after doing the forward pass?
Correct Answer: Range(2.9,3.0)
Solution: Doing the forward pass in the network we get
h1 = W1 · x = [[1, 1, 3], [2, −1, 1], [1, 2, −2]] · [1, 2, 1]^T = [6, 1, 3]^T
a1 = sigmoid(h1) = [0.997, 0.731, 0.952]^T
h2 = W2 · a1 = [[1, 1, 2], [3, 1, 1]] · [0.997, 0.731, 0.952]^T = [3.632, 4.674]^T
a2 = sigmoid(h2) = [0.974, 0.990]^T
y = [1  2] · [0.974, 0.990]^T = 2.954
10. Compute and enter the loss between the output generated by input x and the true
output y.
Correct Answer: Range(3.97,4.39)
Solution: Loss = (5 − 2.954)^2 = 4.1861
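A minimal NumPy sketch reproducing the forward pass and squared-error loss above (small differences from the hand computation are due to rounding):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = np.array([[1, 1, 3], [2, -1, 1], [1, 2, -2]])
W2 = np.array([[1, 1, 2], [3, 1, 1]])
W3 = np.array([[1, 2]])
x = np.array([1, 2, 1])

a1 = sigmoid(W1 @ x)        # ~[0.998, 0.731, 0.953]
a2 = sigmoid(W2 @ a1)       # ~[0.974, 0.991]
y_hat = (W3 @ a2).item()    # ~2.95
loss = (5 - y_hat) ** 2     # ~4.18
print(y_hat, loss)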
Deep Learning - Week 4
1. Using the Adam optimizer with β1 = 0.9, β2 = 0.999, and ϵ = 10−8 , what would be
the bias-corrected first moment estimate after the first update if the initial gradient
is 4?
(a) 0.4
(b) 4.0
(c) 3.6
(d) 0.44
m_t = β1 · m_{t−1} + (1 − β1) · g_t
m_t^corrected = m_t / (1 − β1^t)
m_1 = (1 − 0.9) · 4 = 0.4
m_1^corrected = 0.4 / (1 − 0.9^1) = 0.4 / 0.1 = 4
Therefore, the bias-corrected first moment estimate after the first update is 4.
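A minimal Python sketch of the first Adam update's bias-corrected first moment (β1 = 0.9, initial m0 = 0, gradient g1 = 4):

beta1 = 0.9
m, g, t = 0.0, 4.0, 1

m = beta1 * m + (1 - beta1) * g      # m1 = 0.4
m_hat = m / (1 - beta1 ** t)         # bias correction: 0.4 / 0.1 = 4.0
print(m, m_hat)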
(a) 5,000
(b) 50,000
(c) 500
(d) 5
3. In a stochastic gradient descent algorithm, the learning rate starts at 0.1 and decays
exponentially with a decay rate of 0.1 per epoch. What will be the learning rate after
5 epochs?
(a) 0.09
(b) 0.059
(c) 0.05
(d) 0.061
ηt = η0 ∗ e−kt
where η0 is the initial learning rate, k is the decay rate, and t is the number of epochs.
Plugging in the values: η_5 = 0.1 × e^(−0.1×5) = 0.1 × e^(−0.5) ≈ 0.0607 ≈ 0.061, which corresponds to option (d).
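A one-line Python check of this exponential decay schedule:

import math
eta0, k, t = 0.1, 0.1, 5
print(eta0 * math.exp(-k * t))   # ~0.0607, i.e. option (d) 0.061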
(a) True
(b) False
(c) Cannot say
6. What is the primary benefit of using Adagrad compared to other optimization algo-
rithms?
7. What are the benefits of using stochastic gradient descent compared to vanilla gra-
dient descent?
8. Select the true statements about the factor β used in the momentum based gradient
descent algorithm.
(a) Setting β = 0.1 allows the algorithm to move faster than the vanilla gradient
descent algorithm
(b) Setting β = 0 makes it equivalent to the vanilla gradient descent algorithm
(c) Setting β = 1 makes it equivalent to the vanilla gradient descent algorithm
(d) Oscillation around the minimum will be less if we set β = 0.1 than setting
β = 0.99
The momentum update rule is
v_{t+1} = β · v_t + η · ∇w_t,   w_{t+1} = w_t − v_{t+1}
where:
– v_t is the velocity (momentum term).
– β is the momentum factor.
– ∇w_t is the gradient of the loss with respect to the weight at time t.
– η is the learning rate.
Setting β = 0.1 allows the algorithm to move faster than the vanilla (plain)
gradient descent algorithm: - When β is set to a small positive value like 0.1, the
algorithm incorporates some momentum, which can help accelerate convergence by
navigating more effectively through shallow regions of the loss surface. This statement
is generally true.
Setting β = 1 makes it equivalent to vanilla gradient descent algorithm: -
If β = 1, the velocity term vt+1 would solely depend on the previous velocity vt and
would not incorporate the current gradient ∇wt . This effectively stalls the learning
process. However, the claim that it makes it equivalent to vanilla gradient descent
(which does not use momentum) is incorrect. Vanilla gradient descent updates weights
purely based on the gradient without momentum.
Setting β = 0 makes it equivalent to vanilla gradient descent algorithm:
- When β = 0, the velocity term vt+1 is directly proportional to the current gradi-
ent ∇wt . This reduces the momentum-based gradient descent to the plain gradient
descent update rule. Thus, this statement is true.
Oscillation around the minimum will be less if we set β = 0.1 than setting
β = 0.99: - Higher values of β (close to 1) result in more momentum, which can cause
larger oscillations around the minimum due to the higher inertia. A lower value of
β like 0.1 results in less momentum, leading to reduced oscillations. Therefore, this
statement is true.
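For illustration (not part of the original solution), a minimal Python sketch of the update v_{t+1} = β·v_t + η·∇w_t, w_{t+1} = w_t − v_{t+1} on a toy quadratic loss; setting β = 0 reduces exactly to vanilla gradient descent, while β close to 1 produces heavier oscillation:

def grad(w):
    return 2 * w                      # gradient of L(w) = w^2

def run(beta, eta=0.1, steps=20):
    w, v = 5.0, 0.0
    for _ in range(steps):
        v = beta * v + eta * grad(w)  # momentum accumulation
        w = w - v
    return w

print(run(beta=0.0))   # vanilla gradient descent
print(run(beta=0.1))   # mild momentum
print(run(beta=0.99))  # heavy momentum: larger oscillations around the minimum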
9. What is the advantage of using mini-batch gradient descent over batch gradient de-
scent?
(a) Mini-batch gradient descent is more computationally efficient than batch gradi-
ent descent.
(b) Mini-batch gradient descent leads to a more accurate estimate of the gradient
than batch gradient descent.
(c) Mini batch gradient descent gives us a better solution.
(d) Mini-batch gradient descent can converge faster than batch gradient descent.
10. In the Nesterov Accelerated Gradient (NAG) algorithm, the gradient is computed at:
1. Which of the following is the most appropriate description of the method used in
PCA to achieve dimensionality reduction?
(a) PCA achieves this by discarding a random subset of features in the dataset
(b) PCA achieves this by selecting those features in the dataset along which the
variance of the dataset is maximised
(c) PCA achieves this by retaining those features in the dataset along which the variance of the dataset is minimised
(d) PCA achieves this by looking for those directions in the feature space along
which the variance of the dataset is maximised
(a) 1
(b) 3
(c) 9
(d) 5
(e) 8
The singular values of A are the positive square roots of the eigenvalues of AT A.
Therefore,
σ1 = 10 and σ2 = 5.
Correct Answer: b)
Solution: x̄ = [0.3333, 0.3333]^T, so
x1 − x̄ = [−2.33333, 1.66667]^T,  x2 − x̄ = [1.66667, −2.33333]^T,  x3 − x̄ = [0.666667, 0.666667]^T
Now, let's calculate (x − x̄)(x − x̄)^T for each point:
For x1: [[5.44444, −3.88889], [−3.88889, 2.77778]]
For x2: [[2.77778, −3.88889], [−3.88889, 5.44444]]
For x3: [[0.444444, 0.444444], [0.444444, 0.444444]]
Sum these matrices, Σ_{i=1}^{n} (x_i − x̄)(x_i − x̄)^T, and multiply by 1/n = 1/3:
C = [[2.88889, −2.44444], [−2.44444, 2.88889]]
Therefore, the correct covariance matrix is [[2.88889, −2.44444], [−2.44444, 2.88889]].
(a) 1
(b) 5.33
(c) 0.44
(d) 0.5
Correct Answer: b)
Solution: C = [[2.88889, −2.44444], [−2.44444, 2.88889]]
This gives us two eigenvalues: λ1 = 0.44445 and λ2 = 5.33333.
The maximum eigenvalue is λ2 ≈ 5.33.
9. The eigenvector corresponding to the maximum eigenvalue of the given matrix C is:
(a) [1, 1]^T
(b) [−1, 1]^T
(c) [0.67, 0]^T
(d) [−1.48, 1]^T
Correct Answer: b)
Solution: Using the maximum eigenvalue found earlier, we solve the equation (C − λI)v = 0 to find the eigenvector v. The eigenvector corresponding to the maximum eigenvalue is [−1, 1]^T.
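A minimal NumPy sketch recomputing the covariance matrix and its eigen-decomposition, assuming the three 2-D points recovered from the deviations above (x1 = (−2, 2), x2 = (2, −2), x3 = (1, 1)):

import numpy as np

X = np.array([[-2.0, 2.0],
              [ 2.0, -2.0],
              [ 1.0, 1.0]])
x_bar = X.mean(axis=0)                          # [0.333, 0.333]
C = (X - x_bar).T @ (X - x_bar) / len(X)
print(C)                                        # [[2.889, -2.444], [-2.444, 2.889]]

eigvals, eigvecs = np.linalg.eigh(C)
print(eigvals)                                  # [0.444, 5.333]
print(eigvecs[:, -1])                           # eigenvector of the largest eigenvalue,
                                                # proportional to [-1, 1] (up to sign)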
10. Given that A is a 2 × 2 matrix, what is the determinant of A, if its eigenvalues are 6 and 7?
Correct Answer: 42
Solution: The determinant of a matrix is defined as the product of its eigenvalues. Therefore,
if a matrix has eigenvalues λ1 and λ2, its determinant is given by det(A) = λ1 ∗ λ2.
Deep Learning - Week 6
Solution: Overcomplete autoencoders have more hidden units than input units, which can increase the capacity of the network and allow it to learn more complex and nonlinear representations of the input data.
4. Suppose we build a neural network for a 5-class classification task. Suppose for a
single training example, the true label is [0 1 0 0 1] while the predictions by the
neural network are [0.4 0.25 0.2 0.1 0.6]. What would be the value of cross-entropy
loss for this example? (Answer up to two decimal places, Use base 2 for log-related
calculations)
Correct Answer: range(2.7, 2.8)
Solution: Cross entropy loss is given by −Σ_{i=1}^{5} y_i log2(ŷ_i)
= −0 · log2 (0.4) − 1 · log2 (0.25) − 0 · log2 (0.2) − 0 · log2 (0.1) − 1 · log2 (0.6)
= −1 · log2 (0.25) − 1 · log2 (0.6)
= −1 · −2 − 1 · −0.7369
= 2.7369
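A minimal Python sketch of this base-2 cross-entropy computation:

import math

y     = [0, 1, 0, 0, 1]
y_hat = [0.4, 0.25, 0.2, 0.1, 0.6]
loss = -sum(t * math.log2(p) for t, p in zip(y, y_hat))
print(round(loss, 2))   # 2.74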
(a) 5
(b) 4
(c) 2
(d) 0
(e) 6
[Figure: candidate network diagrams for the answer options — (b): 3 inputs, a hidden layer with 2 units, 1 output; (c): 3 inputs, a hidden layer with 4 units, 2 outputs; (d): 4 inputs, a hidden layer with 2 units, 4 outputs.]
7. What is the primary reason for adding corruption to the input data in a denoising
autoencoder?
Correct Answer: b)
Solution: Adding corruption to the input data in a denoising autoencoder serves the
purpose of forcing the model to learn robust features that can reconstruct the original
input even when parts of it are missing or noisy. This process prevents the model
from merely memorizing the training data, thereby enhancing its ability to generalize
to new, unseen data. This generalization is crucial for the model’s performance on
real-world tasks where the input may not always be clean or complete.
8. Suppose for one data point we have features x1, x2, x3, x4, x5 as −4, 6, 2.8, 0, 17.3. Then, which of the following functions should we use at the output layer (decoder)?
(a) Linear
(b) Logistic
(c) Relu
(d) Tanh
(a) It adds a penalty term to the loss function that is proportional to the absolute
value of the weights.
(b) It results in sparse solutions for w.
(c) It adds a penalty term to the loss function that is proportional to the square of
the weights.
(d) It is equivalent to adding Gaussian noise to the weights.
f̂2(x) = w0 + w1x + w2x^2 + w4x^4 + w5x^5
y = 7x^3 + 12x^2 + x + 2.
We fit the two models fˆ1 (x) and fˆ2 (x) on this data and train them using a neural
network.
(a) The dropout probability p can be different for each hidden layer
(b) Batch gradient descent cannot be used to update the parameters of the network
(c) Dropout with p = 0.5 acts as an ensemble regularizer
(d) The weights of the neurons which were dropped during the forward propagation at the t-th iteration will not get updated during the (t + 1)-th iteration
(a) The dropout probability p can be different for each hidden layer:
• True. It is common practice to apply different dropout rates to different
hidden layers, which allows for more control over the regularization strength
applied to each layer.
(b) Batch gradient descent cannot be used to update the parameters of
the network:
• False. Batch gradient descent, as well as mini-batch gradient descent, can
be used to update the parameters of a network with dropout regularization.
Dropout affects the training phase by randomly dropping neurons but does
not prevent the use of gradient descent algorithms for parameter updates.
(c) Dropout with p = 0.5 acts as an ensemble regularizer:
• True. Dropout with p = 0.5 can be seen as an ensemble method in the sense
that, during training, different subsets of neurons are active, which can
be interpreted as training a large number of “thinned” networks. During
testing, the full network is used but with the weights scaled to account for
the dropout, effectively acting as an ensemble of these thinned networks.
(d) The weights of the neurons which were dropped during the forward
propagation at t-th iteration will not get updated during t + 1-th it-
eration:
• False. Dropout resamples the set of dropped neurons at every mini-batch iteration, so a neuron that was dropped during the forward propagation at iteration t may well be active at iteration t + 1; in that case its weights receive gradients through backpropagation and are updated as usual.
5. We have trained four different models on the same dataset using various hyperparam-
eters. The training and validation errors for each model are provided below. Based
on this information, which model is likely to perform best on the test dataset?
Model Training error Validation error
1 0.8 1.4
2 2.5 0.5
3 1.7 1.7
4 0.2 0.6
(a) Model 1
(b) Model 2
(c) Model 3
(d) Model 4
∂L/∂w = 0.8w
∂L/∂b = 14b
Setting these partial derivatives to zero:
0.8w = 0 =⇒ w = 0
14b = 0 =⇒ b = 0
∇L(w∗ , b∗ ) = (0, 0) .
0 + 0 = 0.
∂²L/∂w² = 0.8
∂²L/∂b² = 14
∂²L/∂w∂b = ∂²L/∂b∂w = 0
Thus, the Hessian matrix is:
H_L(w, b) = [[0.8, 0], [0, 14]].
9. Compute the Eigenvalues and Eigenvectors of the Hessian. According to the eigen-
values of the Hessian, which parameter is the loss more sensitive to?
(a) b
(b) w
10. Consider the problem of recognizing an alphabet (in upper case or lower case) of the English language in an image. There are 26 alphabets in the language. Therefore, a team decided to use a CNN to solve this problem. Suppose that a data augmentation technique is being used for regularization. Then which of the following transformation(s) on all the training images is (are) appropriate for the problem?
1. What are the challenges associated with using the Tanh(x) activation function?
2. Which of the following problems makes training a neural network harder while using
sigmoid as the activation function?
(a) Not-continuous at 0
(b) Not-differentiable at 0
(c) Saturation
(d) Computationally expensive
4. We have observed that the sigmoid neuron has become saturated. What might be
the possible output values at this neuron?
(a) 0.0666
(b) 0.589
(c) 0.9734
(d) 0.498
(e) 1
6. Which of the following are common issues caused by saturating neurons in deep
networks?
7. Given a neuron initialized with weights w1 = 0.9, w2 = 1.7, and inputs x1 = 0.4,
x2 = −0.7, calculate the output of a ReLU neuron.
Correct Answer: 0
Solution: The weighted sum is 0.9 × 0.4 + 1.7 × (−0.7) = 0.36 − 1.19 = −0.83. ReLU
outputs the max of 0 and the input, so the result is max(0, −0.83) = 0.
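A minimal Python sketch of this ReLU neuron computation (no bias term, as in the question):

w = [0.9, 1.7]
x = [0.4, -0.7]
z = sum(wi * xi for wi, xi in zip(w, x))   # 0.36 - 1.19 = -0.83
print(max(0.0, z))                         # ReLU output: 0.0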
8. Which of the following is incorrect with respect to the batch normalization process
in neural networks?
(a) We normalize the output produced at each layer before feeding it into the next
layer
(b) Batch normalization leads to a better initialization of weights.
(c) Backpropagation can be used after batch normalization
(d) Variance and mean are not learnable parameters.
10. How can you tell if your network is suffering from the Dead ReLU problem?
2. Consider the following corpus: “AI driven user experience optimization. Perception
of AI decision making speed. Intelligent interface adaptation system. AI system
engineering for enhanced processing efficiency”. What is the size of the vocabulary
of the above corpus?
(a) 18
(b) 20
(c) 22
(d) 19
3. We add incorrect pairs into our corpus to maximize the probability of words that occur
in the same context and minimize the probability of words that occur in different
contexts. This technique is called:
4. Let X be the co-occurrence matrix such that the (i, j)-th entry of X captures the
PMI between the i-th and j-th word in the corpus. Every row of X corresponds to the
representation of the i-th word in the corpus. Suppose each row of X is normalized
(i.e., the L2 norm of each row is 1) then the (i, j)-th entry of XX T captures the:
5. Suppose that we use the continuous bag of words (CBOW) model to find vector rep-
resentations of words. Suppose further that we use a context window of size 3 (that
is, given the 3 context words, predict the target word P (wt |(wi , wj , wk ))). The size
of word vectors (vector representation of words) is chosen to be 100 and the vocabu-
lary contains 20,000 words. The input to the network is the one-hot encoding (also
called 1-of-V encoding) of word(s). How many parameters (weights), excluding bias,
are there in Wword ? Enter the answer in thousands. For example, if your answer is
50,000, then just enter 50.
Wword maps the one-hot encoded word (of size 20,000) to a 100-dimensional vector, so it contains 20,000 × 100 = 2,000,000 weights. Since the question asks for the answer in thousands, the answer is:
2000
6. You are given the one hot representation of two words below:
GEMINI= [1, 0, 0, 0, 1], CLAUDE= [0, 0, 0, 1, 0]
What is the Euclidean distance between GEMINI and CLAUDE?
Correct Answer: range(1.7,1.74)
Solution:
The Euclidean distance between two vectors A and B is given by the formula:
d(A, B) = sqrt( Σ_{i=1}^{n} (A_i − B_i)^2 )
We are given:
A = [1, 0, 0, 0, 1] (for GEMINI)
B = [0, 0, 0, 1, 0] (for CLAUDE)
d(A, B) = sqrt(1 + 0 + 0 + 1 + 1) = sqrt(3) ≈ 1.732
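A minimal Python sketch of this distance computation:

import math

A = [1, 0, 0, 0, 1]   # GEMINI
B = [0, 0, 0, 1, 0]   # CLAUDE
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(A, B)))
print(d)              # sqrt(3) ~ 1.732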
8. Consider a skip-gram model trained using hierarchical softmax for analyzing scientific
literature. We observe that the word embeddings for ‘Neuron’ and ‘Brain’ are highly
similar. Similarly, the embeddings for ‘Synapse’ and ‘Brain’ also show high similarity.
Which of the following statements can be inferred?
10. Which of the following is an advantage of using the skip-gram method over the bag-
of-words approach?
Correct Answer: 7
2. For the same input image in Q1, suppose that we apply the following kernels of
differing sizes.
K1 :5×5
K2 :7×7
K3 : 25 × 25
K4 : 41 × 41
K5 : 51 × 51
Assume that stride s = 1 and no zero padding. Among all these kernels which one
shrinks the output dimensions the most?
(a) K1
(b) K2
(c) K3
(d) K4
(e) K5
Given the input image size is 1000 × 1000 and stride s = 1, we can calculate the
output dimensions for each kernel size.
Kernel K1 : 5 × 5
Output Size = (1000 − 5) + 1 = 996
So, the output size will be 996 × 996.
Kernel K2 : 7 × 7
Output Size = (1000 − 7) + 1 = 994
So, the output size will be 994 × 994.
Kernel K3 : 25 × 25
• K1 : 996 × 996
• K2 : 994 × 994
• K3 : 976 × 976
• K4 : 960 × 960
• K5 : 950 × 950
Among all these kernels, Kernel K5 (51×51) shrinks the output dimensions the most,
resulting in an output size of 950 × 950.
6. Consider an input image of size 100 × 100 × 1. Suppose that we used kernel of size
3×3, zero padding P = 1 and stride value S = 3. What will be the output dimension?
(a) 100 × 100 × 1
(b) 3 × 3 × 1
(c) 34 × 34 × 1
(d) 97 × 97 × 1
The output dimension is given by: Output size = ⌊(W − K + 2P)/S⌋ + 1, where W is the input size, K the kernel size, P the padding, and S the stride.
Given:
• Input size = 100 × 100 × 1 (we only care about the spatial dimensions, i.e.,
100 × 100).
• Kernel size = 3 × 3.
• Zero padding P = 1.
• Stride S = 3.
Let’s calculate the output dimensions for both the height and width:
Output size = (100 − 3 + 2(1))/3 + 1
Simplifying:
Output size = 99/3 + 1 = 33 + 1 = 34
So the output dimension is 34 × 34 × 1.
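A minimal Python sketch of the standard output-size formula ⌊(W − K + 2P)/S⌋ + 1, applied to this question and to the earlier kernel-size question:

def conv_out(w, k, p, s):
    return (w - k + 2 * p) // s + 1

print(conv_out(100, 3, 1, 3))   # 34 -> output is 34 x 34 x 1
print(conv_out(1000, 51, 0, 1)) # 950, as in Q2 for the 51 x 51 kernel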
7. Consider an input image of size 100 × 100 × 3. Suppose that we use 8 kernels (filters)
each of size 1 × 1, zero padding P = 1 and stride value S = 2. How many parameters
are there? (assume no bias terms)
(a) 3
(b) 24
(c) 10
(d) 8
(e) 100
1. Number of Parameters per Kernel: Each kernel has a size of 1 × 1 and operates on all the input channels (3 channels for the input image). Therefore, the number of parameters in each kernel is 1 × 1 × 3 = 3.
2. Total Number of Parameters: Since there are 8 kernels, each with 3 parameters,
the total number of parameters is:
Total parameters = 8 × 3 = 24
(a) AlexNet
(b) GoogleNet
(c) ResNet
(d) VGG
ResNet
Deep Learning - Week 11
Given the historical weather data, forecast the weather for the next N days: This is
very suitable for RNNs. Weather data is a time series, and RNNs are excellent at
processing sequential data and capturing temporal dependencies.
Given a speech waveform, convert it into text: This is also highly suitable for RNNs.
Speech recognition involves processing a sequence of audio features and outputting a
sequence of characters or words. RNNs (especially when combined with techniques
like CTC loss) are very effective for this task.
Given an image, find all objects in the image: This task is primarily suited for
Convolutional Neural Networks (CNNs), not RNNs. Object detection in images is
typically done using architectures like R-CNN, YOLO, or SSD, which are based on
CNNs.
2. Suppose that we need to develop an RNN model for sentiment classification. The
input to the model is a sentence composed of five words and the output is the sen-
timents (positive or negative). Assume that each word is represented as a vector of
length 100 × 1 and the output labels are one-hot encoded. Further, the state vector
st is initialized with all zeros of size 30 × 1. How many parameters (including bias)
are there in the network?
Solution: To compute the number of parameters in the RNN for sentiment classification, we need to consider the parameters for the following components:
(a) Input to Hidden State Weights Wxh : This weight matrix maps the input word
vector to the hidden state.
(b) Hidden to Hidden State Weights Whh : This weight matrix maps the previous
hidden state to the next hidden state.
(c) Hidden State to Output Weights Why : This weight matrix maps the hidden
state to the output.
(d) Biases: Bias vectors for both the hidden state and the output.
Given:
• The input vector is of size 100 × 1, and the hidden state is of size 30 × 1.
• Therefore, the weight matrix Wxh has dimensions 30 × 100.
• Total parameters in Wxh : 30 × 100 = 3000
• The hidden state at time t − 1 is of size 30 × 1, and the hidden state at time t is also of size 30 × 1.
• Therefore, the weight matrix Whh has dimensions 30 × 30.
• Total parameters in Whh: 30 × 30 = 900
• The output is one-hot encoded over the two sentiment classes, so Why has dimensions 2 × 30, giving 2 × 30 = 60 parameters.
• Biases: 30 (hidden state) + 2 (output) = 32.
• Total parameters: 3000 + 900 + 60 + 32 = 3992.
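A minimal Python sketch (assuming the sizes used in the solution: 100-dimensional inputs, a 30-dimensional state, and 2 output classes inferred from the positive/negative one-hot labels) that tallies all parameters including biases:

d_in, d_h, d_out = 100, 30, 2

W_xh = d_h * d_in        # 3000
W_hh = d_h * d_h         # 900
W_hy = d_out * d_h       # 60
b_h, b_y = d_h, d_out    # 30 + 2

print(W_xh + W_hh + W_hy + b_h + b_y)   # 3992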
5. The statement that LSTM and GRU solves both the problem of vanishing and ex-
ploding gradients in RNN is
(a) True
(b) False
(a) To determine how much of the current input should be added to the cell state.
(b) To determine how much of the previous time step’s cell state should be retained.
(c) To determine how much of the current cell state should be output.
(d) To determine how much of the current input should be output.
(a) Different activation functions, such as ReLU, are used instead of sigmoid in
LSTM.
(b) Gradients are normalized during backpropagation.
(c) The learning rate is increased in LSTM.
(d) Forget gates regulate the flow of gradients during backpropagation.
Correct Answer: (d)
Solution: Because the forget gates regulate the flow, the gradient will only vanish if the previous states did not contribute during the forward pass. So if information flows through during the forward pass, the gradient does not vanish during the backward pass.
9. We are given an RNN with ||W|| = 2.5. The activation function used in the RNN is logistic. What can we say about ∇ = ∂s20/∂s1?
∇ ≈ ∏_{t=2}^{20} f′(s_t) W
where f′(s) is the derivative of the logistic activation function, which is f′(s) = σ(s)(1 − σ(s)), where σ(s) is the sigmoid function. The maximum value of f′(s) occurs at s = 0, giving f′(s) ≈ 0.25.
Now, approximating the gradient magnitude:
∇ ≈ (0.25 × 2.5)^19 = (0.625)^19
Since 0.625 < 1, exponentiating it to a large power (like 19) results in a very small
value, approaching 0. This suggests that the gradients will diminish significantly,
leading to the vanishing gradient problem.
Conclusion:
Since ∇ is very small, the gradient vanishes as it is propagated back through the 20 time steps; the correct answer is that ∇ ≈ 0 (vanishing gradient).
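A one-line Python check of the magnitude estimate used above:

print((0.25 * 2.5) ** 19)   # ~1.3e-4, effectively vanishing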
2. Which of the following are the benefits of using attention mechanisms in neural net-
works?
3. If we make the vocabulary for an encoder-decoder model using the given sentence.
What will be the size of our vocabulary?
Sentence: Attention mechanisms dynamically identify critical input components, en-
hancing contextual understanding and boosting performance
(a) 13
(b) 14
(c) 15
(d) 16
(a) s0 = CNN(xi)
(b) s0 = RNN(st−1, e(ŷt−1))
(c) s0 = RNN(xit)
(d) s0 = RNN(ht−1, xit)
s0 = RNN(ht−1, xit)
5. Which of the following attention mechanisms is most commonly used in the Trans-
former model architecture?
(a) Decoder
(b) Key
(c) Value
(d) Query
(e) Encoder
7. In a hierarchical attention network, what are the two primary levels of attention?
(a) Character-level and word-level
(b) Word-level and sentence-level
(c) Sentence-level and document-level
(d) Paragraph-level and document-level
8. Which of the following are the advantages of using attention mechanisms in encoder-
decoder models?
9. In the encoder-decoder architecture with attention, where is the context vector typi-
cally computed?
10. Which of the following output functions is most commonly used in the decoder of an
encoder-decoder model for translation tasks?
(a) Softmax
(b) Sigmoid
(c) ReLU
(d) Tanh