CST414-SCHEME
PART A
Answer all questions, each carries 3 marks. Marks
1 (3)
Train with more data, Data augmentation, Addition of noise to the input data, Feature
selection, Cross-validation, Regularization, Ensembling, Early stopping, Adding dropout
layers
Any 3 methods- 3 marks
2 X = X1*W1 + X2*W2 + X3*W3 + b (3)
Given b = 0, assume initial weights.
Output, Y = σ(X)
i.e. σ(x) = 1/(1 + e^(−x))
Equation- 1 mark
Solution-2 marks
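A minimal sketch of the expected computation; the question's specific inputs and weights are not reproduced here, so the values below are hypothetical:
import math

x = [0.8, 0.6, 0.4]    # hypothetical inputs X1, X2, X3
w = [0.1, 0.3, -0.2]   # hypothetical initial weights W1, W2, W3
b = 0.0                # bias given as 0
net = sum(xi * wi for xi, wi in zip(x, w)) + b   # X = X1*W1 + X2*W2 + X3*W3 + b
y = 1.0 / (1.0 + math.exp(-net))                 # Y = sigmoid(X)
print(net, y)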
3 a) Dataset augmentation – 3 marks (3)
4 L2 Regularization is a commonly used technique in ML systems and is also sometimes (3)
referred to as "Weight Decay". It works by adding a quadratic term to the Cross
Entropy Loss Function L, called the Regularization Term, which results in a new
Loss Function LR given by Eqn 1 below.
The Regularization Term consists of the sum of the squares of all the link weights
in the DLN, multiplied by a parameter λ called the Regularization Parameter. This
is another Hyperparameter whose appropriate value needs to be chosen as part of
the training process by using the validation data set. By choosing a value for this
parameter, we decide on the relative importance of the Regularization Term vs the
Loss Function term. Note that the Regularization Term does not include the biases,
since in practice it has been found that their inclusion does not make much of a
difference to the final result. The value of λ governs the relative importance of
the Cross Entropy term (L) vs the Regularization Term, and as λ increases, the
system tends to favour smaller and smaller weight values.
LR = L + λ Σ w²   (sum over all link weights w)   ……………Eqn 1
Explanation-3 marks
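A minimal NumPy sketch of Eqn 1; the weight matrices, λ value and helper name are hypothetical:
import numpy as np

def l2_regularized_loss(cross_entropy_loss, weight_matrices, lam):
    # LR = L + lambda * (sum of squares of all link weights); biases are excluded
    penalty = sum(np.sum(W ** 2) for W in weight_matrices)
    return cross_entropy_loss + lam * penalty

# hypothetical usage: two weight matrices of a small network
W1 = np.random.randn(4, 3)
W2 = np.random.randn(3, 2)
print(l2_regularized_loss(0.42, [W1, W2], lam=0.01))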
5 Because consecutive layers are only partially connected and because it heavily (3)
reuses its weights, a CNN has many fewer parameters than a fully connected DNN,
which makes it much faster to train, reduces the risk of overfitting, and requires
much less training data.
Any two advantages- 3 marks
6 Output: [(n + 2p − f)/s + 1] × [(n + 2p − f)/s + 1] × K (3)
i.e. 62 × 62 × 2.
Equation-1 mark
Solution-2 marks
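A small sketch of the output-size formula; the question's n, f, p, s are not reproduced here, so the values below are hypothetical ones consistent with the stated answer 62 × 62 × 2 (e.g. n = 64, f = 3, p = 0, s = 1, K = 2):
def conv_output_side(n, f, p, s):
    # spatial output dimension: (n + 2p - f)/s + 1
    return (n + 2 * p - f) // s + 1

n, f, p, s, K = 64, 3, 0, 1, 2      # hypothetical values
side = conv_output_side(n, f, p, s)
print(side, side, K)                # 62 62 2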
7 (3)
The basic formula of an RNN is shown below:
h(t) = f(h(t−1), x(t); θ)
It basically says the current hidden state h(t) is a function f of the previous hidden
state h(t−1) and the current input x(t). θ denotes the parameters of the function f.
The network typically learns to use h(t) as a kind of lossy summary of the task-relevant
aspects of the past sequence of inputs up to t. Unfolding maps the left-hand graph to the
right-hand one in the figure below, where the black square indicates that an interaction
takes place with a delay of 1 time step, from the state at time t to the state at time
t + 1. Unfolding/parameter sharing is better than using different parameters per
position: fewer parameters to estimate, and generalization to sequences of various lengths.
Diagram- 2 marks
Explanation -1 mark
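A minimal NumPy sketch of the recurrence h(t) = f(h(t−1), x(t); θ), taking f to be a tanh of an affine map and using hypothetical sizes:
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b):
    # h(t) = f(h(t-1), x(t); theta): the same parameters theta = (W_hh, W_xh, b)
    # are reused at every time step (parameter sharing)
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

rng = np.random.default_rng(0)
W_hh, W_xh, b = rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4)
h = np.zeros(4)                          # initial hidden state
for x_t in rng.normal(size=(5, 3)):      # a length-5 input sequence
    h = rnn_step(h, x_t, W_hh, W_xh, b)  # h is a lossy summary of the inputs so far
print(h)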
8 A recursive network has a computational graph that generalizes that of the recurrent (3)
network from a chain to a tree.
Diagram- 2 marks
Explanation -1 mark
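A minimal NumPy sketch of the idea (hypothetical tree and sizes): the same composition function, with shared parameters, is applied at every internal node of a tree instead of along a chain:
import numpy as np

def compose(left, right, W, b):
    # shared parameters (W, b) applied at every internal node of the tree
    return np.tanh(W @ np.concatenate([left, right]) + b)

rng = np.random.default_rng(1)
d = 4
W, b = rng.normal(size=(d, 2 * d)), np.zeros(d)
x1, x2, x3 = rng.normal(size=(3, d))              # hypothetical leaf representations
root = compose(compose(x1, x2, W, b), x3, W, b)   # tree ((x1, x2), x3)
print(root)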
9 Representation learning- 3 marks (3)
10 Importance of deep learning in natural language processing- 3 marks (3)
PART B
Answer one full question from each module, each carries 14 marks.
Module I
11 a) (10)
b) (4)
(2 marks)
Bipolar sigmoid, Y = (1 − e^(−x))/(1 + e^(−x)) = 0.697/1.303 = 0.5349 (2 marks)
b) Importance of step size (learning rate) in neural networks (6)
• It determines the subset of the local optima to which the algorithm can converge
(see the sketch below).
• Large step sizes can cause the algorithm to overstep local minima.
6 marks
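An illustrative sketch (hypothetical one-dimensional objective f(x) = x²) of how the step size affects convergence:
def gradient_descent(lr, steps=20, x=5.0):
    # minimize f(x) = x^2, whose gradient is 2x
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

print(gradient_descent(lr=0.1))   # small step size: converges towards the minimum at 0
print(gradient_descent(lr=1.1))   # too large: overshoots and diverges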
Module II
13 a) Stochastic gradient descent (SGD), in contrast, performs a parameter update (8)
for each training example x(i) and label y(i):
θ = θ − η ∇θ J(θ; x(i), y(i))
Momentum is a method that helps accelerate SGD in the relevant direction and
dampens oscillation. It does this by adding a fraction γ of the update vector of
the past time step to the current update vector:
v(t) = γ v(t−1) + η ∇θ J(θ)
θ = θ − v(t)
Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on the
way (until it reaches its terminal velocity if there is air resistance, i.e. γ<1). The
same thing happens to our parameter updates: The momentum term increases
for dimensions whose gradients point in the same directions and reduces
updates for dimensions whose gradients change directions. As a result, we gain
faster convergence and reduced oscillation.
Equations- 4 marks
Explanation- 4 marks
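A minimal NumPy sketch of the two update rules above, on a hypothetical quadratic objective J(θ) = ½‖θ‖² whose gradient is θ:
import numpy as np

def sgd_momentum(grad, theta, steps=50, lr=0.1, gamma=0.9):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = gamma * v + lr * grad(theta)   # v(t) = gamma*v(t-1) + eta*grad J(theta)
        theta = theta - v                  # theta = theta - v(t)
    return theta

theta0 = np.array([4.0, -3.0])
print(sgd_momentum(lambda t: t, theta0))   # moves towards the minimum at the origin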
b) Early Stopping is one of the most popular, and also effective, techniques to prevent (6)
overfitting. Use the validation data set to compute the loss function at the end of
each training epoch, and once the loss stops decreasing, stop the training and use
the test data to compute the final classification accuracy. In practice it is more
robust to wait until the validation loss has stopped decreasing for four or five
successive epochs before stopping. The point at which the validation loss starts to
increase is when the model starts to overfit the training data, since from this point
onwards its generalization ability starts to decrease. Early Stopping can be used by
itself or in combination with other Regularization techniques.
Explanation- 6 marks
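A schematic sketch of the loop described above; train_one_epoch and validation_loss are hypothetical callables standing in for a real training setup:
def train_with_early_stopping(train_one_epoch, validation_loss, patience=5, max_epochs=100):
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                    # one pass over the training data
        loss = validation_loss()             # loss on the validation set
        if loss < best_loss:
            best_loss, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1  # validation loss did not improve
        if epochs_without_improvement >= patience:
            break                            # stop: the model is starting to overfit
    return best_loss

# dummy usage: validation losses that improve and then worsen
losses = iter([1.0, 0.8, 0.7, 0.72, 0.74, 0.75, 0.9, 1.0, 1.1, 1.2])
print(train_with_early_stopping(lambda: None, lambda: next(losses), patience=3, max_epochs=10))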
OR
14 a) Explanation- 7 marks (7)
b) • Can handle sparse gradients on noisy datasets. (7)
• Default hyperparameter values do well on most problems.
• Computationally efficient.
• Requires little memory, thus memory efficient.
• Works well on large datasets.
Advantages- 7 marks
Module III
15 a) (10)
Diagram-5 marks
Explanation- 5 marks
b) Convolutional networks can be used to output a high-dimensional structured object, (4)
rather than just predicting a class label for a classification task or a real value for
regression tasks. E.g., the model might emit a tensor S where S(i, j, k) is the probability
that pixel (j, k) of the input belongs to class i.
Explanation- 4 marks
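An illustrative NumPy sketch (hypothetical shapes) of such a structured output tensor S, where S[i, j, k] is the probability that pixel (j, k) belongs to class i:
import numpy as np

rng = np.random.default_rng(0)
num_classes, H, W = 3, 4, 4
scores = rng.normal(size=(num_classes, H, W))                    # raw per-pixel class scores
S = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)   # softmax over the class axis
print(S.sum(axis=0))                                             # each pixel's probabilities sum to 1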
OR
16 a) Dilated convolution (8)
Transposed Convolution
Separable convolution
Variants- 8 marks
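A brief PyTorch sketch of the three variants named above (assuming torch is installed; channel sizes are hypothetical, and depthwise separable convolution is shown as a common form of separable convolution):
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)                                       # hypothetical input

dilated = nn.Conv2d(8, 16, kernel_size=3, dilation=2, padding=2)    # dilated (atrous) convolution
transposed = nn.ConvTranspose2d(8, 16, kernel_size=2, stride=2)     # transposed convolution (learned upsampling)
separable = nn.Sequential(
    nn.Conv2d(8, 8, kernel_size=3, padding=1, groups=8),            # depthwise: one filter per input channel
    nn.Conv2d(8, 16, kernel_size=1),                                 # pointwise: 1x1 convolution mixes channels
)

print(dilated(x).shape, transposed(x).shape, separable(x).shape)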
b) Sparse representation (6)
Equivariance to translation
Parameter sharing
Explanation- 2 marks each
Module IV
17 a) (8)
Diagram- 4 marks
Explanation- 4 marks
b) (6)
Diagram- 4 marks
Explanation- 2 marks
OR
18 a) (9)
Diagram- 5 marks
Explanation- 4 marks
b) Explanation with necessary equations-5 marks (5)
Module V
19 a) Any 2 methods (Word2Vec, GloVe)- 5 marks each (10)
b) Application of deep learning in Speech Recognition- 4 marks (4)
OR
20 a) Merits- 3.5 marks (7)
Demerits- 3.5 marks
b) Boltzmann Machine – 3.5 marks (7)
Deep Belief Network – 3.5 marks
****