
0400CST414052303

DRAFT Scheme of Valuation/Answer Key


(Scheme of evaluation (marks in brackets) and answers of problems/key)
APJ ABDUL KALAM TECHNOLOGICAL UNIVERSITY
EIGHTH SEMESTER B.TECH DEGREE(S) EXAMINATION, OCTOBER 2023 (2019 SCHEME)
Course Code: CST414
Course Name: DEEP LEARNING
Max. Marks: 100 Duration: 3 Hours

PART A
Answer all questions, each carries 3 marks. Marks

1 (3)
Train with more data, Data augmentation, Addition of noise to the input data, Feature
selection, Cross-validation, Regularization, Ensembling, Early stopping, Adding dropout
layers
Any 3 methods- 3 marks
2 X = X1*W1 + X2*W2 + X3*W3 + b (3)
Given b = 0, assume initial weights.
Output, Y = σ(X)
i.e. Y = 1/(1 + e^(-X))

Equation- 1 mark
Solution-2 marks
3 a) Dataset augmentation – 3 marks (3)
4 L2 Regularization is a commonly used technique in ML systems and is also sometimes (3)
referred to as "Weight Decay". It works by adding a quadratic term to the Cross
Entropy Loss Function L, called the Regularization Term, which results in a new
Loss Function LR given by:
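A standard form of the regularized loss, assuming the conventional 1/2 factor (the exact scaling in the original key may differ), is

L_R = L + \frac{\lambda}{2} \sum_i w_i^2

where the sum runs over all the link weights w_i in the network.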

The Regularization Term consists of the sum of the squares of all the link weights
in the DLN, multiplied by a parameter λ called the Regularization Parameter. This
is another Hyperparameter whose appropriate value needs to be chosen as part of
the training process by using the validation data set. By choosing a value for this
parameter, we decide on the relative importance of the Regularization Term vs the
Loss Function term. Note that the Regularization Term does not include the biases,
since in practice it has been found that their inclusion does not make much of a
difference to the final result. The value of λ governs the relative importance of
the Cross Entropy term (L) vs the regularization term, and as λ increases, the system
tends to favour smaller and smaller weight values.

The weight update rule for L2 regularization is


w_i ← w_i - η(∂L/∂w_i + λ w_i) = (1 - ηλ) w_i - η ∂L/∂w_i ………Eqn 1
Explanation-3 marks

5 Because consecutive layers are only partially connected and because it heavily (3)
reuses its weights, a CNN has many fewer parameters than a fully connected DNN,
which makes it much faster to train, reduces the risk of overfitting, and requires
much less training data.
Any two advantages- 3 marks
6 Output: [(n + 2p - f)/s + 1] × [(n + 2p - f)/s + 1] × K (3)

n = 64, f = 5, p = 1, s = 1, K = 2

Output dimension = ((64 + 2 - 5)/1 + 1) × ((64 + 2 - 5)/1 + 1) × 2

i.e. 62 × 62 × 2.
Equation-1 mark
Solution-2 marks
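As a quick cross-check of this formula, here is a minimal Python sketch (the function and argument names are illustrative, not part of the key):

def conv_output_shape(n, f, p, s, k):
    # Spatial side of the output feature map: (n + 2p - f)/s + 1
    side = (n + 2 * p - f) // s + 1
    return (side, side, k)

# Values from the question: 64x64 input, 5x5 filters, padding 1, stride 1, 2 filters
print(conv_output_shape(64, 5, 1, 1, 2))   # -> (62, 62, 2)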
7 (3)
Basic formula of RNN is shown below:
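A standard way to write it, matching the notation in the explanation that follows, is

h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)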

It basically says that the current hidden state h(t) is a function f of the previous hidden
state h(t-1) and the current input x(t), where θ denotes the parameters of the function f.
The network typically learns to use h(t) as a kind of lossy summary of the task-relevant
aspects of the past sequence of inputs up to time t. Unfolding maps the left-hand (recurrent)
form to the right-hand (unrolled) form in the figure below.

where the black square indicates that an interaction takes place with a delay of 1 time
step, from the state at time t to the state at time t + 1. Unfolding/parameter sharing is
better than using different parameters per position: there are fewer parameters to estimate,
and the model generalizes to sequences of varying length.
Diagram- 2 marks


Explanation -1 mark
8 A recursive network has a computational graph that generalizes that of the recurrent (3)
network from a chain to a tree.

A variable-length sequence x(1), x(2), …, x(t) can be mapped to a fixed-size representation
(the output o) with a fixed set of parameters (the weight matrices U, V, W).

Diagram- 2 marks
Explanation -1 mark
9 Representation learning- 3 marks (3)
10 Importance of deep learning in natural language processing- 3 marks (3)
PART B
Answer one full question from each module, each carries 14 marks.
Module I
11 a) (10)

Multilayer perceptron- Diagram- 3 marks, Explanation- 2 marks


Weight updation rule using gradient descent- 5 marks


b) (4)

Activation functions- 4 marks


OR
12 a) (8)
Y = 0.3*(-0.21) + 0.5*0.53 + 0.8*0.31 + 0.6*0.82 + 1*0.25
= -0.063 + 0.265 + 0.248 + 0.492 + 0.25 = 1.192 (4 marks)
Bipolar sigmoid function, f(x) = (1 - e^(-x)) / (1 + e^(-x)) (2 marks)
Bipolar sigmoid(Y) = (1 - e^(-1.192)) / (1 + e^(-1.192)) = 0.697/1.303 = 0.5349 (2 marks)
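A short Python check of this calculation (the variable names and the bipolar_sigmoid helper are illustrative, not part of the key):

import math

x = [0.3, 0.5, 0.8, 0.6, 1.0]        # inputs
w = [-0.21, 0.53, 0.31, 0.82, 0.25]  # weights
y = sum(xi * wi for xi, wi in zip(x, w))   # net input

def bipolar_sigmoid(v):
    # (1 - e^(-v)) / (1 + e^(-v))
    return (1 - math.exp(-v)) / (1 + math.exp(-v))

print(round(y, 3))                   # 1.192
print(round(bipolar_sigmoid(y), 4))  # 0.5342 (the key's 0.5349 follows from rounding to 0.697/1.303 first)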
b) Importance of step size in neural networks (6)
• The step size determines the subset of the local optima to which the algorithm can converge.
• Large step sizes can cause the algorithm to overstep local minima.
6 marks
Module II
13 a) Stochastic gradient descent (SGD) in contrast performs a parameter update (8)
for each training example x(i) and label y(i):
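The per-example update, in its standard form (η denotes the learning rate and J the objective function), is

\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta; x^{(i)}, y^{(i)})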

Batch gradient descent performs redundant computations for large datasets, as


it recomputes gradients for similar examples before each parameter update. SGD
does away with this redundancy by performing one update at a time. It is
therefore usually much faster. SGD performs frequent updates with a high
variance, which causes the objective function to fluctuate heavily.


Momentum is a method that helps accelerate SGD in the relevant direction and
dampens oscillation. It does this by adding a fraction γ of the update vector of
the past time step to the current update vector:
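In the standard formulation (with the same η and J as above), the momentum update is

v_t = \gamma \, v_{t-1} + \eta \, \nabla_\theta J(\theta)
\theta \leftarrow \theta - v_t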

Essentially, when using momentum, we push a ball down a hill. The ball
accumulates momentum as it rolls downhill, becoming faster and faster on the

way (until it reaches its terminal velocity if there is air resistance, i.e. γ<1). The

same thing happens to our parameter updates: The momentum term increases
for dimensions whose gradients point in the same directions and reduces
updates for dimensions whose gradients change directions. As a result, we gain
faster convergence and reduced oscillation.
Equations- 4 marks
Explanation- 4 marks
b) Early Stopping is one of the most popular, and also effective, techniques to prevent (6)
overfitting. Use the validation data set to compute the loss function at the end of
each training epoch, and once the loss stops decreasing, stop the training and use
the test data to compute the final classification accuracy. In practice it is more
robust to wait until the validation loss has stopped decreasing for four or five
successive epochs before stopping. The point at which the validation loss starts to
increase is when the model starts to overfit the training data, since from this point
onwards its generalization ability starts to decrease. Early Stopping can be used by
itself or in combination with other Regularization techniques.
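A minimal Python sketch of this procedure, assuming a patience of five epochs as suggested above (the function and parameter names are illustrative, not part of the key):

def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=100, patience=5):
    # train_one_epoch(): runs one pass over the training data
    # validation_loss(): returns the current loss on the validation set
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss < best_loss:
            best_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break   # validation loss has stopped decreasing; stop training
    return best_loss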

Explanation- 6 marks

OR
14 a) Explanation- 7 marks (7)
b) • Can handle sparse gradients on noisy datasets. (7)
• Default hyperparameter values do well on most problems.
• Computationally efficient.
• Requires little memory, thus memory efficient.
• Works well on large datasets.
Advantages- 7 marks
Module III


15 a) (10)

Diagram-5 marks
Explanation- 5 marks
b) Convolutional networks can be used to output a high-dimensional structured object, (4)
rather than just predicting a class label for a classification task or a real value for
regression tasks. E.g., the model might emit a tensor S where S_{i,j,k} is the probability
that pixel (j, k) of the input belongs to class i.

Explanation- 4 marks
OR
16 a) Dilated convolution (8)
Transposed Convolution
Separable convolution
Variants- 8 marks
b) Sparse representation (6)
Equivariance to translation
Parameter sharing
Explanation- 2 marks each
Module IV
17 a) (8)

Diagram- 4 marks
Explanation- 4 marks


b) (6)

Diagram- 4 marks
Explanation- 2 marks
OR
18 a) (9)

Diagram- 5 marks
Explanation- 4 marks
b) Explanation with necessary equations-5 marks (5)
Module V
19 a) Any 2 methods(Word2Vec , GloVe )- 5 marks each (10)
b) Application of deep learning in Speech Recognition- 4 marks (4)

OR
20 a) Merits- 3.5 marks (7)
Demerits- 3.5 marks
b) Boltzmann Machine – 3.5 marks (7)
Deep Belief Network – 3.5 marks


****

