
Shallow vs Deep NNs

DSE 3151 Deep Learning


DSE 3151, B.Tech Data Science & Engineering
August 2023
Rohini R Rao & Abhilash Pai
Department of Data Science and Computer Applications
MIT Manipal
Contents
• Shallow Networks
• Neural Network Representation
• Back Propagation
• Vectorization
• Deep Neural Networks
• Example: Learning XOR
• Architecture Design
• Loss Functions
• Metrics
• Gradient-Based Learning
• Optimization
• Diagnosing Learning Curves
• Strategies for overfitting
• Learning rate scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 2


Deep Learning and Machine Learning

(Source: softwaretestinghelp.com)

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 3


Learning Multiple Components

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 4


Neural Network Examples

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 5


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 6
Applications of Deep Learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 7


Neural Network
• is a set of connected input/output units in which each connection has a weight associated with it
• during the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
• also referred to as connectionist learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 8


A neuron is a computational, logistic unit

Rohini R Rao, Manjunath Hegde 9


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 10
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 11
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 12
Multilayer Feed-Forward Neural Network
• consists of an input layer, one or more hidden layers, and an output layer
• units in the input layer are input units; units in the hidden and output layers are neurodes
• called a feed-forward network since none of the weights cycles back

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 13


Back Propagation Algorithm

Jiawei Han and Micheline Kamber, "Data Mining Concepts and Techniques", 3rd Edition, Morgan Kaufmann
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 14
Back Propagation Algorithm
• learns using a gradient descent method to search for a set of weights that fits the training data so as to minimize the mean squared error
• l is the learning rate, a constant typically having a value between 0.0 and 1.0
  • it helps avoid getting stuck at a local minimum in decision space and encourages finding the global minimum
  • if l is too small, then learning will occur at a very slow pace
  • if l is too large, then oscillation between inadequate solutions may occur
  • a rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far
  • one iteration through the training set is an epoch
• Case updating
  • updating the weights and biases after the presentation of each tuple
• Epoch updating (alternative)
  • the weight and bias increments are accumulated in variables, so that the weights and biases are updated after all the tuples in the training set have been presented
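
As a minimal illustration of case updating (a NumPy sketch, not the textbook's pseudocode; the single sigmoid unit and the variable names are assumptions), the weights and bias are adjusted after every tuple:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_case_updating(X, y, l=0.1, epochs=20):
    """Gradient descent on a single sigmoid unit, updating after each tuple."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = 0.0
    for epoch in range(epochs):
        for xi, yi in zip(X, y):                 # case updating: one tuple at a time
            out = sigmoid(xi @ w + b)
            grad = (out - yi) * out * (1 - out)  # derivative of squared error w.r.t. the net input
            w -= l * grad * xi                   # l is the learning rate
            b -= l * grad
    return w, b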

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 15


Back Propagation Algorithm
• Terminating condition:
• Training stops when
• All changes in wij in the previous epoch are so small as to be below some specified
threshold
• Or The percentage of tuples misclassified in the previous epoch is below some threshold,
• Or A prespecified number of epochs has expired.
• The computational efficiency depends on the
• time spent training the network.
• Given |D| tuples and w weights, each epoch requires O(|D|*w) time.
• In the worst-case scenario, the number of epochs can be exponential in n, the
number of inputs.
• In practice, the time required for the networks to converge is highly variable.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 16


Example

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 17


Example

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 18


Simple AND and Simple OR operations

Course Notes – Deep Learning , Andrew NG

Rohini R Rao, Manjunath Hegde 19


XOR Problem - Perceptron Learning

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 20


XOR Problem
Solving XOR
Network Diagram

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 21


Neural Network Representation

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 22


Neural Network Representation

Course Notes – Deep Learning , Andrew NG


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 23
The Shallow Neural Network
Vectorizing across multiple examples

Course Notes – Deep Learning , Andrew NG


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 24
The Shallow Neural Network
Vectorizing across multiple examples

Course Notes – Deep Learning , Andrew NG


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 25
Deep vs Shallow Network

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 26


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 27
Neural Network learning as optimization
• We cannot calculate the perfect weights for a neural network, since there are too many unknowns
• Instead, the problem of learning is cast as a search or optimization problem
• An algorithm is used to navigate the space of possible sets of weights the model may use in order to make good, or good enough, predictions

• The objective of the optimizer is to minimize the loss function (the error term)

• Loss function
  • gives the difference between the observed value and the predicted value
  • must be
    • continuous
    • differentiable at each point (allows the use of gradient-based optimization)
• To minimize the loss generated by a model, compute
  • the magnitude, i.e., by how much to decrease or increase, and
  • the direction in which to move

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 28


Machine Learning Setup

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 29


Activation Functions
• Activation (or transfer) functions are used to make the network robust.
• Activation functions introduce non-linear properties into the neural network.
• A good activation function has the following properties:
  • Monotonic:
    • should be either entirely non-increasing or non-decreasing
    • if not monotonic, increasing a neuron's weight might cause it to have less influence on reducing the error of the cost function
  • Differentiable:
    • differentiability means the change in y with respect to a change in x is defined everywhere
    • should be differentiable because we want to calculate the change in error with respect to the given weights at the time of gradient descent
  • Quickly converging:
    • should reach its desired value fast

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 30


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 31
Gradient Descent

• Generic Optimization Algorithm


• Starts with random values
• Improves gradually, in an attempt to decrease the loss function
• Continues until the algorithm converges to a minimum
• Learning rate
  • Hyperparameter
  • Indicates the size of the steps

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 32


Gradient Descent & Learning Rate

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 33


Gradient Descent Pitfalls

• If it starts on the left, it can get stuck in a local minimum
• If it starts from the right, it can hit a plateau
• So pick cost functions which are convex functions
  • have no local minimum
  • are continuous functions whose slope does not change abruptly
• Then gradient descent will approach close to the global minimum

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 34


Activation Functions and their derivatives

Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 35
Activation Functions – Sigmoid vs Tanh
• Sigmoid
  • s(x) = 1/(1 + e^(−x)), where e ≈ 2.71 is the base of the natural logarithm
  • used to predict a probability, since its output lies in the range 0 to 1
  • is differentiable: we can find the slope of the sigmoid curve at any point
  • the function is monotonic, but the function's derivative is not
  • can cause a neural network to get stuck during training (saturated outputs give near-zero gradients)
• Tanh
  • range is (−1 to 1)
  • centres the data (mean = 0)
  • is also sigmoidal (s-shaped)
  • advantage: negative inputs are mapped strongly negative
  • disadvantage: if x is very small or very large, the slope (gradient) becomes 0, which slows down gradient descent
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 36
Activation Functions – Sigmoid vs ReLU

• ReLU is half rectified
  • f(z) is 0 when z < 0, and f(z) is equal to z when z >= 0
  • derivative is 1 when z is +ve and 0 when z is −ve
  • the function and its derivative are both monotonic
• An alternative to ReLU is the softplus activation function
  • Softplus(z) = log(1 + exp(z)); close to 0 when z is −ve and close to z when z is +ve
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 37
Activation Functions- ReLU vs Leaky ReLU

• The leak helps to increase the range of the ReLU function.


• Usually, the value of α is 0.01.
• When α is not fixed at 0.01 (it is sampled randomly), it is called Randomized ReLU.
• The range of the Leaky ReLU is (−infinity to infinity).
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 38
Activation Functions - SoftMax
• Usually the last activation function
• Used to normalize the output of a network into a probability distribution over the predicted output classes
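
For reference, a NumPy sketch (not from the slides) of the activation functions discussed in this section:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))        # output in (0, 1)

def tanh(x):
    return np.tanh(x)                      # output in (-1, 1), zero-centred

def relu(z):
    return np.maximum(0.0, z)              # 0 for z < 0, z for z >= 0

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small slope alpha for z < 0

def softplus(z):
    return np.log1p(np.exp(z))             # smooth alternative to ReLU

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract max for numerical stability
    return e / e.sum()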

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 39


Activation Function Cheat Sheet

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 40


https://patrickhoo.wixsite.com/diveindatascience/single-post/2019/06/13/activation-functions-and-when-to-use-them
Output Functions

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 41


Regression problems - Output Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 42


Regression problems - Loss Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 43


Classification problem- Output Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 44


Classification Problems- Loss Function

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 45


Typical MLP architecture

Regression Classification

Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 46
Regression Loss Functions
• Mean Squared Error (MSE)
• values with a large error are penalized.
• is a convex function with a clearly defined
global minimum
• Can be used in gradient descent
optimization to set the weight values
• Very sensitive to outliers , will significantly
increase the loss.
• Mean Absolute Error (MAE)
  • used in cases when the training data has a large number of outliers
  • however, its gradient does not shrink as the average distance approaches 0, so gradient descent optimization may struggle to settle at the minimum
• Huber Loss
• Based on absolute difference between the
actual and predicted value and threshold
value, 𝛿
• Is quadratic when error is smaller than 𝛿
but linear when error is larger than 𝛿
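
A NumPy sketch of the three regression losses (a minimal illustration, with delta as the Huber threshold 𝛿):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                       # used when |err| <= delta
    linear = delta * (np.abs(err) - 0.5 * delta)     # used when |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))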
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 47
Classification Loss Functions
Cross entropy measures the difference between two probability distributions
• Binary Cross-Entropy / Log Loss
  • compares the actual value (0 or 1) with the predicted probability that the input belongs to that category
  • p(i) = probability that the category is 1
  • 1 − p(i) = probability that the category is 0
• Categorical Cross-Entropy Loss
• In cases where the number of
classes is greater than two
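
A minimal NumPy sketch of the two cross-entropy losses (assuming y_pred already holds predicted probabilities):

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    p = np.clip(y_pred, eps, 1 - eps)                  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, y_pred, eps=1e-12):
    p = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))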

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 48


Keras Metrics
• model.compile(..., metrics=['mse'])
• Metric values are recorded at the end of each epoch on the training
dataset.
• If a validation dataset is also provided, then the metric is also calculated for the validation dataset.
• All metrics are reported in verbose output and in the history object
returned from calling the fit() function.
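
A hedged end-to-end sketch using the standard Keras API (the random data and layer sizes are placeholders):

import numpy as np
from tensorflow import keras

X_train, y_train = np.random.rand(100, 10), np.random.rand(100)
X_val, y_val = np.random.rand(20, 10), np.random.rand(20)

model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(10,)),
    keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse', metrics=['mse', 'mae'])

history = model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)
print(history.history['mse'])        # metric per epoch on the training set
print(history.history['val_mse'])    # metric per epoch on the validation set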

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 49



Keras Metrics
• Accuracy metrics
• Accuracy
• Calculates how often predictions equal labels.
• Binary Accuracy
• Calculates how often predictions match binary labels.
• Categorical Accuracy
• Calculates how often predictions match one-hot labels
• Sparse Categorical Accuracy
• Calculates how often predictions match integer labels.
• TopK Categorical Accuracy
• calculates the percentage of records for which the targets are in the
top K predictions
• rank the predictions in the descending order of probability values.
• If the rank of the yPred is less than or equal to K, it is considered
accurate.
• Sparse TopK Categorical Accuracy class
• Computes how often integer targets are in the top K predictions.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 50


Keras Metrics
• Regression metrics
• Mean Squared Error
• Computes the mean squared error between y_true and y_pred
• Root Mean Squared Error
• Computes root mean SE metric between y_true and y_pred
• Mean Absolute Error
• Computes the mean absolute error between the labels and
predictions
• Mean Absolute Percentage Error
• MAPE = (1/n) * Σ(|actual – prediction| / |actual|) * 100
• Average difference between the predicted and the actual in %
• Mean Squared Logarithmic Error
• measure of the ratio between the true and predicted values.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 51


Keras Metrics
• Classification metrics
  • AUC
  • Precision
  • Recall
  • TruePositives, TrueNegatives, FalsePositives, FalseNegatives
  • PrecisionAtRecall
    • Computes the best precision where recall is >= a specified value
  • SensitivityAtSpecificity
    • Computes the best sensitivity where specificity is >= a specified value
  • SpecificityAtSensitivity
    • Computes the best specificity where sensitivity is >= a specified value
• Probabilistic metrics
  • BinaryCrossentropy
  • CategoricalCrossentropy
  • SparseCategoricalCrossentropy
  • KLDivergence
    • a measure of how different two probability distributions are from each other
  • Poisson
    • used if the dataset comes from a Poisson distribution

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 52


Gradient based learning
• Gradient Based Learning
• seeks to change the weights so
that the next evaluation reduces
the error
• is navigating down the gradient (or
slope) of error
• The process repeats until the
global minimum is reached.
• works well for convex functions
• It is expensive to calculate the
gradients if the size of the data
is huge.
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 53
The Vanishing/Exploding Gradient Problems
• Algorithm computes the gradient of the cost function with regard to each parameter in
the network
• Problems include
• Vanishing Gradients
• Gradients become very small as algorithm progresses down to lower layers
• So connection weights remain unchanged and training never converges to a solution
• Exploding Gradients
• Gradients become so large that layers get huge weight updates and the algorithm diverges
• Deep Neural Networks suffer from unstable gradients; different layers may learn at different speeds.
• Reasons for unstable gradients – Glorot & Bengio (2010)
  • the combination of sigmoid activation and weight initialization (normal distribution with mean 0 and std dev 1)
  • the variance of the outputs of each layer is much greater than the variance of its inputs
  • the variance keeps increasing after each layer
  • until the activation function saturates (at 0 or 1, with derivative close to 0) at the top layers

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 54


Solutions include : 1. Weight Initialization
• The variance of the outputs of each layer has to be equal to the
variance of its inputs
• Xavier's Initialization
  • the weight matrix W of a particular layer l is picked randomly from a normal distribution with
    • mean μ = 0
    • variance σ² = 1 / (number of neurons in layer l−1)
  • the bias b of all layers is initialized with 0

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 55


Solutions Include : Weight Initialization
1. Xavier's or Glorot Initialization
• the variance of the outputs of each layer should be equal to the variance of its inputs
• the gradients should have equal variance before and after flowing through a layer in the reverse direction
• Fan-in – number of inputs to the layer
• Fan-out – number of neurons (outputs) of the layer
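
In Keras, Glorot initialization is the default for Dense layers; it can also be requested explicitly (a sketch, with illustrative layer sizes):

from tensorflow import keras

layer = keras.layers.Dense(
    100,
    activation='tanh',
    kernel_initializer='glorot_normal',  # Xavier/Glorot: variance based on fan-in and fan-out
    bias_initializer='zeros')            # bias b initialized with 0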

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 56


Solutions include
2. Using Non-saturating Activation Functions
  1. ReLU does not saturate for +ve values
  2. But it suffers from the problem of dying ReLUs: neurons output only 0
     • this happens when the weighted sum of a neuron's inputs is negative for all instances in the training set
  3. Use Leaky ReLU instead (neurons only go into a coma, they don't die, and may wake up)
     • α = 0.01, sometimes 0.2
  4. Flavors include
     • Randomized Leaky ReLU (RReLU): α is picked randomly in a given range during training and is fixed to an average value during testing
     • Parametric Leaky ReLU (PReLU): α is allowed to be learned during training as a parameter
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 57
SELU
• Scaled Exponential Linear Units
• Induce self-normalization
• the output of each layer will tend to
preserve mean 0 and standard deviation
1 during training
• f(x) = λx if x>= 0
• f(x) = λα(exp(x)-1) if x < 0
• α = 1.6733 , λ= 1.0507
• conditions for self-normalization to
happen:
• The input features must be standardized
(mean 0 and standard deviation 1).
• Every hidden layer’s weights must also be
initialized using normal initialization.
• The network’s architecture must be
sequential.
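
A hedged Keras sketch of SELU layers that satisfy these conditions (sequential architecture, LeCun normal initialization; standardization of the input features is assumed to be done beforehand):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation='selu', kernel_initializer='lecun_normal',
                       input_shape=(20,)),   # inputs assumed standardized (mean 0, std 1)
    keras.layers.Dense(64, activation='selu', kernel_initializer='lecun_normal'),
    keras.layers.Dense(10, activation='softmax')
])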

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 58


Solutions Include
3. Batch Normalization -Ioffe and Szegedy (2015)
• designed to solve the vanishing/exploding gradients problems, is also a good regularizer
• BN layer performs the standardizing and normalizing operations on the input of a layer coming
from a previous layer.
• Normalization
• brings the numerical data to a common scale without distorting its shape.
• (mean = 0, std dev= 1)
• BN adds extra operations in the model, before the activation
  • the operation zero-centres and normalizes each input
  • it then scales and shifts the result using two new parameter vectors per layer
• Each BN layer learns 4 parameter vectors
  • output scale vector
  • output offset vector
  • input mean vector
  • input standard deviation vector
• To zero-centre and normalize the inputs, the mean and standard deviation of the input need to be computed
  • the current mini-batch is used to evaluate the mean and standard deviation
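
A minimal Keras sketch that places a BN layer before the activation, as described above (layer sizes are illustrative):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, use_bias=False, input_shape=(20,)),  # BN's offset replaces the bias
    keras.layers.BatchNormalization(),   # zero-centres/normalizes, then scales and shifts
    keras.layers.Activation('relu'),
    keras.layers.Dense(10, activation='softmax')
])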
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 59
Solutions Include
3. Batch Normalization -Ioffe and Szegedy (2015)

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 60


Solutions include
4. Gradient Clipping
• Clip the gradients during back propagation so that they never exceed some threshold
• For example, with a threshold of 0.1, all the partial derivatives of the loss will be clipped to lie between −0.1 and 0.1
• The threshold can also be a hyperparameter to tune (see the Keras sketch after this list)
5. Reusing Pretrained Layers
• Transfer Learning
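
Returning to gradient clipping, Keras optimizers accept a clipping threshold directly (a sketch; the 0.1 value mirrors the example above):

from tensorflow import keras

# clip each partial derivative of the loss to lie between -0.1 and 0.1
opt_by_value = keras.optimizers.SGD(clipvalue=0.1)

# alternatively, clip the whole gradient vector if its L2 norm exceeds 1.0
opt_by_norm = keras.optimizers.SGD(clipnorm=1.0)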

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 61


6. Faster Optimizers
• Optimizer
• is a function or an algorithm that modifies the attributes of the neural network, such
as weights and learning rate.
• Terminology
• Weights/Bias – the learnable parameters in a model that control the signal between two neurons.
• Epoch – the number of times the algorithm runs over the whole training dataset.
• Sample – a single row of a dataset.
• Batch – denotes the number of samples used for updating the model parameters.
• Learning rate – defines the scale of how much the model weights should be updated.
• Cost Function/Loss Function – used to calculate the cost, that is, the difference between the predicted value and the actual value.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 62


Mini Batch Gradient Descent Deep Learning
Optimizer
• Batch gradient descent:
• gradient is average of gradients computed from ALL the samples in dataset
• Mini Batch GD:
• subset of the dataset is used for calculating the loss function, therefore fewer
iterations are needed.
• batch size of 32 is considered to be appropriate for almost every case.
• Yann Lecun (2018) – “Friends don’t let friends use mini batches larger than 32”
• is faster, more efficient and robust than the earlier variants of gradient descent.
• the cost function is noisier than in batch GD but smoother than in SGD.
• Provides a good balance between speed and accuracy.
• It needs a hyperparameter that is “mini-batch-size”, which needs to be
tuned to achieve the required accuracy.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 63


Stochastic GD & SGD with Momentum DL
• "stochastic" refers to the randomness on which the algorithm is based.
• Instead of taking the whole dataset for each
iteration, randomly select the batches of data
• The path taken is full of noise as compared to
the gradient descent algorithm.
• Uses a higher number of iterations to reach
the local minima, thereby the overall
computation time increases.
• The computation cost is still less than that of
the gradient descent optimizer.
• If the data is enormous and computational
time is an essential factor, SGD should be
preferred over batch gradient descent
algorithm.
• Stochastic Gradient Descent with
Momentum Deep Learning Optimizer
• momentum helps in faster
convergence of the loss function.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 64


SGD with Momentum Optimizer
• GD takes small , regular steps down
the slope so algorithm takes more
time to reach the bottom
• adding a fraction of the previous
update to the current update will
make the process a bit faster.
• Hyperparameter β (momentum)
  • simulates a friction mechanism and prevents the momentum from becoming too large
  • set between 0 (high friction) and 1 (low friction); typically 0.9
• Also rolls past local minima
• The learning rate should be decreased when a high momentum term is used.

Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 65
SGD with Nesterov Momentum Optimization
• Yurii Nesterov in 1983
• to measure the gradient of the cost
function not at the local position but
slightly ahead in the direction of the
momentum
• the momentum vector will be
pointing in the right direction (i.e.,
toward the optimum)
• it will be slightly more accurate to
use the gradient measured a bit
farther in that direction rather than
using the gradient at the original
position
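
Both variants are available through the Keras SGD optimizer (a sketch with illustrative learning rates):

from tensorflow import keras

sgd_momentum = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)   # β = 0.9
sgd_nesterov = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                    nesterov=True)   # gradient measured slightly ahead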

Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 66
Adagrad (Adaptive Gradient Descent) Deep
Learning Optimizer
• Adaptive Learning Rate
• Scaling down the gradient vector along the
steepest dimension
• If the cost function is steep along the ith
dimension, then s will get larger and larger at each
iteration
• No need to modify the learning rate manually
• more reliable than gradient descent algorithms,
and it reaches convergence at a higher speed.

• Disadvantage
• it decreases the learning rate aggressively and
monotonically.
• Due to small learning rates, the model eventually
becomes unable to acquire more knowledge, and
hence the accuracy of the model is compromised.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 67


RMS Prop(Root Mean Square) Deep Learning
Optimizer
• The problem with the gradients some are small while others may be huge
• Defining a single learning rate might not be the best idea.
• accumulating only the gradients from the most recent iterations (as
opposed to all the gradients since the beginning of training).

• β = 0.9 (decay rate of the moving average)

• Works better than Adagrad; it was the preferred optimizer until Adam appeared

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 68


Adam Deep Learning Optimizer
• is derived from adaptive moment estimation.
• inherit the features of both Adagrad and RMS
prop algorithms.
• like Momentum optimization keeps track of
an exponentially decaying average of past
gradients,
• like RMSProp it keeps track of an
exponentially decaying average of past
squared gradients
• β1 and β2 represent the decay rate of the
average of the gradients.
• Typical settings: β1 = 0.9, β2 = 0.999, smoothing term ϵ = 10⁻⁷ (t represents the iteration)
• Advantages
  • is straightforward to implement
  • faster running time
  • low memory requirements, and requires less tuning
• Disadvantages
  • focuses on faster computation time, whereas SGD focuses on the data points; therefore SGD may generalize the data better at the cost of lower computation speed
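
A Keras sketch using the default decay rates listed above:

from tensorflow import keras

adam = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7)
# model.compile(optimizer=adam, loss='categorical_crossentropy', metrics=['accuracy'])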

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 69


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 70


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 71


Bias

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 72


Variance

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 73


Mean Square Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 74


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 75


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 76


Learning Curves
• Line plot of learning (y-axis) over experience (x-axis)
• The metric used to evaluate learning could be
• Optimization Learning Curves:
• calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
• Minimizing, such as loss or error
• Performance Learning Curves:
• calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
• Maximizing metric , such as classification accuracy
• Train Learning Curve:
• calculated from the training dataset that gives an idea of how well the model is learning.
• Validation Learning Curve:
• calculated from a hold-out validation dataset that gives an idea of how well the model is
generalizing.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 77


Underfit Learning Curves

A plot of learning curves shows underfitting if:


• The training loss remains flat regardless of training, or
• The training loss continues to decrease until the end of training (the model was still improving when training stopped).
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 78
Overfitting Curves
• Overfitting
• Model specialized on training data, it is not
able to generalize to new data
• Results in increase in generalization error.
• generalization error can be measured by
the performance of the model on the
validation dataset.
• A plot of learning curves shows
overfitting if:
• The plot of training loss continues to
decrease with experience.
• The plot of validation loss decreases to a
point and begins increasing again.
• The inflection point in validation loss may
be the point at which training could be
halted as experience after that point shows
the dynamics of overfitting.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 79


Good Fit Learning Curves
• A good fit is the goal of the learning
algorithm and exists between an overfit
and underfit model.
• A plot of learning curves shows a good fit
if:
• Plot of training loss decreases to a point of
stability
• Plot of validation loss decreases to a point
of stability and has a small gap with the
training loss.
• Loss of the model will almost always be
lower on the training than the validation
dataset.
• We should expect some gap between the
train and validation loss learning curves.
• This gap is referred to as the
“generalization gap.”
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 80
Avoiding Overfitting
• With so many parameters, DNN can fit
complex datasets.
• But also prone to overfitting the
training set.
• Early Stopping
• stop training as soon as the validation
error reaches a minimum
• With Stochastic and Mini-batch Gradient
Descent, the curves are not so smooth,
and it may be hard to know whether you
have reached the minimum or not.
• Stop only after the validation error has
been above the minimum for some time
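
In Keras, this strategy maps to the EarlyStopping callback (a sketch; patience implements "stop only after the validation error has been above the minimum for some time"):

from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,                  # wait 10 epochs past the best epoch before stopping
    restore_best_weights=True)    # roll back to the weights at the validation-error minimum
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stopping])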

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 81


Avoiding overfitting
• Dropout
• Proposed by Geoffrey Hinton in 2012
• at every training step, every neuron has a
probability p of being temporarily
“dropped out,”
• it will be entirely ignored during this
training step, but it may be active during
the next step
• p is called the dropout rate, usually 50%.
• After training, neurons don’t get dropped
• If p = 50%
• during testing a neuron will be connected
to twice as many input neurons as it was
(on average) during training.
• multiply each input connection weight by
the keep probability (1 – p) after training
• Alternatively, divide each neuron’s output
by the keep probability during training
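
A minimal Keras sketch (Keras applies the keep-probability rescaling automatically during training, so no manual weight scaling is needed after training):

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(20,)),
    keras.layers.Dropout(rate=0.5),   # p = 50%: each unit dropped with probability 0.5 per training step
    keras.layers.Dense(10, activation='softmax')
])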

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 82


Avoiding Overfitting using Regularization
• L1, L2 regularization
  • Regularization can be used to constrain the NN weights
  • Lasso (l1) – least absolute shrinkage and selection operator
    • adds the "absolute value of magnitude" of the coefficients as a penalty term to the loss function
  • Ridge regression (l2)
    • adds the "squared magnitude" of the coefficients as the penalty term to the loss function
  • Use the l1(), l2(), l1_l2() functions, which return a regularizer that computes the regularization loss at each step during training
  • The regularization loss is added to the final loss
• Max-Norm Regularization
  • constrains the weights w of the incoming connections so that ∥w∥2 ≤ r

  • r is the max-norm hyperparameter and ∥ · ∥2 is the ℓ2 norm


• Reducing r increases the amount of regularization and helps reduce overfitting
• Can also help alleviate the unstable gradients problem if we are not using Batch normalization
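
A Keras sketch combining an l2 penalty with a max-norm constraint on one Dense layer (hyperparameter values are illustrative):

from tensorflow import keras

layer = keras.layers.Dense(
    100, activation='relu',
    kernel_regularizer=keras.regularizers.l2(0.01),      # adds 0.01 * sum(w^2) to the final loss
    kernel_constraint=keras.constraints.max_norm(1.0))   # rescales w whenever ||w||2 exceeds r = 1.0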

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 83


Constant learning rate not ideal
• Better to start with a high learning
rate
• then reduce it once it stops
making fast progress
• can reach a good solution faster
• Learning Schedule strategies can
be applied
• Power Scheduling
• Exponential Scheduling
• Piecewise Constant Scheduling
• Performance Scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 84


Learning Rate Scheduling
• Power scheduling
  • The learning rate is set as a function of the iteration number t: η(t) = η0 / (1 + t/s)^c
  • Hyperparameters: the initial learning rate η0, the power c (typically set to 1), and the steps s
  • The learning rate drops at each step; after s steps it is down to η0/2, after 2s steps to η0/3, and so on
  • The schedule first drops quickly, then more and more slowly
  • optimizer = keras.optimizers.SGD(lr=0.01, decay=1e-4)
  • The decay is the inverse of s (the number of steps it takes to divide the learning rate by one more unit); Keras assumes that c is equal to 1
• Exponential scheduling (see the sketch after this list)
  • Set the learning rate to: η(t) = η0 · 0.1^(t/s)
  • The learning rate will gradually drop by a factor of 10 every s steps
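
Exponential scheduling can be implemented with a LearningRateScheduler callback (a sketch assuming η0 = 0.01 and s = 20 epochs):

from tensorflow import keras

def exponential_decay(epoch):
    eta0, s = 0.01, 20
    return eta0 * 0.1 ** (epoch / s)   # divides the learning rate by 10 every s epochs

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay)
# model.fit(X_train, y_train, epochs=50, callbacks=[lr_scheduler])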

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 85


Learning Rate Scheduling
• Piecewise constant scheduling
• Constant learning rate for a number of epochs
• e.g., η0 = 0.1 for 5 epochs
• then a smaller learning rate for another number of epochs
• e.g., η1 = 0.001 for 50 epochs and so on
• Performance scheduling
• Measure the validation error every N steps (just like for early stopping)
• reduce the learning rate by a factor of λ when the error stops dropping
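
Performance scheduling corresponds to the ReduceLROnPlateau callback in Keras (a sketch with illustrative values):

from tensorflow import keras

perf_scheduler = keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,     # multiply the learning rate by λ = 0.5 ...
    patience=5)     # ... when val_loss has stopped dropping for 5 epochs
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[perf_scheduler])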

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 86


References
• Ian Goodfellow, Yoshua Bengio and Aaron Courville, "Deep Learning", MIT Press, 2016
• Swayam NPTEL Notes – Deep Learning, Mitesh Khapra
• Course Notes – Neural Networks and Deep Learning, Andrew Ng
• Aurelien Geron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
• Jiawei Han and Micheline Kamber, "Data Mining Concepts and Techniques", 3rd Edition, Morgan Kaufmann

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 87


DSE 3151 DEEP LEARNING
Convolutional Neural Networks

Dr. Rohini Rao & Dr. Abhilash K Pai


Dept. of Data Science and Computer Applications
MIT Manipal
The Convolution Operation - 1D
▪ Convolution is a linear operation on two functions of a real-valued argument, where one function is slid over the other and the element-wise products are summed at each position.

▪ Example: Consider a discrete signal ‘xt’ which represents the position of a spaceship at time ‘t’
recorded by a laser sensor.

▪ Now, suppose that this sensor is noisy.

▪ To obtain a less noisy measurement we would like to average several measurements.

▪ Considering that the most recent measurements are more important, we would like to take a weighted average over 'xt'. The new estimate at time 't' is computed as the convolution:

    s_t = Σ_{a=0 to ∞} x_{t−a} w_{−a} = (x ∗ w)_t

  where x is the input and w is the filter/mask/kernel.
▪ Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.
For example: s_t = Σ_{a=0 to 6} x_{t−a} w_{−a}

▪ We just slide the filter over the input and compute the value of st based on a window around xt

w-6 w-5 w-4 w-3 w-2 w-1 w0


w 0.01 0.01 0.02 0.02 0.04 0.4 0.5

* * * * * * *
x 1.0 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20

s 1.80

Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.
For example: s_t = Σ_{a=0 to 6} x_{t−a} w_{−a}

▪ We just slide the filter over the input and compute the value of st based on a window around xt

w-6 w-5 w-4 w-3 w-2 w-1 w0


w 0.01 0.01 0.02 0.02 0.04 0.4 0.5

* * * * * * *
x 1.0 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20

s 1.80 1.96

Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.
For example: s_t = Σ_{a=0 to 6} x_{t−a} w_{−a}

▪ We just slide the filter over the input and compute the value of st based on a window around xt

w-6 w-5 w-4 w-3 w-2 w-1 w0


w 0.01 0.01 0.02 0.02 0.04 0.4 0.5

* * * * * * *
x 1.0 1.10 1.20 1.40 1.70 1.80 1.90 2.10 2.20

s 1.80 1.96 2.11

▪ Use cases of 1-D convolution : Audio signal processing, stock market analysis, time series analysis etc.
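
A NumPy sketch of this windowed 1-D convolution, using the weights and signal from the example above (the index convention follows the slide: s_t depends on x_t and the previous six samples):

import numpy as np

w = np.array([0.01, 0.01, 0.02, 0.02, 0.04, 0.4, 0.5])   # w_{-6} ... w_0
x = np.array([1.0, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20])

# s_t = sum_{a=0..6} x_{t-a} * w_{-a}; defined once a full window is available
s = np.array([np.dot(x[t - 6:t + 1], w) for t in range(6, len(x))])
print(np.round(s, 2))   # [1.81 1.97 2.11] (the slide rounds the first two to 1.80 and 1.96)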
Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Convolution in 2-D using Images : What is an Image?

What we see

What a computer sees


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Convolution in 2-D using Images : What is an Image?

▪ An image can be represented mathematically as a function f(x,y) which gives the intensity value at
position (x,y), where, f(x,y) ε {0,1,….,Imax-1} and x,y ε {0,1,…..,N-1}.

▪ The larger the value of N, the greater the clarity of the picture (higher resolution), but the more data there is to be analyzed in the image.

▪ If the image is a Gray-scale (8-bit per pixel) image, then it requires N2 Bytes for storage.

▪ If the image is color - RGB, each pixel requires 3 Bytes of storage space.

N is the resolution of the image and Imax is the level of discretized brightness value.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Convolution in 2-D using Images : What is an Image?

Digital camera

[Source: D. Hoiem]
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Convolution in 2-D using Images : What is an Image?

▪ Sample the 2-D space on a regular grid.


▪ Quantize each sample, i.e., the photons arriving at each active cell are
integrated and then digitized.

[Source: D. Hoiem]

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Convolution in 2-D using Images : What is an Image?

▪ A grid (matrix) of intensity values.

255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 0 0 255 255 255 255 255 255 255
255 255 255 75 75 75 255 255 255 255 255 255
255 255 75 95 95 75 255 255 255 255 255 255
255 255 96 127 145 175 255 255 255 255 255 255
255 255 127 145 175 175 175 255 95 255 255 255
255 255 127 145 200 200 175 175 95 255 255 255
255 255 127 145 145 175 127 127 95 47 255 255
255 255 127 145 145 175 127 127 95 47 255 255
255 255 74 127 127 127 95 95 95 47 255 255
255 255 255 74 74 74 74 74 74 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255
255 255 255 255 255 255 255 255 255 255 255 255

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Convolution in 2-D using Images : What is an Image?

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
The Convolution Operation - 2D
▪ Images are good examples of 2-D inputs.

▪ A 2-D convolution of an Image ‘I’ using a filter ‘K’ of size ‘m x n’ is now defined as (looking at previous pixels):

    S_ij = (I ∗ K)_ij = Σ_{a=0 to m−1} Σ_{b=0 to n−1} I_{i−a, j−b} K_{a,b}

▪ In practice, one way is to look at the succeeding pixels:

    S_ij = (I ∗ K)_ij = Σ_{a=0 to m−1} Σ_{b=0 to n−1} I_{i+a, j+b} K_{a,b}
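
A NumPy sketch of the "succeeding pixels" form (valid positions only, stride 1):

import numpy as np

def conv2d(I, K):
    m, n = K.shape
    H, W = I.shape
    S = np.zeros((H - m + 1, W - n + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # S_ij = sum over a, b of I[i+a, j+b] * K[a, b]
            S[i, j] = np.sum(I[i:i + m, j:j + n] * K)
    return S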

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
The Convolution Operation - 2D
▪ Another way is to consider center pixel as reference pixel, and then look at its surrounding pixels:
    S_ij = (I ∗ K)_ij = Σ_{a=−⌊m/2⌋ to ⌊m/2⌋} Σ_{b=−⌊n/2⌋ to ⌊n/2⌋} I_{i−a, j−b} K_{⌊m/2⌋+a, ⌊n/2⌋+b}

Pixel of interest

0 1 0 0 1
0 0 1 1 0
1 0 0 0 1
0 1 0 0 1
0 0 1 0 1
Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
The Convolution Operation - 2D

Source: https://developers.google.com/

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
The Convolution Operation - 2D

Input Image

Source: https://developers.google.com/

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
The Convolution Operation - 2D

Input Image

Source: https://developers.google.com/

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
The Convolution Operation - 2D

Smoothening Filter

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
The Convolution Operation - 2D

Sharpening Filter

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
The Convolution Operation - 2D

Filter for edge


detection

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
The Convolution Operation – 2D : Various filters (edge detection)
Prewitt

-1 0 1 1 1 1
-1 0 1 0 0 0
-1 0 1 -1 -1 -1

Sx Sy After applying
Horizontal edge
detection filter
Sobel

-1 0 1 1 2 1
-2 0 2 0 0 0
-1 0 1 -1 -2 -1
Sx Sy Input image After applying
Vertical edge
Laplacian Roberts detection filter

0 1 0 0 1 1 0
1 -4 1 -1 0 0 -1

0 1 0 Sx Sy
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
The Convolution Operation - 2D

1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
stride=1

1 0 0 0 0 1 Dot
product
0 1 0 0 1 0 3 -1
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Input image: 6 x 6
Note: Stride is the number of “unit” the kernel is shifted per slide over rows/ columns
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
The Convolution Operation - 2D

1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
If stride=2

1 0 0 0 0 1
0 1 0 0 1 0 3 -3
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0
Input image: 6 x 6
Note: Stride is the number of “unit” the kernel is shifted per slide over rows/ columns
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
The Convolution Operation - 2D

1 -1 -1
-1 1 -1 Filter 1
-1 -1 1
stride=1

1 0 0 0 0 1
3 -1 -3 -1
0 1 0 0 1 0
0 0 1 1 0 0 -3 1 0 -3
1 0 0 0 1 0 4 x 4 Feature Map
0 1 0 0 1 0 -3 -3 0 1
0 0 1 0 1 0
3 -2 -2 -1
Input image: 6 x 6

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
The Convolution Operation - 2D
-1 1 -1
-1 1 -1 Filter 2
-1 1 -1
stride=1

Repeat for each filter! Convolving the same 6 x 6 input image with Filter 2 (stride = 1) gives a second 4 x 4 feature map; the two 4 x 4 feature maps together form a 4 x 4 x 2 output.

Feature map (Filter 1):    Feature map (Filter 2):
 3 -1 -3 -1                -1 -1 -1 -1
-3  1  0 -3                -1 -1 -2  1
-3 -3  0  1                -1 -1 -2  1
 3 -2 -2 -1                -1  0 -4  3

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
The Convolution Operation –RGB Images

R G B

Apply the filter to R, G, and B channels of


the image and combine the resultant
feature maps to obtain a 2-D feature map.

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
The Convolution Operation –RGB Images multiple filters
(Illustration: Filter 1, Filter 2, …, Filter K, each of size 3 x 3 x 3, convolved with a 6 x 6 x 3 RGB input.)

• K filters = K feature maps
• Depth of feature map = No. of feature maps = No. of filters

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
The Convolution Operation : Terminologies

(Illustration: a 6 x 6 x 3 input image convolved with a 3 x 3 x 3 filter.)

1. Depth of an Input Image = No. of channels in the Input Image = Depth of a filter

2. Assuming square filters, Spatial Extent (F) of a filter is the size of the filter

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
The Convolution Operation : Zero Padding

(Illustration: a 4 x 4 input convolved with a 3 x 3 filter gives only a 2 x 2 feature map.)

Pad zeros and then convolve to obtain a feature map with dimension = input image dimension.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
The Convolution Operation : Zero Padding

Feature map size: 5x5

Input image size: 5x5

Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Convolutional Neural Network (CNN) : At a glance

Convolution → Pooling → Convolution → Pooling (can repeat many times) → Flattened → Fully Connected Feedforward network → cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
Pooling

1 -1 -1 -1 1 -1
-1 1 -1 Filter 1 -1 1 -1 Filter 2
-1 -1 1 -1 1 -1
• Max Pooling

3 -1 -3 -1 -1 -1 -1 -1 • Average Pooling

-3 1 0 -3 -1 -1 -2 1

-3 -3 0 1 -1 -1 -2 1

3 -2 -2 -1 -1 0 -4 3
Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Pooling

Max. Pooling Average Pooling

Stride ?

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Why Pooling ?

▪ Subsampling pixels will not change the object

bird
bird

Subsampling

▪ We can subsample the pixels to make image smaller

▪ Therefore, fewer parameters to characterize the image


Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Relation between i/p size, feature map size, filter size
Given an input image of size W1 x H1 x D1, filters of spatial extent F, stride length S, number of filters K, and padding P, the output feature map has size W2 x H2 x D2 where:

    W2 = (W1 − F + 2P)/S + 1
    H2 = (H1 − F + 2P)/S + 1
    D2 = K
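
A small helper expressing these relations (a sketch; integer division assumes the sizes divide evenly):

def conv_output_size(W1, H1, F, S, P, K):
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                      # depth of the output = number of filters
    return W2, H2, D2

print(conv_output_size(W1=6, H1=6, F=3, S=1, P=0, K=2))   # (4, 4, 2), as in the earlier 6 x 6 example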

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Important properties of CNN

▪ Sparse Connectivity

▪ Shared weights

▪ Equivariant representation

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Properties of CNN
(Illustration: a 3 x 3 filter (Filter 1) slides over a 6 x 6 image; each output value, e.g. 3 or −1, is computed from a 3 x 3 window of inputs.)

Fewer parameters! Each output unit connects to only 9 inputs, not fully connected (Sparse Connectivity).
Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Properties of CNN

Is sparse connectivity good?

Ian Goodfellow et al., 2016

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Properties of CNN
(Illustration: the same 3 x 3 filter (Filter 1) weights are reused at every position of the 6 x 6 image.)

Even fewer parameters! Shared weights.
Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Equivariance to translation

▪ A function f is equivariant to a function g if f(g(x)) = g(f(x)) or if the output changes in the same way as the
input.

▪ This is achieved by the concept of weight sharing.

▪ As the same weights are shared across the images, hence if an object occurs in any image, it will be detected
irrespective of its position in the image.

Source: Translational Invariance Vs Translational Equivariance | by Divyanshu Mishra | Towards Data Science

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
CNN vs Fully Connected NN

▪ A CNN compresses the fully connected NN in two ways:

▪ Reducing the number of connections

▪ Shared weights

▪ Max pooling further reduces the parameters to characterize an image.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Convolutional Neural Network (CNN) : Non-linearity with activation

Convolution + ReLU → Pooling → Convolution + ReLU → Pooling → Flattened → Fully Connected Feedforward network → cat | dog

Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
LeNet-5 Architecture for handwritten text recognition

Layer settings and parameter counts:
• C1 convolution (S=1, F=5, K=6, P=2): #Param = ((5*5*1)+1) * 6 = 156
• S2 pooling (S=2, F=2, K=6, P=0): #Param = 0
• C3 convolution (S=1, F=5, K=16, P=0): #Param = ((5*5*6)+1) * 16 = 2416
• S4 pooling (S=2, F=2, K=16, P=0): #Param = 0
• C5 fully connected (120 units): #Param = (5*5*16)*120 + 120 = 48120
• F6 fully connected (84 units): #Param = 84*120 + 84 = 10164
• Output layer (10 units): #Param = 84*10 + 10 = 850
Activations: tanh in the intermediate layers, sigmoid at the output.

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278–2324.
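
A hedged Keras sketch of the LeNet-5 layer structure summarized above (average pooling and tanh as in the original; the padded 28 x 28 input and the softmax output are simplifications of LeCun's exact formulation):

from tensorflow import keras

lenet5 = keras.Sequential([
    keras.layers.Conv2D(6, kernel_size=5, strides=1, padding='same', activation='tanh',
                        input_shape=(28, 28, 1)),            # C1: F=5, K=6, P=2
    keras.layers.AveragePooling2D(pool_size=2, strides=2),   # S2: F=2, S=2
    keras.layers.Conv2D(16, kernel_size=5, strides=1, padding='valid', activation='tanh'),  # C3
    keras.layers.AveragePooling2D(pool_size=2, strides=2),   # S4
    keras.layers.Flatten(),
    keras.layers.Dense(120, activation='tanh'),              # C5 (as a dense layer)
    keras.layers.Dense(84, activation='tanh'),               # F6
    keras.layers.Dense(10, activation='softmax')             # output (softmax instead of the original RBF layer)
])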

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
LeNet-5 Architecture for handwritten number recognition

Source: http://yann.lecun.com/

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
ImageNet Dataset

More than 14 million images. 22,000 Image categories

Deng, Jia, et al. "Imagenet: A large-scale hierarchical image database."


IEEE conference on computer vision and pattern recognition. IEEE, 2009.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
ImageNet Large Scale Visual Recognition Challenge
• 1000 ImageNet Categories

ZFNet

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 45
AlexNet (2012)

▪ Used ReLU activation function instead of


sigmoid and tanh.

▪ Used data augmentation techniques


that consisted of image translations,
horizontal reflections, and patch
extractions.

▪ Implemented dropout layers.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46
AlexNet Architecture

Parameter counts:
• Conv1: ((11*11*3)+1) * 96 = 34944
• Conv2: ((5*5*96)+1) * 256 = 614656
• Conv3: ((3*3*256)+1) * 384 = 885120
• Conv4: ((3*3*384)+1) * 384 = 1327488
• Conv5: ((3*3*384)+1) * 256 = 884992
• Pooling layers: #Param = 0
• Total #Param ≈ 62M
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012).
Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 47
ZFNet Architecture (2013)

• Used filters of size 7x7 instead of 11x11 in AlexNet

• Used Deconvnet to visualize the intermediate results.

Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks.
In European conference on computer vision (pp. 818-833). Springer, Cham.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 48
ZFNet

Visualizing and Understanding Deep Neural Networks by Matt Zeiler - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 49
ZFNet

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 50
VGGNet Architecture (2014)

Image Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• Used filters of size 3x3 in all the convolution layers.

• 3 conv layers back-to-back have an effective receptive field of 7x7.

• Also called VGG-16 as it has 16 layers.

• This work reinforced the notion that convolutional neural networks have to have a deep network of layers in order for
this hierarchical representation of visual data to work

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition , International Conference on Learning Representations (ICLR14)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 51
GoogleNet Architecture (2014)

• Most of the architectures discussed till now


apply either of the following after each
convolution operation:
• Max Pooling
• 3x3 convolution
• 5x5 convolution

• Idea: Why can't we apply them all together


at the same time and concatenate the
feature maps.

• Problem: This will result in large number of


computations.

• Specifically, each element of the output


required O(FxFxD) computations
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR’15)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 52
GoogleNet Architecture (2014)

• Solution: Apply 1x1 convolutions

• 1x1 convolution aggregates along the depth.

• So, if we apply D1 1x1 convolutions (D1<D), we


will get an output of size W x H x D1

• So, the total number of computations will reduce to


O(FxFxD1)

• We could then apply subsequent 3x3, 5 x5 filters on


this reduced output

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR’15)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 53
GoogleNet Architecture (2014)

• Also, we might want to use different


dimensionality reductions (applying 1x1
convolutions of different sizes) before the
3x3 and 5x5 filters.

• We can also add the maxpooling layer


followed by 1x1 convolution.

• After this, we concatenate all these layers.

• This is called the Inception module.

• GoogleNet contains many such inception


The Inception module modules.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR’15)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 54
GoogleNet Architecture (2014)

Global average pooling

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• 12 times fewer parameters and 2 times more


computations than AlexNet

• Used Global Average Pooling instead of


Flattening.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., & Rabinovich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR’15)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 55
ResNet Architecture (2015)

Effect of increasing layers of shallow CNN when experimented over the CIFAR dataset

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks


Shallow CNN +
Shallow CNN Additional layers
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 56
ResNet Architecture (2015)

ResNet-34

Source: Residual Networks (ResNet) - Deep Learning - GeeksforGeeks


He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 57
ResNet Architecture (2015)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 58
Sequence Modeling using RNN

DSE 3151 DEEP LEARNING

Dr. Rohini Rao & Dr. Abhilash K Pai


Dept. of Data Science and Computer Applications
MIT Manipal
Examples of Sequence Data

▪ Speech Recognition Mary had a little lamb

▪ Music Generation

▪ Sentiment Classification

▪ DNA Sequence Analysis

▪ Machine Translation

▪ Video Activity Recognition

▪ Name Entity Recognition

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2


Examples of Sequence Data

▪ Speech Recognition

▪ Music Generation La

▪ Sentiment Classification

▪ DNA Sequence Analysis

▪ Machine Translation

▪ Video Activity Recognition

▪ Name Entity Recognition

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3


Examples of Sequence Data

▪ Speech Recognition

▪ Music Generation

▪ Sentiment Classification “Its an average movie”

▪ DNA Sequence Analysis

▪ Machine Translation

▪ Video Activity Recognition

▪ Name Entity Recognition

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4


Examples of Sequence Data

▪ Speech Recognition

▪ Music Generation

▪ Sentiment Classification

▪ DNA Sequence Analysis AGCCCCTGTGAGGAACTAG AGCCCCTGTGAGGAACTAG

▪ Machine Translation

▪ Video Activity Recognition

▪ Name Entity Recognition

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5


Examples of Sequence Data

▪ Speech Recognition

▪ Music Generation

▪ Sentiment Classification

▪ DNA Sequence Analysis

▪ Machine Translation ARE YOU FEELING SLEEPY → क्या आपको नींद आ रही है

▪ Video Activity Recognition

▪ Name Entity Recognition

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6


Examples of Sequence Data

▪ Speech Recognition

▪ Music Generation

▪ Sentiment Classification

▪ DNA Sequence Analysis

▪ Machine Translation

▪ Video Activity Recognition WAVING

▪ Name Entity Recognition

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7


Examples of Sequence Data

▪ Speech Recognition

▪ Music Generation

▪ Sentiment Classification

▪ DNA Sequence Analysis

▪ Machine Translation

▪ Video Activity Recognition

▪ Name Entity Recognition “Alice wants to discuss about “Alice wants to discuss about
Deep Learning with Bob” Deep Learning with Bob”

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8


Issues with using ANN/CNN on sequential data

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9


Issues with using ANN/CNN on sequential data

• In feedforward and convolutional neural networks, the size of the input was always fixed.

• In many applications with sequence data, the input is not of a fixed size.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10


Issues with using ANN/CNN on sequential data

• In feedforward and convolutional neural networks, the size of the input was always fixed.

• In many applications with sequence data, the input is not of a fixed size.

• Further, each input to the ANN/CNN network was independent of the previous or future inputs.

• With sequence data, successive inputs may not be independent of each other.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11


Modelling Sequence Learning Problems: Introduction

• The model needs to look at a sequence of inputs and produce an output (or outputs).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12


Modelling Sequence Learning Problems: Introduction

• The model needs to look at a sequence of inputs and produce an output (or outputs).

• For this purpose, let's consider each input as corresponding to one time step.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13


Modelling Sequence Learning Problems: Introduction

[Figure: one small network per time step, illustrated for four tasks — Auto-complete, P-o-S tagging, Movie Review, and Action Recognition (example output: “Running”) — with intermediate outputs marked “Don’t care”; the legend distinguishes the input layer, hidden layer, and output layer]

• The model needs to look at a sequence of inputs and produce an output (or outputs).

• For this purpose, let's consider each input as corresponding to one time step.

• Next, build a network for each time step/input, where each network performs the same task
  (e.g., Auto-complete: input = character, output = character).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14


How to Model Sequence Learning Problems?

1. Model the dependence between inputs.


• Eg: The next word after an ‘adjective’ is most probably a ‘noun’.

2. Account for variable number of inputs.


• A sentence can have arbitrary no. of words.
• A video can have arbitrary no. of frames.

3. Make sure that the function executed at each time step is the same.
• Because at each time step we are doing the same task.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15


Modelling Sequence Learning Problems using Recurrent Neural Networks (RNN)

Introduction

Considering the network at each time step to be a fully connected


network, the general equation for the network at each time step is:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16


Modelling Sequence Learning Problems using Recurrent Neural Networks (RNN)

Introduction

Considering the network at each time step to be a fully connected


network, the general equation for the network at each time step is:

Since we want the same function to be executed at each timestep, we should share the same network (i.e., same parameters at each timestep).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17


Recurrent Neural Networks (RNN): Introduction

• If the input sequence is of length ‘n’, we would create ‘n’ networks, one for each input, as seen previously.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

By doing so, we have addressed the issue of variable input size!!

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18


Recurrent Neural Networks (RNN): Introduction

• If the input sequence is of length ‘n’, we would create ‘n’ networks, one for each input, as seen previously.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

But, how to model the dependencies between the inputs ?

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19


Recurrent Neural Networks (RNN)

Solution: Add recurrent connection in the network.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20


Recurrent Neural Networks (RNN)

Solution: Add recurrent connection in the network.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21


Recurrent Neural Networks (RNN)
• So, the RNN equation:
Solution: Add recurrent connection in the network.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22


Recurrent Neural Networks (RNN)
• So, the RNN equation:
Solution: Add recurrent connection in the network.

U, W, V, b, c are parameters of the network

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23


Recurrent Neural Networks (RNN)
• So, the RNN equation:
Solution: Add recurrent connection in the network.

U, W, V, b, c are parameters of the network

The dimensions of each term are as follows:

𝑋𝑖 -- [1 x no. of i/p neurons]
𝑠𝑖 -- [1 x no. of neurons in the hidden state]
W -- [no. of neurons in the hidden state x no. of neurons in the hidden state]
U -- [no. of i/p neurons x no. of neurons in the hidden state]
V -- [no. of neurons in the hidden state x no. of neurons in the o/p state]
b -- [1 x no. of neurons in the hidden state]
c -- [1 x no. of neurons in the o/p state]

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24


Recurrent Neural Networks (RNN)
• So, the RNN equation (with xi and si as row vectors, matching the dimensions below) is:

  si = σ(xi U + si-1 W + b)
  yi = O(si V + c)

Solution: Add recurrent connection in the network.

U, W, V, b, c are parameters of the network.

The dimensions of each term are as follows:

𝑋𝑖 -- [1 x no. of i/p neurons]
𝑠𝑖 -- [1 x no. of neurons in the hidden state]
W -- [no. of neurons in the hidden state x no. of neurons in the hidden state]
U -- [no. of i/p neurons x no. of neurons in the hidden state]
V -- [no. of neurons in the hidden state x no. of neurons in the o/p state]
b -- [1 x no. of neurons in the hidden state]
c -- [1 x no. of neurons in the o/p state]

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• At time step i=0 there are no previous inputs, so they are typically assumed to be all zeros.
• Since the output si at time step i is a function of all the inputs from previous time steps, we could say it has a form of memory.
• A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell). A minimal code sketch of the above equations is given below.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
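A minimal NumPy sketch of the above equations (assuming tanh for the state non-linearity σ, softmax for the output non-linearity O, and the row-vector dimensions listed above; all sizes and names are illustrative, not part of the original slides):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Illustrative sizes: 4 input neurons, 3 hidden-state neurons, 4 output neurons
    n_in, n_hid, n_out = 4, 3, 4
    rng = np.random.default_rng(0)
    U = rng.normal(size=(n_in, n_hid))    # input  -> hidden
    W = rng.normal(size=(n_hid, n_hid))   # hidden -> hidden (recurrent connection)
    V = rng.normal(size=(n_hid, n_out))   # hidden -> output
    b = np.zeros((1, n_hid))
    c = np.zeros((1, n_out))

    def rnn_step(x_i, s_prev):
        # s_i = sigma(x_i U + s_{i-1} W + b);  y_i = O(s_i V + c)
        s_i = np.tanh(x_i @ U + s_prev @ W + b)
        y_i = softmax(s_i @ V + c)
        return s_i, y_i

    # s_0 is all zeros, since there are no previous inputs at time step 0
    s = np.zeros((1, n_hid))
    for x_i in rng.normal(size=(5, 1, n_in)):   # a toy sequence of 5 time steps
        s, y = rnn_step(x_i, s)                 # the same U, W, V, b, c are reused at every step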


Recurrent Neural Networks (RNN)

Compact representation of a RNN:

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26


Recurrent Neural Networks (RNN)

Unroll
Same representation as seen
previously

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• Unrolling the network through time = representing network against time axis.

• At each time step t (also called a frame) RNN receives inputs xi as well as output from previous step yi-1

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27


Input and Output Sequences

Seq-to-Seq
Vector-to-Seq

Seq-to-Vector

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
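A hedged Keras sketch of how these three input/output configurations are typically wired (layer sizes, sequence lengths, and feature counts below are illustrative assumptions):

    from tensorflow import keras

    # Sequence-to-sequence: emit an output at every time step
    seq_to_seq = keras.Sequential([
        keras.layers.SimpleRNN(20, return_sequences=True, input_shape=[None, 1]),
        keras.layers.TimeDistributed(keras.layers.Dense(1)),
    ])

    # Sequence-to-vector: keep only the output of the last time step
    seq_to_vec = keras.Sequential([
        keras.layers.SimpleRNN(20, input_shape=[None, 1]),   # return_sequences=False by default
        keras.layers.Dense(1),
    ])

    # Vector-to-sequence: repeat the input vector at every time step, then decode a sequence
    vec_to_seq = keras.Sequential([
        keras.layers.RepeatVector(10, input_shape=[5]),      # 10 output steps from a length-5 vector
        keras.layers.SimpleRNN(20, return_sequences=True),
        keras.layers.TimeDistributed(keras.layers.Dense(1)),
    ])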


Recurrent Neural Networks (RNN) : Example
[Figure: example input — temperature readings normalized to the 0–1 range]

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Recurrent Neural Networks (RNN) : Example
[Figure: example input — temperature readings normalized to the 0–1 range]

Problem: Given the temperatures of yesterday and today, predict tomorrow’s temperature.

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
Recurrent Neural Networks (RNN) : Example
[Figure: example input — temperature readings normalized to the 0–1 range]

Problem: Given the temperatures of yesterday and today, predict tomorrow’s temperature.

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Recurrent Neural Networks (RNN) : Example

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Recurrent Neural Networks (RNN) : Example

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Recurrent Neural Networks (RNN) : Example

Unrolling the feedback loop by making a copy of NN for each input value

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Recurrent Neural Networks (RNN) : Example

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Recurrent Neural Networks (RNN) : Example

Problem: Given the temperatures of 3 days (today, yesterday and the day before yesterday), predict tomorrow’s temperature.

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Recurrent Neural Networks (RNN) : Example

Problem: Given the temperatures of 3 days (today, yesterday and the day before yesterday), predict tomorrow’s temperature.

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Recurrent Neural Networks (RNN) : Example

Problem: Given the temperatures of 3 days (today, yesterday and the day before yesterday), predict tomorrow’s temperature.

So, the no. of networks = no. of inputs

Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Backpropagation in ANN : Recap

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39


Backpropagation in ANN : Recap

For simplicity, let's represent the above network as follows:

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40


Backpropagation in ANN : Recap

For simplicity, let's represent the above network as follows:

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41


Backpropagation in ANN : Recap

For simplicity, let's represent the above network as follows:

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42


Backpropagation in ANN : Recap

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43


Backpropagation in ANN : Recap

The Loss function: L(w1, b1, w2, b2, w3, b3)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44


Backpropagation in ANN : Recap

The Loss function: L(w1, b1, w2, b2, w3, b3)

w1, b1 w2, b2 w3, b3

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 45


Backpropagation in ANN : Recap

The Loss function: L(w1, b1, w2, b2, w3, b3)

w1, b1 w2, b2 w3, b3

By how much should the parameters be changed to make an efficient


decrease in the loss L ?

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46


Backpropagation in ANN : Recap

w1, b1 w2, b2 w3, b3

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 47


Backpropagation in ANN : Recap

ak-3 ak-2 ak-1 ak


w1, b1 w2, b2 w3, b3

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 48


Backpropagation in ANN : Recap

ak-3 ak-2 ak-1 ak


w1, b1 w2, b2 w3, b3

ak = σ(wk ak-1 + bk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 49


Backpropagation in ANN : Recap

ak-3 ak-2 ak-1 ak C0 =(ak – y)2


w1, b1 w2, b2 w3, b3

ak = σ(wk ak-1 + bk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 50


Backpropagation in ANN : Recap

ak-3 ak-2 ak-1 ak C0 =(ak – y)2

ak = σ(wk ak-1 + bk)

Now, if zk = wk ak-1 + bk

Then, ak = σ(zk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 51


Backpropagation in ANN : Recap

wk ak-1 bk
ak-3 ak-2 ak-1 ak C0 =(ak – y)2
zk

ak = σ(wk ak-1 + bk)


y ak

Now, if zk = wk ak-1 + bk
C0
Dependency graph Then, ak = σ(zk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 52


Backpropagation in ANN : Recap

Aim is to compute: ∂C0/∂wk

wk ak-1 bk
ak-3 ak-2 ak-1 ak C0 =(ak – y)2
zk

ak = σ(wk ak-1 + bk)


y ak

Now, if zk = wk ak-1 + bk
C0
Dependency graph Then, ak = σ(zk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 53


Backpropagation in ANN : Recap

As there is a dependency, we need to apply the chain rule:   ∂C0/∂wk = (∂zk/∂wk) (∂ak/∂zk) (∂C0/∂ak)

wk ak-1 bk
ak-3 ak-2 ak-1 ak C0 =(ak – y)2
zk

ak = σ(wk ak-1 + bk)


y ak

Now, if zk = wk ak-1 + bk
C0
Dependency graph Then, ak = σ(zk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 54


Backpropagation in ANN : Recap

∂C0/∂wk = (∂zk/∂wk) (∂ak/∂zk) (∂C0/∂ak) = ak-1 · σ’(zk) · 2(ak – y)

wk ak-1 bk
ak-3 ak-2 ak-1 ak C0 =(ak – y)2
zk

ak = σ(wk ak-1 + bk)


y ak

Now, if zk = wk ak-1 + bk
C0
Dependency graph Then, ak = σ(zk)

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 55
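A small numeric check of this chain rule, comparing the analytic gradient with a finite-difference estimate (the sigmoid activation and all numbers below are illustrative assumptions):

    import numpy as np

    sigma = lambda z: 1 / (1 + np.exp(-z))          # activation
    a_prev, w_k, b_k, y = 0.8, 0.5, 0.1, 1.0        # a_{k-1}, w_k, b_k, target y

    def cost(w):
        a_k = sigma(w * a_prev + b_k)
        return (a_k - y) ** 2                       # C0 = (a_k - y)^2

    # Analytic gradient: dC0/dwk = a_{k-1} * sigma'(z_k) * 2 * (a_k - y)
    z_k = w_k * a_prev + b_k
    a_k = sigma(z_k)
    analytic = a_prev * a_k * (1 - a_k) * 2 * (a_k - y)

    # Finite-difference estimate; the two values should agree closely
    eps = 1e-6
    numeric = (cost(w_k + eps) - cost(w_k - eps)) / (2 * eps)
    print(analytic, numeric)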


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 56


Training RNNs : Backpropagation through time (BPTT)

• For instance, consider the task of auto-completion


(predicting the next character).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 57


Training RNNs : Backpropagation through time (BPTT)

• For instance, consider the task of auto-completion


(predicting the next character).

• For simplicity we assume that there are only 4


characters in our vocabulary (d, e, p, <stop>).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 58


Training RNNs : Backpropagation through time (BPTT)

• For instance, consider the task of auto-completion


(predicting the next character).

• For simplicity we assume that there are only 4


characters in our vocabulary (d, e, p, <stop>).

• At each timestep we want to predict one of these


4 characters.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 59


Training RNNs : Backpropagation through time (BPTT)

• For instance, consider the task of auto-completion


(predicting the next character).

0.1 0.1 0.1 0.1


• For simplicity we assume that there are only 4
0.7 0.7 0.1 0.1 characters in our vocabulary (d, e, p, <stop>).
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1
• At each timestep we want to predict one of these
4 characters.

• Suppose we initialize U, V, W randomly and the


network predicts the probabilities (green block)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 60


Training RNNs : Backpropagation through time (BPTT)

• For instance, consider the task of auto-completion


(predicting the next character).

0.1 0.1 0.1 0.1


• For simplicity we assume that there are only 4
0.7 0.7 0.1 0.1 characters in our vocabulary (d, e, p, <stop>).
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1
• At each timestep we want to predict one of these
4 characters.

• Suppose we initialize U, V, W randomly and the


network predicts the probabilities (green block)

• And the true probabilities are as shown (red


block).

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 61


Training RNNs : Backpropagation through time (BPTT)

• For instance, consider the task of auto-completion


(predicting the next character).

0.1 0.1 0.1 0.1


• For simplicity we assume that there are only 4
0.7 0.7 0.1 0.1 characters in our vocabulary (d, e, p, <stop>).
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1
• At each timestep we want to predict one of these
4 characters.

• Suppose we initialize U, V, W randomly and the


network predicts the probabilities (green block)

• And the true probabilities are as shown (red


block).

• At each time step, the loss Li(θ) is calculated, where θ = {U, V, W, b, c} is the set of parameters.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 62


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 63


Training RNNs : Backpropagation through time (BPTT)

To train the RNNs we need to answer two questions:

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 64


Training RNNs : Backpropagation through time (BPTT)

To train the RNNs we need to answer two questions:

0.1 0.1 0.1 0.1 1) What is the total loss made by the model ?
0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 65


Training RNNs : Backpropagation through time (BPTT)

To train the RNNs we need to answer two questions:

0.1 0.1 0.1 0.1 1) What is the total loss made by the model ?
0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

2) How do we backpropagate this loss and update the


parameters of the network ?

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 66


Training RNNs : Backpropagation through time (BPTT)

To train the RNNs we need to answer two questions:

0.1 0.1 0.1 0.1 1) What is the total loss made by the model ?
0.7 0.7 0.1 0.1 Ans: the Sum of individual losses
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

2) How do we backpropagate this loss and update the


parameters of the network ?

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 67


Training RNNs : Backpropagation through time (BPTT)

To train the RNNs we need to answer two questions:

0.1 0.1 0.1 0.1 1) What is the total loss made by the model ?
0.7 0.7 0.1 0.1 Ans: the Sum of individual losses
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

2) How do we backpropagate this loss and update the


parameters of the network ?
Ans: BPTT by computing the partial
derivative of L w.r.t U, V, W, b, c

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 68
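A tiny sketch of the first answer (total loss = sum of the per-time-step losses), assuming cross-entropy at each step; the predicted probabilities and "true" character indices below are illustrative, not the exact numbers on the slide:

    import numpy as np

    # Predicted distribution over the 4 characters (d, e, p, <stop>) at each of 4 time steps
    preds = np.array([[0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
    true_idx = [1, 1, 2, 3]                          # index of the true character at each step

    # Per-step loss L_t(theta) = -log p(true char at t); total loss = sum over t
    per_step = [-np.log(preds[t, j]) for t, j in enumerate(true_idx)]
    total_loss = sum(per_step)
    print(per_step, total_loss)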


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 69


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 70


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

For example, if:

and

Ignoring bias and considering O as linear:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 71


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 72


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 73


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 74


Training RNNs : Backpropagation through time (BPTT)

0.1 0.1 0.1 0.1


0.7 0.7 0.1 0.1
0.1 0.1 0.7 0.7
0.1 0.1 0.1 0.1

Ordered network
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 75


Training RNNs : Backpropagation through time (BPTT)

computation is straightforward

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 76


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 77


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 78


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 79


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 80


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 81


Training RNNs : Backpropagation through time (BPTT)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 82


Back Propagation through time – Vanishing & Exploding gradient

Recall that:

Therefore:

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 83


Back Propagation through time – Vanishing & Exploding gradient

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 84


Back Propagation through time – Vanishing & Exploding gradient

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 85


Back Propagation through time – Vanishing & Exploding gradient

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 86


Back Propagation through time – Vanishing & Exploding gradient

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 87


Back Propagation through time – Vanishing & Exploding gradient

A gist of the exploding gradient (the same reasoning gives a vanishing gradient if the repeated factor is less than 1 instead of 2); a small numeric illustration follows below.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 88
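A tiny numeric illustration of this gist: repeatedly multiplying a gradient by 2 (standing in for the repeated Jacobian factor in BPTT) makes it explode, while repeatedly multiplying by 0.5 makes it vanish:

    for factor in (2.0, 0.5):
        g = 1.0
        for _ in range(50):          # 50 time steps of backpropagation through time
            g *= factor
        print(f"factor={factor}: gradient after 50 steps = {g:.3e}")
    # factor=2.0 -> ~1.1e+15 (exploding);  factor=0.5 -> ~8.9e-16 (vanishing)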


Back Propagation through time in RNNs : Issues & Solutions

1. Gradient calculations are expensive (slow training for long sequences)
   ▪ Solution: Truncated BPTT
   • Instead of looking at all ‘n’ time steps, we look at fewer time steps, allowing us to estimate rather than exactly calculate the gradient used to update the weights.

2. Exploding gradients (long sequences)

3. Vanishing gradients (long sequences)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 89


Back Propagation through time in RNNs : Issues & Solutions

1. Gradient calculations are expensive (slow training for long sequences)
   ▪ Solution: Truncated BPTT

2. Exploding gradients (long sequences)
   ▪ Solution: Gradient Clipping (a short code sketch follows below)

   Let g be the gradient.

   I. Clipping by value (element-wise): each component gi of g is limited to [-threshold, threshold]:
      if gi ≥ threshold then gi ← threshold
      if gi ≤ -threshold then gi ← -threshold

   II. Clipping by norm:
      if ‖g‖ ≥ threshold then:
         g ← threshold * g/‖g‖
      end if

3. Vanishing gradients (long sequences)
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 90


Back Propagation through time in RNNs : Issues & Solutions

1. Gradient calculations are expensive


(slow training for long sequences)
▪ Solution: Truncated BPTT

2. Exploding gradients (long sequences)


▪ Solution: Gradient Clipping

3. Vanishing gradients (long sequences)


▪ Solution: Use alternate RNN architectures such as LSTM and GRU.

Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 91


LSTM & GRU

DSE 3151 DEEP LEARNING

Dr. Rohini Rao & Dr. Abhilash K Pai


Dept. of Data Science and Computer Applications
MIT Manipal
LSTM and GRU : Introduction

• The state (si) of an RNN records information from


all previous time steps.

• At each new timestep the old information gets


morphed by the current input.

• After ‘t’ steps the information stored at time step t-


k (for some k < t) gets completely morphed.

• It would be impossible to extract the original information stored at time step t – k.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• Also, there is the Vanishing gradients problem!

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
LSTM and GRU : Introduction
• Consider a scenario where we have to evaluate the
The white board analogy: expression on a whiteboard:
Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Normally, the evaluation in white board would look
like:
ac = 5
bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11
ac(bd + a) + ad = 181

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• Now, if the white board has space to accommodate only 3 steps, the above evaluation cannot fit in the required space and would lead to loss of information.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11

ac = 5
bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11
ac(bd + a) + ad = 181

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11
ac(bd + a) + ad = 181

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
• Selectively read:
bd = 33 ac = 5
bd + a = 34 bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11
ac(bd + a) + ad = 181

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
• Selectively read:
bd = 33 ac = 5
bd + a = 34 bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11 Now the board is full
ac(bd + a) + ad = 181

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
• Selectively read:
bd = 33 ac = 5
bd + a = 34 bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11 Now the board is full
ac(bd + a) + ad = 181
• So, Selectively forget:
ac = 5

bd + a = 34

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
• Selectively read:
bd = 33 ac = 5
bd + a = 34 bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11 Now the board is full
ac(bd + a) + ad = 181
• So, Selectively forget:
ac = 5
ac(bd + a) = 170
bd + a = 34

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
• Selectively read:
bd = 33 ac = 5
bd + a = 34 bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11 Now the board is full
ac(bd + a) + ad = 181
• So, Selectively forget:
ac = 5
ac(bd + a) = 170
ad = 11

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
LSTM and GRU : Introduction

• A solution is to do the following:


Evaluate “ac(bd+a) + ad”
given that a= 1, b= 3, c= 5, d=11
• Selectively write:
ac = 5
bd = 33
ac = 5
• Selectively read:
bd = 33 ac = 5
bd + a = 34 bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11 Now the board is full
ac(bd + a) + ad = 181
• So, Selectively forget:
ac(bd + a) + ad = 181
ac(bd + a) = 170
ad = 11
Since the RNN also has a finite state size, we need to figure out a way to allow it to selectively read, write and forget

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
LSTM and GRU : Introduction

• RNN reads the document from left to right and after


every word updates the state.

• By the time we reach the end of the document the


information obtained from the first few words is
completely lost.

• In our improvised network, ideally, we would like to:

• Forget the information added by stop words (a, the,


etc.)

• Selectively read the information added by previous


sentiment bearing words (awesome, amazing, etc.)
Example: Predicting the sentiment of a review
• Selectively write new information from the current
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras word to the state.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
LSTM and GRU : Introduction
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Selectively write:

• In an RNN, the state st is defined as follows:

• Instead of passing st-1 as it is, we need to pass


(write) only some portions of it.

• To do this, we introduce a vector ot-1 which decides what fraction of each element of st-1 should be passed to the next state.

• Each element of ot-1 (restricted to be between 0 and 1) gets multiplied with st-1.

• How does the RNN know what fraction of the state to pass on?

• The RNN has to learn ot-1 along with the other parameters (W, U, V).

• New parameters to be learned are: Wo, Uo, bo.

• ot is called the output gate, as it decides how much to pass (write) to the next time step.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
LSTM and GRU : Introduction
Selectively read:

• We will now use ht-1 and xt to compute the


new state at the time step t :

• Again, to pass only useful information from


to st, we selectively read from it before
constructing the new cell state.
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• To do this we introduce another gate called as


the input gate:

• And use to selectively read the


information.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
LSTM and GRU : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Selectively forget
• How do we combine st-1 and to get the new
state?

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
LSTM and GRU : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Selectively forget • But we may not want to use the whole of st-1
• How do we combine st-1 and to get the new but forget some parts of it.
state?
• To do this a forget gate is introduced:

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
LSTM (Long Short-Term Memory)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Long-term memory

Short-term memory
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
LSTM (Long Short-Term Memory)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

• LSTM has many variants which include different number of gates and also different arrangement of gates.

• A popular variant of LSTM is the Gated Recurrent Unit (GRU).

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
LSTM Cell

▪ Neuron called a Cell

▪ FC are fully connected layers

▪ Long Term state c(t-1) traverses through forget gate


forgetting some memories and adding some new
memories

▪ Long term state c(t-1) is passed through tanh and


then filtered by an output gate, which produces
short term state h(t)

▪ Update gate- g(t) takes current input x(t) and


previous short term state h(t-1)

▪ Important parts of output g(t) goes to long term


• Note: c(t) here is the same as the state s(t) used in the earlier notation
Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
LSTM Cell

▪ Gating Mechanism- regulates information


that the network stores

Other 3 layers are gate controllers


▪ Forget gate f(t) controls which part of the
long—term state should be erased

▪ Input gate i(t) controls which part of g(t)


should be added to long term state

▪ Output gate o(t) controls which parts of long


term state should be read and output at this
time state
▪ both to h(t) and to y(t)

Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
LSTM computations

An LSTM cell can learn to recognize an important input (role of the input gate), store it in the long-term state, preserve it for as long as it is needed (role of the forget gate), and extract it whenever it is needed.

Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for their connection to the input
vector x(t).
Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the
previous short-term state h(t–1).
bi, bf, bo, and bg are the bias terms for each of the four layers.
Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
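The computations referred to above (shown as an image in the original slide) follow the standard LSTM cell form with exactly these weight names; a minimal NumPy sketch, assuming row vectors and illustrative sizes:

    import numpy as np

    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    n_in, n_h = 3, 5                                         # illustrative sizes
    rng = np.random.default_rng(0)
    Wx = {k: rng.normal(size=(n_in, n_h)) for k in "ifog"}   # Wxi, Wxf, Wxo, Wxg
    Wh = {k: rng.normal(size=(n_h, n_h)) for k in "ifog"}    # Whi, Whf, Who, Whg
    b  = {k: np.zeros((1, n_h)) for k in "ifog"}             # bi, bf, bo, bg

    def lstm_step(x_t, h_prev, c_prev):
        i_t = sigmoid(x_t @ Wx["i"] + h_prev @ Wh["i"] + b["i"])   # input gate
        f_t = sigmoid(x_t @ Wx["f"] + h_prev @ Wh["f"] + b["f"])   # forget gate
        o_t = sigmoid(x_t @ Wx["o"] + h_prev @ Wh["o"] + b["o"])   # output gate
        g_t = np.tanh(x_t @ Wx["g"] + h_prev @ Wh["g"] + b["g"])   # main (update) layer
        c_t = f_t * c_prev + i_t * g_t        # new long-term state: forget some, add some
        h_t = o_t * np.tanh(c_t)              # new short-term state (also the output y_t)
        return h_t, c_t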
Gated Recurrent Unit (GRU)

Gates: States:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Gated Recurrent Unit (GRU)

-1

Gates: States:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Gated Recurrent Unit (GRU)

-1

Gates: States:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
Gated Recurrent Unit (GRU)

-1

Gates: States:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
Gated Recurrent Unit (GRU)

-1

Gates: States:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
Gated Recurrent Unit (GRU)
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

-1

Gates: States:

No explicit forget gate (the forget gate and input gates are tied)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Gated Recurrent Unit (GRU)
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

-1

Gates: States:

The gates depend directly on st-1 and not the intermediate ht-1 as in the case of LSTMs
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
Gated Recurrent Unit CELL (Kyunghyun Cho et al, 2014)
The main simplifications of LSTM are:
• Both state vectors (short and long term) are
merged into a single vector h(t).

• Gate controller z(t) controls both the forget gate and the input gate.
• If the gate controller outputs
• 1, the forget gate is open and the input gate is
closed.
• 0, the opposite happens
• whenever a memory must be written, the location
where it will be stored is erased first.

• No output gate, the full state vector is output at


every time step.

• Reset gate controller r(t) that controls which


part of the previous state will be shown to the
main layer g(t).

Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
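A matching NumPy sketch of the GRU cell described above (same hedges as before: standard equations, row vectors, illustrative sizes and names):

    import numpy as np

    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    n_in, n_h = 3, 5
    rng = np.random.default_rng(0)
    Wx = {k: rng.normal(size=(n_in, n_h)) for k in "zrg"}    # Wxz, Wxr, Wxg
    Wh = {k: rng.normal(size=(n_h, n_h)) for k in "zrg"}     # Whz, Whr, Whg
    b  = {k: np.zeros((1, n_h)) for k in "zrg"}              # bz, br, bg

    def gru_step(x_t, h_prev):
        z_t = sigmoid(x_t @ Wx["z"] + h_prev @ Wh["z"] + b["z"])           # update gate (ties input/forget)
        r_t = sigmoid(x_t @ Wx["r"] + h_prev @ Wh["r"] + b["r"])           # reset gate
        g_t = np.tanh(x_t @ Wx["g"] + (r_t * h_prev) @ Wh["g"] + b["g"])   # main layer sees a reset state
        h_t = z_t * h_prev + (1 - z_t) * g_t    # single merged state; no separate output gate
        return h_t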
LSTM vs GRU computation

• GRU performance is good, but it may show a slight dip in accuracy compared to the LSTM
• However, the GRU has fewer trainable parameters, which makes it advantageous to use (see the parameter count below)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
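One quick way to see the parameter difference is to count trainable parameters in Keras (sizes below are illustrative; per unit group, an LSTM layer has 4 weight blocks against the GRU's 3):

    from tensorflow import keras

    n_units, n_features = 32, 10
    lstm = keras.Sequential([keras.layers.LSTM(n_units, input_shape=[None, n_features])])
    gru  = keras.Sequential([keras.layers.GRU(n_units, input_shape=[None, n_features],
                                              reset_after=False)])   # classic GRU formulation

    print(lstm.count_params())   # 4 * (n_units*(n_units + n_features) + n_units) = 5504
    print(gru.count_params())    # 3 * (n_units*(n_units + n_features) + n_units) = 4128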
Avoiding vanishing gradients with LSTMs: Intuition
• During forward propagation the gates control the flow of information.

• They prevent any irrelevant information from being written to the state.

• Similarly during backward propagation they control the flow of gradients.

• It is easy to see that during backward pass the gradients will get multiplied by the gate.

• If the state at time t-1 did not contribute much to the state at time t then during backpropagation the gradients
flowing into st-1 will vanish

• The key difference from vanilla RNNs is that the flow of information and gradients is controlled by the gates which
ensure that the gradients vanish only when they should.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Different RNNs

Vanilla RNNs

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Different RNNs

Eg: Image Captioning


Image -> Sequence of words

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Different RNNs

Eg: Sentiment classification


Sequence of words -> Sentiment

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Different RNNs

Eg: Machine Translation


Sequence of words -> Sequence of words

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Different RNNs

Eg: Video Classification on frame level

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Deep RNNs

RNNs that are deep not only in the time


direction but also in the input-to-output
direction.

Source: Deep Recurrent Neural Networks — Dive into Deep Learning 1.0.0-alpha1.post0 documentation (d2l.ai)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Bi-Directional RNNs: Intuition

Source: codebasics - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Bi-Directional RNNs: Intuition

• The o/p at the third time step (where the input is the string “apple”) depends only on the previous two i/ps

Source: codebasics - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
Bi-Directional RNNs

• Adding an additional backward layer with connections as shown above makes the o/p at a time step depend on both
previous as well as future i/ps.

Source: codebasics - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Bi-directional RNNs

▪ Example - speech detection


▪ I am ___.

▪ I am ___ hungry.

▪ I am ___ hungry, and I can eat half a


cake.
▪ Regular RNNs are causal
▪ look at past and present inputs to generate
output.
▪ Use 2 recurrent layers on the same inputs
▪ One reading words from left to right
▪ Another reading words from right to left
▪ Combine their outputs at each time step

Source: Bidirectional Recurrent Neural Networks — Dive into Deep Learning 1.0.0-alpha1.post0 documentation (d2l.ai)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
Bi-directional RNN computation

Ht (Forward) = A(Xt * WXH (forward) + Ht-1 (Forward) * WHH (Forward) + bH (Forward))

Ht (Backward) = A(Xt * WXH (Backward) + Ht+1 (Backward) * WHH (Backward) + bH (Backward))

where,

A = activation function,

W = weight matrix

b = bias

The output at any given hidden state is :

Yt = Ht * WAY + by , where Ht is a concatenation of Ht (Forward) and Ht (Backward)


Bidirectional Recurrent Neural Network - GeeksforGeeks

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
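A short Keras sketch of the same idea: any recurrent layer can be wrapped in a Bidirectional wrapper, which runs one copy left-to-right and another right-to-left and combines their outputs at each time step (sizes are illustrative):

    from tensorflow import keras

    model = keras.Sequential([
        # Two GRUs of 10 units each, one per direction; their outputs are concatenated,
        # so each time step produces 2 x 10 = 20 features.
        keras.layers.Bidirectional(keras.layers.GRU(10, return_sequences=True),
                                   input_shape=[None, 8]),
        keras.layers.TimeDistributed(keras.layers.Dense(1)),
    ])
    model.summary()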
Generating Shakespearean Text using a Character RNN

▪ “The Unreasonable Effectiveness of Recurrent Neural Networks” – Andrej Karpathy (2015)
▪ 3-layer RNN with 512 hidden nodes on each layer
▪ The Char-RNN was trained on Shakespeare’s work and used to generate novel text, one character at a time
▪ Chop the sequential dataset into multiple windows (see the sketch below)
▪ PANDARUS:
Alas, I think he shall be come approached and
the day
When little srain would be attain'd into being
never fed,
And who is but a chain and subjects of his
death,
I should not sleep.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
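A hedged tf.data sketch of the "chop the sequential dataset into multiple windows" step (the window length is illustrative, and encoded_text stands in for the integer-encoded Shakespeare text):

    import tensorflow as tf

    window_length = 101             # 100 input characters + 1 character to predict
    encoded_text = tf.range(1000)   # placeholder for the integer-encoded corpus

    dataset = tf.data.Dataset.from_tensor_slices(encoded_text)
    dataset = dataset.window(window_length, shift=1, drop_remainder=True)    # overlapping windows
    dataset = dataset.flat_map(lambda window: window.batch(window_length))   # nested datasets -> tensors
    dataset = dataset.shuffle(10_000).batch(32)
    dataset = dataset.map(lambda windows: (windows[:, :-1], windows[:, 1:])) # inputs, shifted targets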
Stateful RNN

▪ Stateless RNNs
▪ at each training iteration the model starts
with a hidden state full of 0s
▪ Update this state at each time step
▪ Discards the output at the final state
when moving onto next training batch
▪ Stateful RNN
▪ Uses sequential nonoverlapping input
sequences
▪ Preserves the final state after processing
one training batch
▪ use it as initial state for next training
batch
▪ Model will learn long-term patterns
despite only backpropagating through
short sequences
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
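A hedged Keras sketch of a stateful character model; the key differences from the stateless case are stateful=True, a fixed batch_input_shape, and resetting the preserved state at the start of each epoch (vocabulary size and batch size are illustrative):

    from tensorflow import keras

    max_id, batch_size = 39, 32     # vocabulary size and batch size (assumptions)
    model = keras.Sequential([
        keras.layers.GRU(128, return_sequences=True, stateful=True,
                         batch_input_shape=[batch_size, None, max_id]),
        keras.layers.TimeDistributed(keras.layers.Dense(max_id, activation="softmax")),
    ])

    class ResetStatesCallback(keras.callbacks.Callback):
        # The preserved state only makes sense within one pass over the text,
        # so reset it when a new epoch (a new pass) begins.
        def on_epoch_begin(self, epoch, logs=None):
            self.model.reset_states()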
Encoder Decoder Models

DSE 5251 DEEP LEARNING

Rohini R Rao & Abhilash K Pai


Department of Data Science and Computer Applications
MIT Manipal
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote


=

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote


=

We are interested in:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote


=

We are interested in:

jth word in the vocabulary

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote


=

We are interested in:

jth word in the vocabulary

Using an RNN we will compute this as:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote


=

We are interested in:

jth word in the vocabulary

Using an RNN we will compute this as:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the previous t-1 words, predict the t-th word

For simplicity let us denote


=

We are interested in:

jth word in the vocabulary

Using an RNN we will compute this as:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Encoder Decoder Models : Introduction

• Shorthand notations:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Encoder Decoder Models : Introduction

Task: generate a sentence given an image

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Encoder Decoder Models : Introduction

There are many ways of making P(yt = j) conditional on fc7(I)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Encoder Decoder Models : Introduction

Fully connected layer #7 in CNN (here, AlexNet)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
Encoder Decoder Models : Introduction

Fully connected layer #7 in CNN (here, AlexNet)

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
Encoder Decoder Models : Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Encoder Decoder Models : Introduction

The o/p of CNN is concatenated with the input


embedding and then fed to the RNN as input at each
time step.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
Encoder Decoder Models : Introduction

The o/p of CNN is concatenated with the input


embedding and then fed to the RNN as input at each
time step.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Encoder Decoder Models : Introduction

• This is typical encoder decoder


architecture

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
Applications of Encoder Decoder Models: Image Captioning

Represents the input


embedding of the
output obtained at
time step t-1

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Applications of Encoder Decoder Models: Textual Entailment

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Applications of Encoder Decoder Models: Textual Entailment

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Applications of Encoder Decoder Models: Transliteration

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Applications of Encoder Decoder Models: Image Question Answering

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Applications of Encoder Decoder Models: Document Summarization

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Applications of Encoder Decoder Models: Video Captioning

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Applications of Encoder Decoder Models: Video Classification

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Applications of Encoder Decoder Models: Machine Translation

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
Applications of Encoder Decoder Models: Machine Translation

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Applications of Encoder Decoder Models: Machine Translation

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
Machine Translation as Conditional Learning Problem

• Goal: model P(y1, y2, y3, …, yn | x1, x2, …, xm)

• Jane visite l’Afrique en septembre (x: French)

• Jane is visiting Africa in September (y: English)

• Jane is going to be visiting Africa in September (y: English)

• In September, Jane will visit Africa. (y: English)

• Y = argmax over (y1, …, yn) of P(y1, y2, y3, …, yn | x1, x2, …, xm)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
Beam Search

“Comment vas-tu?” – How are you?


“Comment vas-tu jouer?” – How will you play?
Model could output “How will you?”
▪ Since model greedily outputs the most likely word at each step, it ended with suboptimal translation
▪ Can model go back and correct mistakes?
▪ Beam Search (a minimal code sketch is given after this list)
▪ Keeps track of a short list of the k most promising sentences
▪ Parameter k is beam width
▪ At each decoder step it tries to extend them by 1 word
▪ Step 1: First word could be “How” (Prob: 75%), “what” or “you”
▪ Step 2: 3 copies of model made to find the next word
▪ Step 3: First model will look for next word after “How” by computing conditional probabilities
▪ Output could be “will (Prob:36%) “are”, “do”,
▪ Step 4: Compute the probabilities of each of the two word sentences the model considered and so on
▪ Advantage is that good translation for fairly short sentences can be obtained
▪ Disadvantage – really bad at translating long sentences.
▪ Due to the limited short-term memory of RNNs, this led to the ATTENTION MECHANISM

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
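A minimal sketch of the beam-search loop described above, assuming a hypothetical next_word_probs(prefix) function that returns {word: probability} for the next word given the words decoded so far:

    import math

    def beam_search(next_word_probs, k=3, max_len=10, eos="<eos>"):
        beams = [(0.0, [])]                              # (log-probability, words decoded so far)
        for _ in range(max_len):
            candidates = []
            for logp, words in beams:
                if words and words[-1] == eos:           # finished sentence: keep it as-is
                    candidates.append((logp, words))
                    continue
                for word, p in next_word_probs(words).items():
                    candidates.append((logp + math.log(p), words + [word]))
            # Keep only the k most promising (partial) sentences -- k is the beam width
            beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        return beams[0][1]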
Attention Mechanism (Bahdanau et al , 2014)

▪ Technique that allowed the decoder to focus on the


appropriate words at each time steps
▪ Ensures that the path from an input word to its translation is much shorter
▪ For instance , at the time step where the decoder
needs to output the word ‘lait’ it focusses on the word
“milk”
▪ The path from an input word to its translation is shorter, so it is not affected by the short-term memory limitations of RNNs
▪ Visual Attention
▪ For example : Generating image captions using visual
attentions
▪ CNN processes image and outputs some feature maps
▪ Then decoder RNN with attention generates the caption ,
one word at a time
▪ Leads to Explainability (Ribeiro et al. 2016)
▪ What led the model to produce its output
▪ Which leads to fairness in results
▪ Google apologizes after its Vision AI produced racist
results

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 45
Attention Mechanism (Bahdanau et al , 2014)

▪ Attention is proposed as a method to both align and translate.


▪ Alignment
▪ is the problem in machine translation that identifies which parts of the input sequence are
relevant to each word in the output
▪ translation
▪ is the process of using the relevant information to select the appropriate output
▪ “… we introduce an extension to the encoder–decoder model which learns to align and translate
jointly. Each time the proposed model generates a word in a translation, it (soft-)searches for a set
of positions in a source sentence where the most relevant information is concentrated. The model
then predicts a target word based on the context vectors associated with these source positions and
all the previous generated target words.”
▪ Neural Machine Translation by Jointly Learning to Align and Translate, 2015.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46
Attention Mechanism: Introduction

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 47
Attention Mechanism: Introduction

• In a typical encoder decoder network, each time step of the decoder uses
the information obtained from the last time step of the encoder.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 48
Attention Mechanism: Introduction

• In a typical encoder decoder network, each time step of the decoder uses
the information obtained from the last time step of the encoder.

• However, the translation would be effective if the network could focus/or


pay attention to specific input word that would contribute to the
prediction.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 49
Attention Mechanism: Introduction

• Consider the task of machine translation:

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 50
Attention Mechanism: Introduction

While predicting each word in the o/p we would like our model to mimic
humans and focus on specific words in the i/p

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 51
Attention Mechanism: Introduction

While predicting each word in the o/p we would like our model to mimic
humans and focus on specific words in the i/p

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 52
Attention Mechanism: Introduction

While predicting each word in the o/p we would like our model to mimic
humans and focus on specific words in the i/p

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 53
Attention Mechanism: Introduction

• Essentially, at each time step, a distribution on the input words must be introduced.

• This distribution tells the model how much attention to pay to each input word at each time step.
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 54
Encoder Decoder with Attention Mechanism

• To do this, we could just


take a weighted average of
the corresponding word
representations and feed it
to the decoder

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 55
Encoder Decoder with Attention Mechanism

• To do this, we could just


take a weighted average of
the corresponding word
representations and feed it
to the decoder

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 56
Encoder Decoder with Attention Mechanism

• To do this, we could just


take a weighted average of
the corresponding word
representations and feed it
to the decoder

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 57
Encoder Decoder with Attention Mechanism

St

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 58
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we


define the following function:

St

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 59
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we


define the following function:

hj

St

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 60
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we


define the following function:
Can be
considered as a
hj separate feed
forward network

St

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 61
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we


define the following function:
Can be
considered as a
hj separate feed
forward network

St • This quantity captures the importance of the jth input


word for decoding the tth output word.

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 62
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we


define the following function:
Can be
considered as a
hj separate feed
forward network

St • This quantity captures the importance of the jth input


word for decoding the tth output word.

• We could compute by normalizing these weights


using the softmax function.

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 63
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we


define the following function:
Can be
considered as a
hj separate feed
forward network

St • This quantity captures the importance of the jth input


word for decoding the tth output word.

• We could compute by normalizing these weights


using the softmax function.

hj

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 64
Encoder Decoder with Attention Mechanism

• To enable the network to focus on certain data we define the following function:

  e_jt = f_ATT(s_t-1, h_j)

  (f_ATT can be considered as a separate feed forward network; s_t-1 is the previous decoder
  state and h_j is the jth encoder hidden state.)

• This quantity captures the importance of the jth input word for decoding the tth output word.

• We could compute α_jt by normalizing these weights using the softmax function:

  α_jt = exp(e_jt) / Σ_j' exp(e_j't)

• Where α_jt denotes the probability of focusing on the jth word to produce the tth output word
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 65
Encoder Decoder with Attention Mechanism

• Introducing the parametric form of e_jt (with s_t-1 the previous decoder state and h_j the jth
encoder hidden state):

  e_jt = V_att^T tanh(U_att s_t-1 + W_att h_j)

  α_jt = softmax(e_jt)

  c_t = Σ_j α_jt h_j

• Where, c_t (context) gives a weighted sum over the inputs.

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 73
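A minimal NumPy sketch of the attention step above (not from the original slides): it computes e_jt = V_att^T tanh(U_att s_t-1 + W_att h_j), normalizes the scores with a softmax to get α_jt, and forms the context c_t as a weighted average of the encoder states. The array names and dimensions are illustrative assumptions.

import numpy as np

def additive_attention(s_prev, H, U_att, W_att, V_att):
    # s_prev : (d,)   previous decoder state s_(t-1)
    # H      : (T, d) encoder hidden states h_1 ... h_T
    # e_jt = V_att^T tanh(U_att s_(t-1) + W_att h_j), for every input position j
    scores = np.tanh(s_prev @ U_att + H @ W_att) @ V_att          # shape (T,)
    # alpha_jt = softmax over the input positions
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()
    # c_t = sum_j alpha_jt * h_j  (weighted average of the encoder states)
    context = alpha @ H                                           # shape (d,)
    return alpha, context

# toy usage with random stand-ins for learned parameters
d, T = 8, 5
rng = np.random.default_rng(0)
alpha, c_t = additive_attention(rng.normal(size=d), rng.normal(size=(T, d)),
                                rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                                rng.normal(size=d))
print(alpha.sum())   # ~1.0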
Encoder Decoder with Attention Mechanism

[Figure: encoder-decoder with attention, showing the decoder state s_t attending over the encoder states h_j]

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 74
Encoder Decoder with Attention Mechanism: Visualization

Example output of attention-based neural machine translation model


Bahdanau et al. 2015

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 75
Attention over Images

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 76
Attention over Images

• For a CNN (e.g., VGG-16) we would consider a convolution layer to be the input to the decoder,
instead of the fully connected layers.

• This is because the information about the image is contained in the feature maps of the
convolution layer.

• Therefore, we could add attention weights to each pixel of the feature map volume to make the
model focus on a particular pixel or region in the image.
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 77
Attention over Images

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 79
Evaluation of Machine Translation: BLEU score

• Bilingual Evaluation Understudy (BLEU) is a score for comparing a candidate translation of text to
one or more reference translations.

• Correlates to human judgment of quality

• Scores are calculated for individual translated segments by comparing them with a set of good
quality reference translations.

• Scores are then averaged over the whole corpus to reach an estimate of the translation's overall
quality.

• Intelligibility or grammatical correctness are not taken into account

• Number between 0 and 1

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 80
Evaluation of Machine Translation: BLEU score

Precision (uni-gram):

p1 = (1 + 1 + 1 + 1 + 1 + 1) / 6 = 1

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 81
Evaluation of Machine Translation: BLEU score

Modified Precision (uni-gram):

p1 = (1 + 0 + 0 + 0 + 0 + 0) / 6 = 0.16

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 82
Evaluation of Machine Translation: BLEU score

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 83
Evaluation of Machine Translation: BLEU score

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 84
Evaluation of Machine Translation: BLEU score

Brevity Penalty

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 85
Evaluation of Machine Translation: BLEU score

Brevity Penalty

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 86
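A small Python sketch (an illustration, not taken from the slides) of the two ingredients discussed above: clipped ("modified") n-gram precision and the brevity penalty. The function names and toy sentences are assumptions; the full corpus-level BLEU combines precisions for n = 1..4 with a geometric mean.

from collections import Counter
import math

def modified_precision(candidate, references, n=1):
    # Clipped n-gram precision: a candidate n-gram is credited at most as many
    # times as it occurs in any single reference translation.
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(zip(*[ref[i:] for i in range(n)])).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

def brevity_penalty(candidate, references):
    # Penalizes candidate translations that are shorter than the closest reference.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c >= r else math.exp(1 - r / c)

cand = "the the the the the the".split()
refs = ["the cat is on the mat".split()]
print(modified_precision(cand, refs, n=1), brevity_penalty(cand, refs))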
Evaluation of machine translation

NIST
▪ n-gram precision
▪ counts how many n-grams (n = 1,…,4) match their n-gram counterpart in the reference
translations.
▪ BLEU simply calculates n-gram precision adding equal weight to each segment
▪ NIST also calculates how informative a particular n-gram is.
▪ when a correct n-gram is found, the rarer that n-gram is, the more weight it will be given

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 87
Evaluation of machine translation

▪ WORD ERROR RATE


▪ metric based on the Levenshtein distance at the word level.
▪ based on the calculation of the number of words that differ between a piece of machine-translated text and a
reference translation.
▪ METEOR (Metric for Evaluation of Translation with Explicit ORdering)
▪ Recall – the proportion of the matched n-grams out of the total number of n-grams in the reference translation
▪ based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision
▪ Features include stemming and synonymy matching along with exact word matching
▪ LEPOR (Length Penalty, Precision, n-gram Position difference Penalty and Recall)
▪ Language Independent – overcomes language bias problem
▪ factors of
▪ enhanced length penalty – the translation is penalized if it is longer or shorter than the reference translation
▪ precision – score reflects the accuracy of the hypothesis translation
▪ recall – score reflects the loyalty of the hypothesis translation to the reference translation or source language.
▪ n-gram word order penalty- is designed for the different position orders between the hypothesis translation
and reference translation.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 88
Hierarchical Attention : Introduction to Hierarchical Models

Task: Chat Bot

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 89
Hierarchical Attention : Introduction to Hierarchical Models

Task: Chat Bot

Encoding of utterances

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 90
Hierarchical Attention : Introduction to Hierarchical Models

Task: Chat Bot

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 91
Hierarchical Attention : Introduction to Hierarchical Models

Task: Chat Bot

Encoding of sequence of utterances

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 92
Hierarchical Attention : Introduction to Hierarchical Models

Task: Chat Bot

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 93
Hierarchical Attention : Introduction to Hierarchical Models

Task: Document Summarization

Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 94
Document Classification using Hierarchical Attention Networks

▪ To understand the main message of a document


▪ Not every word in a sentence and every sentence in a document are equally important
▪ Meaning of word depends on context
▪ For example - “The bouquet of flowers is pretty” vs. “The food is pretty bad”.
▪ HAN
▪ considers the hierarchical structure of documents (document - sentences - words)
▪ Includes an attention mechanism that is able to find the most important words and sentences in
a document while taking the context into consideration

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 95
Hierarchical Attention Networks

Yang et al. Hierarchical Attention Networks for Document Classification, Proceedings of NAACL-HLT 2016

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 96
Hierarchical Attention Networks (Yang et al. 2016)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 97
Transformers & Recursive Networks

DSE 3151 DEEP LEARNING

B.Tech Data Science & Engineering


August 2023
Rohini R Rao & Abhilash K Pai
Department of Data Science and Computer Applications
MIT Manipal
Slide -4 of 5
Transformers

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Transformers

• Vaswani et al. 2017 proposed the transformer model,


entirely built on self attention mechanism without using
sequence aligned recurrent architectures.

• Key components:
• Self attention
• Multi-head attention
• Positional encoding
• Encoder-Decoder architecture

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Transformers

The entire network can be viewed as an encoder decoder architecture

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Transformers

The encoding component is a stack of encoders and the decoding component is
a stack of decoders of the same number (in the paper, this number = 6)
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Transformers

• The encoders are all identical in structure (yet they do not share weights).
• Each one is broken down into two sub-layers
• The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other
words in the input sentence as it encodes a specific word – to the feed forward neural network.
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Transformers

The decoder has both those layers, but between them is an attention layer that helps the decoder
focus on relevant parts of the input sentence (similar what attention does in seq2seq models).

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Transformers

• A word in each position flows through its own


path in the encoder.

• There are dependencies between these paths in


the self-attention layer.

• The feed-forward layer does not have those


dependencies.

• Thus, the various paths can be executed in


parallel while flowing through the feed-forward
layer.

• This is a key property of the transformer.

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Transformers : Self attention

S1: Animal didn’t cross the street because it was too tired

S2: Animal didn’t cross the street because it was too wide

• In sentence S1, “it” refers to “animal” and in sentence S2 “it” refers to


“street”.

• Such deductions are hard for traditional sequence to sequence models.

• While processing each word in a sequence, self-attention mechanism


allows the model to decide as to which other parts of the same
sequence it needs to focus on, which makes such deductions easier
and allows better encoding.

• RNNs which maintain a hidden state to incorporate the representation


of previous vectors are no longer needed!

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Transformers : Self attention

Step 1: Create three vectors from encoder’s


input vector (xi):

• Query vector (qi)


• Key vector (ki)
• Value vector (vi)

These are created by multiplying the input


with weight matrices Wq, Wk, Wv, learned
during training.

• Note: In the paper by Vaswani et al., the q, k and v vectors have a dimension of 64 and the input vector x has a dimension of 512.
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Transformers : Self attention

Step 2: Calculate self attention scores of each


word against the other words in the same
sentence.
• This is done by taking the dot product
of query vector with the key vector of
respective words.

• The scores are then divided by the square
root of the dimension of the key vectors (√dk).

• This is called Scaled Dot-Product attention


- it leads to more stable gradients.

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Transformers : Self attention

Step 3: Softmax is used to obtain normalized


probability scores; determines how much each
word will be expressed at this position.

Step 4: Multiply each value vector by the


Softmax score; to keep values of words we
want to focus on and drown out irrelevant
words.

Step 5: Sum up the weighted value vectors ;


produces the output of self-attention layer at
this position (for first word)

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Transformers : Self attention

• In the actual implementation, however, Step 1 to Step 4 is done in matrix form for faster processing.

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
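A compact NumPy sketch of scaled dot-product self-attention computed in matrix form, as in the steps above. The dimensions follow the paper (d_model = 512, d_k = 64); the random matrices are only placeholders for learned weights, and the function names are illustrative.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X : (T, d_model) word embeddings of one sentence
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # Step 1: queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # Step 2: scaled dot products
    weights = softmax(scores, axis=-1)         # Step 3: softmax over key positions
    return weights @ V                         # Steps 4-5: weighted sum of value vectors

T, d_model, d_k = 4, 512, 64
rng = np.random.default_rng(0)
Z = self_attention(rng.normal(size=(T, d_model)),
                   rng.normal(size=(d_model, d_k)) * 0.01,
                   rng.normal(size=(d_model, d_k)) * 0.01,
                   rng.normal(size=(d_model, d_k)) * 0.01)
print(Z.shape)  # (4, 64)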
Transformers : Multi-head Attention

• Multi-head attention improves the performance of


attention layer in two ways:

• It expands the model’s ability to focus on different


positions.

• It gives the attention layer multiple “representation


subspaces” i.e., we have not only one, but multiple
sets of Query/Key/Value weight matrices.

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Transformers : Multi-head Attention

With multi-headed attention, we maintain


separate Q, K, V weight matrices for each
head resulting in different Q, K, V matrices.

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
Transformers : Multi-head Attention

• Eight different output matrices (one Z matrix per attention head) are obtained.

• However, the feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a
vector for each word).

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Transformers : Multi-head Attention

To tackle the issue, the eight matrices are concatenated and multiplied with additional
weight matrix WO
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
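Continuing the same kind of sketch, a hedged illustration of multi-head attention: each head has its own Wq, Wk, Wv, the per-head outputs are concatenated, and the concatenation is multiplied by the additional weight matrix WO. All sizes and the random initialisation are illustrative assumptions.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def one_head(X, Wq, Wk, Wv):
    # Scaled dot-product attention for a single head
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(X, heads, Wo):
    # heads : list of (Wq, Wk, Wv) tuples, one per head (8 in the paper).
    # Each head yields its own output matrix; the outputs are concatenated
    # and multiplied by the additional weight matrix Wo to get (T, d_model).
    Z = np.concatenate([one_head(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads], axis=-1)
    return Z @ Wo

T, d_model, d_k, n_heads = 4, 512, 64, 8
rng = np.random.default_rng(0)
heads = [tuple(rng.normal(size=(d_model, d_k)) * 0.01 for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_k, d_model)) * 0.01
out = multi_head_attention(rng.normal(size=(T, d_model)), heads, Wo)
print(out.shape)  # (4, 512)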
Transformers : Multi-head Attention

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
Transformers : Multi-head Attention

• As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" –
in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Transformers : Positional Encoding

• Order of the sequence conveys important information for machine translation tasks and language
modeling.

• Position encoding is a way to account for the order of words in the input sequence.

• The positional information of each input token in the sequence is added to its input embedding
vector (of dimension 512 in the paper).
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
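A short NumPy sketch of the sinusoidal positional encoding used in the paper; the encoding is added element-wise to the 512-dimensional input embeddings. The helper name and the toy embeddings are assumptions.

import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# added (not concatenated) to the input embeddings
embeddings = np.random.randn(10, 512) * 0.01
x = embeddings + positional_encoding(10, 512)
print(x.shape)  # (10, 512)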
Transformers : Encoder

Layer Norm
statistics are calculated across all features
and all elements (words), for each
instance (sentence) independently.

Attention? Attention! | Lil'Log (lilianweng.github.io)


The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Transformers : Decoder

• The Encoder
• processes the input sequence.
• The output of the top encoder is transformed
into a set of attention vectors K and V.
• Vectors K & V are used by each decoder in its
“encoder-decoder attention” layer which helps the
decoder focus on appropriate places in the input
sequence
• The Decoder
• has masked multi-head attention layer to
prevent the positions from seeing subsequent
positions
• The “Encoder-Decoder Attention” layer creates
its Queries matrix from the layer below it, and
takes the Keys and Values matrix from the
output of the encoder stack
Attention? Attention! | Lil'Log (lilianweng.github.io)
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Transformers: Encoders & Decoders

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Transformers : Decoder

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
Massive Deep Learning Language models

▪ Language models
▪ estimate the probability of words appearing in a sentence, or of the sentence itself existing.
▪ building blocks in a lot of NLP applications
▪ Massive deep learning language models
▪ pretrained using an enormous amount of unannotated data to provide a general-purpose deep
learning model.
▪ Downstream users can create task-specific models with smaller annotated training datasets (transfer
learning)
▪ Tasks executed with BERT and GPT models:
• Natural language inference
• enables models to determine whether a statement is true, false or undetermined based on a
premise.
• For example, if the premise is “tomatoes are sweet” and the statement is “tomatoes are fruit” it
might be labelled as undetermined.
• Question answering
• model receives a question regarding text content and returns the answer in text, specifically
marking the beginning and end of each answer.
• Text classification
• is used for sentiment analysis, spam filtering, news categorization

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
GPT (Generative Pre-Training) by OpenAI: Pre-training

▪ Unsupervised learning served as pre-training objective for supervised fine-tuned


models
▪ generative language model using unlabeled data
▪ then fine-tuning the model by providing examples of specific downstream tasks
like classification, sentiment analysis, textual entailment etc.
▪ Semi-supervised learning using 3 components
1. Unsupervised Language Modelling (Pre-training): maximise the likelihood

   L1(U) = Σ_i log P(u_i | u_(i-k), …, u_(i-1); Θ)

   where
   i) k is the size of the context window, and
   ii) the conditional probability P is modeled with the help of a neural network (NN) with parameters Θ

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
GPT (Generative Pre-Training) by Open AI

2. Supervised Fine-Tuning: maximising the likelihood of observing label y, given features or tokens
x_1,…,x_n:

   L2(C) = Σ_(x,y) log P(y | x_1, …, x_n)

   where C is the labeled dataset made up of training examples.

3. Auxiliary learning objective for supervised fine-tuning, to get better generalisation and faster
convergence:

   L3(C) = L2(C) + λ · L1(C)

   where L1(C) is the auxiliary objective of learning the language model and λ is the weight
   given to this secondary learning objective; λ was set to 0.5.
▪ Supervised fine-tuning is achieved by adding a linear and a softmax layer to the transformer model to get the task
labels for downstream tasks.
▪ “zero-shot” framework
▪ measure a model’s performance having never been trained on the task.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
GPT-n series created by OpenAI (2018 onwards)

▪ Generative models are a type of statistical


model that are used to generate new data
points.
▪ learn the underlying relationships between
variables in a dataset in order to generate new
data points similar to those in the dataset.
▪ Trained on BooksCorpus dataset contains
over 7,000 unique unpublished books from
a variety of genres
▪ 12-layer decoder-only transformer with
masked self-attention heads
▪ GPT-2
▪ Application - generate long passages of
coherent text
▪ https://transformer.huggingface.co/doc/gpt2-
large

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
GPT-3 (2020)

▪ step to Artificial General Intelligence(AGI)


“I am open to the idea that a worm with 302 neurons is
conscious, so I am open to the idea that GPT-3 with 175 billion
parameters is conscious too.” — David Chalmers

▪ 3rd-gen language prediction model with a capacity
of 175 billion parameters.
▪ trained with 499 Billion tokens
▪ trained using next word prediction
▪ Context window size was increased from 1024
for GPT-2 to 2048 tokens for GPT-3.
▪ Size of word embeddings was increased to
12288 for GPT-3 from 1600 for GPT-2.
▪ To train models of different sizes, the batch size
is increased according to number of
parameters, while the learning rate is
decreased accordingly.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
GPT-3

zero-shot task transfer OR


meta learning
▪ The model is supposed to
understand the task
based on the examples
and instruction.
▪ For English to French
translation task, the
model was given an
English sentence
followed by the word
French and a prompt (:)

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
GPT-3 Task Agnostic Model

▪ LIMITATIONS OF GPT-3
▪ GPT-3 can perform a wide variety of operations such as
compose prose, write code and fiction, business articles
▪ Does not have any internal representation of what these
words even mean. misses the semantically grounded
model of the topics on which it discusses.
▪ If the model is faced with data that is not in a similar
form, or is unavailable in the Internet corpus of
existing text that was used in the training phase,
then the quality of the generated language suffers.
▪ Expensive and complex inferencing due to hefty
architecture, less interpretability of the language, and
uncertainty around what helps the model achieve its
few-shot learning behavior.
▪ The text generated carries bias of the language it is
initially trained on.
▪ The articles, blogs, memos generated by GPT-3 may face
gender, ethnicity, race, or religious bias.
▪ the model is capable of producing high-quality text, but
sometimes loses coherence while generating
long sentences and thus may repeat sequences of text
again and again in a paragraph.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
BERT (Bidirectional Encoder Representations from Transformers) by google

▪ “BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train
deep bidirectional representations from unlabeled text by jointly conditioning on both left and right
context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer
to create state-of-the-art models for a wide range of NLP tasks.”
• Two variants
• BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
• BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters
▪ BERT is pre-trained on two NLP tasks:
• Masked Language Modeling
• replace 15% of the input sequence with [MASK] and model learns to detect the masked word
• Next Sentence Prediction
▪ two sentences A and B are separated with the special token [SEP] and are formed in such a
way that 50% of the time B is the actual next sentence and 50% of the time is a random
sentence.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
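A simplified Python sketch of the masked-language-modelling corruption described above. For clarity it replaces every selected token with [MASK]; the actual BERT recipe additionally replaces some selected tokens with random words or keeps them unchanged. Function and variable names are illustrative.

import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    # Each token is replaced by [MASK] with probability 15%; the model is
    # trained to predict the original token at the masked positions.
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)          # target to predict
        else:
            corrupted.append(tok)
            labels.append(None)         # position ignored by the loss
    return corrupted, labels

random.seed(0)
print(mask_tokens("the cat sat on the mat".split()))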
BERT (Bidirectional Encoder Representations from Transformers) by google

▪ every input embedding is a combination


of 3 embeddings:
▪ Position Embeddings: captures “sequence”
or “order” information
▪ Segment Embeddings: can also take
sentence pairs as inputs for tasks (Question-
Answering)
▪ Token Embeddings: learned for the specific
token from the WordPiece token
vocabulary
▪ Output is an embedding since it has
only encoder and no decoder

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Transformers in Computer Vision

DEtection TRansformer (DETR)

https://github.com/facebookresearch/detr
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Springer, Cham, 2020.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Transformers in Computer Vision

DEtection TRansformer (DETR)

https://github.com/facebookresearch/detr
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Springer, Cham, 2020.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Transformers in Computer Vision

DEtection TRansformer (DETR): Results on COCO 2017 dataset (AP = Average Precision)

https://github.com/facebookresearch/detr
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Springer, Cham, 2020.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Recursive Neural Networks

• Many real world entities have recursive structure.

Eg: A syntactic tree structure representing a sentence. Eg: A tree representation of different segments in an image

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Recursive Neural Networks: Introduction

• Can we learn a good representation for these recursive structures?

the country of my birth

the place where I was born

Eg: Word phrase representations (learning representations of phrases of arbitrary length)

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Recursive Neural Networks: Introduction

• Can we learn a good representation for these recursive structures?

• The meaning of a sentence is determined by meaning of its words and the rules that combine them.

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
Recursive Neural Networks

Source: Ian Goodfellow et al. “Deep Learning” MIT Press’15

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Recursive Neural Networks vs Recurrent Neural Networks

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
Recursive Neural Networks

• To build a recursive neural network, we need two things:

• A model that merges pairs of representations.

• A model that determines the tree structure.

A score to determine which


pairs of representations to
merge first

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
Recursive Neural Networks

• To build a recursive neural network, we need two things:

• A model that merges pairs of representations.

• A model that determines the tree structure.

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
Recursive Neural Networks

• To build a recursive neural network, we need two things:

• A model that merges pairs of representations.

• A model that determines the tree structure.

• Approximate the
best tree by locally
maximizing each
subtree

Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
Autoencoders

DSE 3151 DEEP LEARNING

Dr. Rohini Rao & Dr. Abhilash K Pai


Department of Data Science and Computer Applications
MIT Manipal

Credits: Most of the slides are adapted from CS7015 Deep Learning, Dept. of CSE, IIT Madras
Unsupervised Learning

Unsupervised leaning : only use the inputs X(t) for learning.

▪ Automatically extract meaningful features/representations for the data.

▪ Compress data while maintaining the structure and complexity of the original dataset
(dimensionality reduction).

▪ Leverage the availability of unlabeled data (which can be used for unsupervised pre-training).

▪ Some well-known neural networks for unsupervised learning are:

▪ Restricted Boltzmann Machines (RBM)

▪ Sparse Coding Model


▪ Autoencoders

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Autoencoders: Introduction

▪ Autoencoders are a special type of feed-forward neural network which is trained to reproduce its
input at the output layer.

Output Layer:
Reconstructed i/p

Hidden Layer (Bottleneck):


Latent Representation

Input Layer

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Autoencoders: Introduction

▪ It consists of an ENCODER that encodes its input X into a hidden representation h, and a
DECODER which decodes the input again from this hidden representation.

Output Layer:
Reconstructed i/p
Decoder

Hidden Layer (Bottleneck):


Latent Representation

Encoder
Input Layer

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Autoencoders: Introduction

Source: 85b - An introduction to autoencoders - in Python

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Autoencoders: Introduction

model.fit(x, x, epochs=100, shuffle=True)

Source: 85b - An introduction to autoencoders - in Python

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
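The model.fit(x, x, ...) call above suggests a Keras-style implementation; below is a minimal sketch of such an autoencoder, assuming TensorFlow/Keras is available. The layer sizes, activation choices and the random stand-in data are assumptions for illustration.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, code_dim = 784, 32            # e.g. flattened 28x28 images, 32-d bottleneck

inputs = keras.Input(shape=(input_dim,))
h = layers.Dense(code_dim, activation="sigmoid", name="bottleneck")(inputs)   # encoder
outputs = layers.Dense(input_dim, activation="sigmoid")(h)                    # decoder
autoencoder = keras.Model(inputs, outputs)

autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

x = np.random.rand(1000, input_dim)              # stand-in for real training data
autoencoder.fit(x, x, epochs=5, shuffle=True)    # the input is also the target

encoder = keras.Model(inputs, h)                 # reuse the bottleneck as a feature extractor
codes = encoder.predict(x)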
Autoencoders: Introduction

▪ The autoencoder model is trained to minimize a loss function which will ensure that the
reconstruction x̂ is close to the original input x.

Output Layer: Reconstructed i/p

Hidden Layer: Latent Representation

W, W*, b, c are the


parameters of the network Input Layer

• As the hidden layer could produce a reduced representation of the input, autoencoders (as the one
shown above) can be used for dimensionality reduction.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Autoencoders: Introduction

Output Layer: Reconstructed i/p

New feature representation for Cat Image

Input Layer

• Also, the latent representation can be used as a new feature representation of the input X.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Autoencoders: Applications
▪ Feature Learning and Dimensionality reduction

▪ Pretraining of Deep Neural Networks

▪ Denoising and Inpainting

▪ Generate Images

▪ Anomaly Detection

▪ Recommender Systems (e.g. matrix completion)

Sedhain et al, AutoRec: Autoencoders Meet Collaborative Filtering, WWW 2015

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Choice of Loss and Activation functions

• For binary inputs (each Xij ϵ {0,1}):

• The activation function “f” (o/p layer) is chosen as Sigmoid.

• The loss function L(W, W*, c, b) is the cross-entropy loss:
  min − Σ_i Σ_j [ x_ij log x̂_ij + (1 − x_ij) log(1 − x̂_ij) ]

• The activation function “g” is typically chosen as Sigmoid.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Choice of Loss and Activation functions

• For real valued inputs (each Xij ϵ R):

• The activation function “f” is chosen as Linear.

• The loss function L(W, W*, c, b) is the squared error loss:
  min Σ_i Σ_j ( x̂_ij − x_ij )²

• The activation function “g” is typically chosen as Sigmoid.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Autoencoders and PCA

• An autoencoder’s encoder part is equivalent to PCA if we:

• Use a Linear Encoder.

• Use a Linear Decoder.

• Use a Squared Error Loss function.

• Normalize the inputs to have zero mean (subtract the mean of each feature).

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Undercomplete and Overcomplete Autoencoders

Undercomplete AE Overcomplete AE
- Dimension of h is less than dimension x - Dimension of h is greater than dimension x

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Regularization in Autoencoders

• An overcomplete autoencoder would learn to copy the inputs xi
to the hidden representation hi and then copy hi to x̂i (i.e., learn an
identity function).

• Such an identity encoding is useless in practice as it does not


really tell us anything about the important characteristics of the
data. This is a case of overfitting!

• Overfitting (Poor generalization) could happen even in


undercomplete autoencoders.

• To avoid poor generalization, we need to introduce


regularization.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Regularization in Autoencoders

• The simplest solution is to add an L2 regularization term to the objective function:

• Another trick is to tie the weights of encoder and decoder (W* = WT). This effectively reduces the
number of parameters and acts as a regularizer.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
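A plain NumPy sketch of the two regularization tricks above: an L2 penalty on the encoder weights added to the squared reconstruction error, and tied weights where the decoder reuses W* = W^T. Names and sizes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_ae_loss(x, W, b, c, lam=1e-3):
    # Tied-weight autoencoder: the decoder reuses W* = W^T, and an L2
    # penalty on W is added to the squared reconstruction error.
    h = sigmoid(x @ W + b)          # encoder
    x_hat = h @ W.T + c             # decoder with tied weights (linear output)
    reconstruction = np.sum((x_hat - x) ** 2)
    l2_penalty = lam * np.sum(W ** 2)
    return reconstruction + l2_penalty

rng = np.random.default_rng(0)
x = rng.random((5, 20))
W = rng.normal(scale=0.1, size=(20, 8))
print(regularized_ae_loss(x, W, np.zeros(8), np.zeros(20)))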
Regularized Autoencoders

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Denoising Autoencoders

• A denoising autoencoder simply corrupts the input data using a probabilistic process
before feeding it to the network.

Corrupted inputs

• In other words, with probability q the input is flipped to 0 and with probability (1 - q) it is retained
as it is.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Denoising Autoencoders

• How does this help?


• This helps because the objective is still to reconstruct the
original (un-corrupted) xi :

• It no longer makes sense for the model learn to copy corrupted


input to h (the objective function will not be minimized by
doing so)

• Instead, the model will now have to capture the characteristics of the data correctly.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
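A tiny NumPy sketch of the corruption process: each input is flipped to 0 with probability q and retained with probability (1 − q), while the loss is still computed against the clean input. Names are illustrative.

import numpy as np

def corrupt(x, q=0.3, seed=0):
    # With probability q each (binary) input is flipped to 0,
    # with probability 1-q it is retained as it is.
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= q     # True -> keep the value
    return x * mask

x = np.ones((2, 10))
x_tilde = corrupt(x, q=0.3)
# The denoising autoencoder is trained to reconstruct the clean x from x_tilde:
#   loss = || decoder(encoder(x_tilde)) - x ||^2
print(x_tilde)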
Sparse Autoencoders

• A hidden neuron with sigmoid activation will have values between 0 and 1.

• We say that the neuron is activated when its output is close to 1 and not activated when its
output is close to 0.

• A sparse autoencoder tries to ensure the neuron is inactive most of the times.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Sparse Autoencoders

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
Sparse Autoencoders

• Now, the equation for the loss function will look like:

L’(ϴ) = L(ϴ) + Ω (ϴ)

• When will this term (Ω (ϴ)) reach its minimum value and
what is the minimum value?

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Sparse Autoencoders

• Now, the equation for the loss function will look like:

L’(ϴ) = L(ϴ) + Ω (ϴ) Sparsity constraint

• When will this term (Ω (ϴ)) reach its minimum value and
what is the minimum value?

Ω (ϴ) will reach its minimum value (which is 0)
when the average activation of each hidden neuron (ρ̂_l)
equals the desired sparsity parameter ρ.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
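A NumPy sketch of one common choice for the sparsity constraint Ω(ϴ): the KL divergence between the desired average activation ρ and the observed average activation ρ̂ of each hidden neuron. This particular form is an assumption consistent with the slide; the penalty is 0 exactly when ρ̂ = ρ.

import numpy as np

def sparsity_penalty(H, rho=0.05):
    # H   : (N, k) sigmoid activations of the hidden layer over a batch
    # rho : desired average activation (close to 0 -> mostly inactive neurons)
    rho_hat = H.mean(axis=0)                           # average activation per neuron
    rho_hat = np.clip(rho_hat, 1e-7, 1 - 1e-7)         # numerical safety
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

H = np.random.rand(100, 16)            # stand-in hidden activations
total_loss = 0.0                       # reconstruction loss L(theta) would go here
total_loss += sparsity_penalty(H)      # L'(theta) = L(theta) + Omega(theta)
print(total_loss)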
Contractive Autoencoders

• A contractive autoencoder also tries to prevent an overcomplete


autoencoder from learning the identity function.

• It does so by adding the following regularization term to the loss


function:

Frobenius Norm

Jacobian of the encoder

Variation in output of
2nd neuron in the
hidden layer with a
small variation in the
1st input.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Contractive Autoencoders

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
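A NumPy sketch of the contractive penalty for a sigmoid encoder h = sigmoid(xW + b): the squared Frobenius norm of the encoder's Jacobian has the closed form Σ_l (h_l(1 − h_l))² Σ_j W_jl². Variable names and sizes are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    # For a sigmoid unit, dh_l/dx_j = h_l(1 - h_l) W_jl, so
    # ||J||_F^2 = sum_l (h_l(1 - h_l))^2 * sum_j W_jl^2.
    h = sigmoid(x @ W + b)                                  # (k,)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=0))

rng = np.random.default_rng(0)
x, W, b = rng.random(20), rng.normal(scale=0.1, size=(20, 8)), np.zeros(8)
# total loss = reconstruction_loss + lambda * contractive_penalty(x, W, b)
print(contractive_penalty(x, W, b))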
Generative Models: Introduction

• An autoencoder contains an encoder which takes the


input 𝒙 and maps it to a hidden representation.

• The decoder then takes this hidden representation and
tries to reconstruct the input from it as x̂.

• The hidden layer produces a good abstraction of the


input, in the form of h.

• Now, once the autoencoder is trained can we remove


the encoder, feed a hidden representation h to the
decoder and decode an x̂ from it?

• In other words, can we do generation with


autoencoders?

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
Generative Models: Introduction
• In principle, yes! But in practice there is a problem with this
approach.

• h is a very high dimensional vector and only a few vectors in this


space would actually correspond to meaningful latent
representations of our input.

• So, of all the possible value of h which values should I feed to


the decoder?

• Ideally, we should feed only those values of h which are highly


• Remember classic autoencoders are deterministic likely.
i.e., for an X, the encoder will give the same
hidden representation every time. • In other words, we are interested in sampling from a distribution
over h’s (P(h|X)) so that we pick only those h’s which have high
probability.

• Let's understand this concept clearly with an example.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Generative Models: Introduction

Example: Movie Review


• Consider a movie critic who writes reviews for movies
M1: An unexpected and necessary masterpiece
M2: Delightfully merged information and comedy • For simplicity let us assume that he always writes reviews
M3: Director’s first true masterpiece containing a maximum of 5 words
M4: Sci- perfection, truly mesmerizing film
M5: Waste of time and money • Further, let us assume that there are a total of 50 words in
M6: Best Lame Historical Movie Ever his vocabulary

• Given many such reviews written by the • Each of the 5 words in his review can be treated as a
reviewer we could learn the joint random variable which takes one of the 50 values
probability distribution

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Generative Models: Introduction

Example: Movie Review


• We can think of a very simple factorization of this model
M1: An unexpected and necessary masterpiece by assuming that the ith word only depends on the previous
M2: Delightfully merged information and comedy 2 words and not anything before that:
M3: Director’s first true masterpiece
M4: Sci- perfection, truly mesmerizing film i=2:5

M5: Waste of time and money • Let us consider one such factor:
M6: Best Lame Historical Movie Ever P(Xi=time | Xi-2=waste, Xi-1=of)
• This could be estimated as:

count (waste of time)


count (waste of)
• And the two counts mentioned above can be computed by
going over all the reviews

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
Generative Models: Introduction

• What can we do with this joint distribution?

Joint distribution

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Generative Models: Introduction

• What can we do with this joint distribution?


M7: More realistic than real life

• Given a review, classify if this was written by the reviewer

• What else can we do?

• Generate new reviews which would look like reviews


written by this reviewer

• How to do this?

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Generative Models: Introduction

• How does the reviewer start his reviews (what is the first
word that he chooses)?

• We could take the word which has the highest probability


and put it as the first word in our review

• Having selected this what is the most likely second word


that the reviewer uses?

• Having selected first two words what is the most likely


third word that the reviewer uses?

• And so on …

We should instead sample from this • But, if we select the most likely word at each time step,
then it will give us the same review again and again
distribution!

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
Generative Models: Introduction
• Suppose there are 10 words in our vocabulary, and we have
computed the joint probability distribution over all the
random variables.

• Now, consider that we need to generate the 3rd word in the


review, given the first two words of the review.

• There are 10 possibilities :

• We can think of this as a


10-sided dice where each
side correspond to a word
and has a certain
Such models which tries to estimate the probability to show up.
probability P(X) from a large number of
samples are called as generative models • Roll this dice and pick the
side (word) which comes
up.

• Every run will now give a different review.


Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 45
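A toy Python sketch of the idea above: rather than always taking the argmax, sample the next word from the estimated conditional distribution so that every run can produce a different review. The probability table below is made up purely for illustration.

import random

# Toy conditional distribution P(X_t | X_(t-2), X_(t-1)) estimated from counts,
# e.g. P(word | "waste", "of"). The numbers are invented for this example.
next_word_probs = {
    ("waste", "of"): {"time": 0.6, "money": 0.3, "talent": 0.1},
}

def sample_next(prev2, prev1, table=next_word_probs):
    # Sampling (the "roll of the dice") instead of picking the most likely word,
    # which would generate the same review every time.
    dist = table[(prev2, prev1)]
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

random.seed(0)
print([sample_next("waste", "of") for _ in range(5)])   # different words can appear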
Generative Models: Introduction

What about images?

• For 32x32 images we want to learn: P(V1, V2, …… V1024) where Vi is a random variable corresponding to
each pixel, which could possibly have values from 0-255.

• We could factorize this joint distribution by assuming that each pixel is dependent on its neighboring pixels.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46
Generative Models: Introduction

Input Image Output Images

• Again, what can we do with joint distribution P(V1, V2, …… V1024) ?

• Apart from classifying and generating (as discussed for previous language modelling in prev. slides), we
can also correct noisy inputs (here, images) or help in completing incomplete images.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 47
Generative Models: The concept of latent variables

• There are certain underlying hidden (latent) characteristics


which are determining the pixels and their interactions.

• We could think of these as additional (latent) random variables


in our distribution

• And the interactions between the visible pixels are captured


through the latent variables.
• Latent variables : Blue sky,
Green grass, White clouds

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 48
Generative Models: Introduction

• Based on this notion, we would now like to learn the joint distribution
P(V,H) where V = {V1, V2, …, V#pixels} are the observed variables and H =
{H1, H2, …, H#latent features} are the hidden variables.

• Using this we can find:

• That is, given an image, we can find the most likely latent
configuration (H = h) that generated this image, where h captures a
latent (abstract) representation (imp. properties of the image)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 49
Generative Models: Introduction

• Under this abstraction, all these images would look very similar (i.e.,
they would have very similar latent configurations h)

• Even though in the original feature space (pixels) there is a significant


difference between these images, in the latent space they would be
very close to each other

• Once again, assume that we are able to learn the joint distribution
P(V,H)

• Using this distribution, we can find

• By using this, we can now say “Create an image which is cloudy,


has a beach and depicts daytime“

• In other words, I can now generate images given certain latent


variables

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 50
Variational Autoencoders (VAE)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 51
Variational Autoencoders (VAE)

• Encoder Goal: Learn a distribution over the latent variables (Q(z | X))

• Decoder Goal: Learn a distribution over the visible variables (P(X | z))

• The training data is mapped to a latent space using a Neural


network where the latent space posterior distribution (Q(z|X))
and prior distribution (Q(z)) are modelled as Gaussian and the
output of this network is two parameters – mean and covariance
(parameters of the posterior distribution)

• The latent space vector is mapped to input image using another


neural network.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 52
Variational Autoencoders: Encoder

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 53
Variational Autoencoders: Decoder

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 54
Variational Autoencoders

This term acts as a regularizer by forcing the encoder to


produce latent representations that look like samples
from a standard Gaussian distribution.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 55
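A NumPy sketch of the VAE training objective implied above: a reconstruction term plus the KL divergence that pushes the encoder's Gaussian q(z|x) towards the standard Gaussian prior, together with the usual reparameterization trick. All names and toy values are assumptions.

import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    # Reconstruction error plus KL( N(mu, sigma^2) || N(0, I) ), to be minimised.
    reconstruction = np.sum((x_hat - x) ** 2)
    kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
    return reconstruction + kl

def reparameterize(mu, log_var, seed=0):
    # Sample z = mu + sigma * epsilon so that gradients can flow through mu and sigma.
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = np.zeros(4), np.zeros(4)
z = reparameterize(mu, log_var)
print(vae_loss(np.ones(8), np.ones(8) * 0.9, mu, log_var))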
Variational Autoencoders

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 56
Markov Models

DSE 3151 DEEP LEARNING

B.Tech Data Science & Engineering


August 2023
Rohini R Rao & Abhilash Pai
Department of Data Science and Computer Applications
MIT Manipal
Slide -4 of 5
Markov models

▪ Weak Law of Large Numbers


▪ When you collect independent samples, as the number of samples gets bigger, the mean of those
samples converges to the true mean of the population
▪ Markov
▪ independence was not a necessary condition for the mean to converge
▪ average of the outcomes from a process involving dependent random variables could converge over
time.
▪ Markov Model:
▪ describe how random(stochastic) systems or processes evolve over time
▪ The system is modeled as a sequence of states
▪ where all states are observable
▪ The model encodes dependencies and reaches a steady-state over time,
i.e. it moves between states with specific probabilities
▪ Claude Shannon used Markov chains
▪ to model the English language as a sequence of letters that have a certain degree of randomness and
dependencies between each other

https://towardsdatascience.com/markov-models-and-markov-chains-explained-in-real-life-probabilistic-workout-routine-65e47b5c9a73

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Application of Markov Models

▪ Parking lots have a fixed number of spots available, but how many of these are available at any given
point in time can be described as a combination of multiple factors or variables:
▪ Day of the week,
▪ Time of the day,
▪ Parking fee,
▪ Proximity to transit,
▪ Proximity to businesses,
▪ Number of free parking spots in the vicinity,
▪ Number of available spots in the parking lot itself (a full-parking lot may deter some people to park
there)
▪ Some of these factors may be independent of each others are not
▪ For instance, Parking Fee typically depends on Time of day and Day of week.

https://towardsdatascience.com/markov-models-and-markov-chains-explained-in-real-life-probabilistic-workout-routine-65e47b5c9a73

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Applications of Markov Models

▪ Since Markov models describe the behavior over time, can be used to answer questions about the
future state of the system:
▪ How it evolves over time
▪ In what state is the system going to be in N time steps?
▪ Tracing possible sequences in the process:
▪ When the system goes from State A to State B in N steps, what is the likelihood that it follows a
specific path p?
▪ Parking Lot Example
▪ What is the occupancy rate of the parking lot 3 hours from now?
▪ How likely is the parking lot to be at 50% capacity and then at 25% capacity in 5 hours?
▪ Markov assumption
▪ It assumes the transition probability between each state only depends on the current state you are
in
▪ A Markov chain has short-term memory, it only remembers where you are now and where you
want to go next.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Markov Chains

▪ A Markov chain is a stochastic model describing a sequence of possible events in which the probability
of each event depends only on the state attained in the previous event.

▪ Ques 1: calculate the percentage of instances it is sunny on days directly following rainy days.

▪ Ques 2: calculate the percentage of instances it is rainy on days directly following sunny days.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Markov Chains

▪ Draw the Transition Matrix

▪ Draw the Markov Chain

▪ The previous 3 days are [rainy, sunny, rainy]. What’s the probability of rainy weather tomorrow?
▪ The previous 2 days are [rainy, rainy]. What’s the probability of rainy weather tomorrow?
▪ The previous 3 days are [sunny, rainy, sunny]. What’s the probability of rainy weather tomorrow?
▪ The best starting point is a simple model which performs better than a random guess (see the sketch below)
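A minimal sketch of how such questions can be answered in code, assuming an illustrative sunny/rainy day sequence rather than the data on the slide: the transition matrix is estimated by counting day-to-day transitions, and, by the Markov assumption, only the most recent day matters for tomorrow's prediction.

import numpy as np

# Illustrative day sequence; the actual data is on the slide, not here.
days = ["sunny", "sunny", "rainy", "rainy", "sunny", "rainy", "sunny", "sunny"]
idx = {"sunny": 0, "rainy": 1}

counts = np.zeros((2, 2))
for today, tomorrow in zip(days[:-1], days[1:]):
    counts[idx[today], idx[tomorrow]] += 1
P = counts / counts.sum(axis=1, keepdims=True)   # row i: P(tomorrow | today = state i)

# Markov property: for [rainy, sunny, rainy] only the last "rainy" matters.
print(P[idx["rainy"], idx["rainy"]])              # P(rain tomorrow | rain today)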

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Example : Modeling my Workout

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Example: Modeling my workout – Transitional Matrix

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Ex: WORKOUT – Markov Chain

▪ To understand how the workout routine evolves over time, take into account three components:
▪ Starting state
▪ End state
▪ Time-frame
▪ how long it will take the model to get from the
start to the end state.
▪ For example
▪ Ques 1 - If I start the workout with a run,
how likely am I to do push-ups on the second
set?
▪ Ques 2 - In 3-set workout, how likely am I to
do: 1) run, 2) push-ups and 3) pull-ups?
▪ Ques 3- What’s the likelihood of doing any of
the exercises on the fourth set?

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Ex: WORKOUT – Markov Chain

▪ Ques 1 - If I start the workout with a run, how likely am I to do push-ups on the second set?
▪ Start state is running 3 miles
▪ End state is doing a 20 push-up set.
▪ time-frame is two

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Ex: WORKOUT – Markov Chain

The second power of the transition matrix gives the state of the Markov chain at time-step 2.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Ex: WORKOUT – Markov Chain

Power of the transition matrix


▪ Future states are calculated using recursion
▪ Over a large enough number of iterations, all transition probabilities will
converge to a value and remain unchanged
▪ we can say the transition probabilities reached a steady-state
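A small sketch of this idea, assuming an illustrative 3-exercise transition matrix (run, push-ups, pull-ups) rather than the exact values shown on the slide:

import numpy as np

P = np.array([[0.1, 0.6, 0.3],      # assumed rows/columns: run, push-ups, pull-ups
              [0.5, 0.2, 0.3],
              [0.4, 0.4, 0.2]])

P2 = np.linalg.matrix_power(P, 2)   # state of the Markov chain at time-step 2
print(P2[0, 1])                     # e.g. P(push-ups on set 2 | started with a run)

Pn = np.linalg.matrix_power(P, 50)  # after enough iterations the rows stop changing
print(Pn)                           # every row approximates the steady-state distribution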

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Summary of Markov Model

▪ Stochastic Model
▪ a discrete-time process indexed at time 1,2,3,…that takes values called states which are observed
▪ Example states (S) ={hot , cold }
▪ State series over time => z∈ S_T
▪ Weather for 4 days can be a sequence => {z1=hot, z2 =cold, z3 =cold, z4 =hot}
▪ Markov model is engineered to handle data which can be represented as ‘sequence’ of observations
over time.
▪ Markov Assumptions
1. Limited Horizon assumption: The probability of being in a state at time t depends only on the state at
time (t-1).

▪ This means the state at time t is a sufficient summary of the past to reasonably predict the future.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Summary of Markov Model

▪ Stationary Process Assumption: Conditional (probability) distribution over the next state, given the
current state, doesn't change over time.

find out the probability of sequence — >


{z1 = s_hot , z2 = s_cold , z3 = s_rain , z4 = s_rain , z5 = s_cold}
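A minimal sketch of this computation, assuming illustrative values for the initial distribution and transition matrix (hot = 0, cold = 1, rain = 2); the probability of the sequence is the initial probability times the chained transition probabilities:

import numpy as np

pi0 = np.array([0.5, 0.3, 0.2])        # assumed initial state distribution
A = np.array([[0.6, 0.3, 0.1],         # assumed transition matrix A[i, j] = P(s_j | s_i)
              [0.3, 0.4, 0.3],
              [0.2, 0.4, 0.4]])

seq = [0, 1, 2, 2, 1]                  # {hot, cold, rain, rain, cold}
prob = pi0[seq[0]]                     # P(z1)
for prev, nxt in zip(seq[:-1], seq[1:]):
    prob *= A[prev, nxt]               # multiply the chained transition probabilities
print(prob)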
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Summary of Markov Model

Two Main Questions in Markov-model


1. What is the probability of a particular sequence of states z?
2. How do we estimate the parameters of the state transition matrix A to maximize the likelihood of the
observed sequence?

https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
Hidden Markov Model

▪ HMM
▪ is a probabilistic model to infer unobserved
information from observed data
▪ We cannot observe the states themselves, but only
the result of some probability function
(the observations) of the states.
▪ HMM is a statistical Markov model in
which the system being modeled is
assumed to be a Markov process with
unobserved (hidden) states.
▪ Markov Model: Series of (hidden) states
z={z_1,z_2………….} drawn from state alphabet
S ={s_1,s_2,…….𝑠_|𝑆|} where z_i belongs to S.
▪ Hidden Markov Model: Series of observed
output x = {x_1, x_2, …} drawn from an
output alphabet V = {𝑣1, 𝑣2, …, 𝑣_|𝑣|} where
x_i belongs to V
▪ Example:
Set of states (S) = {Happy, Grumpy}
Set of hidden states (Q) = {Sunny, Rainy}
State series over time = z ∈ S_T
Observed states for 4 days = {z1=Happy, z2=Grumpy, z3=Grumpy, z4=Happy}

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Assumptions of HMM

Output independence assumption:


▪ Output observation is conditionally
independent of all other hidden states
and all other observations when given
the current hidden state.

• Emission Probability Matrix:


• Probability of hidden state generating
output v_i given that state at the
corresponding time was s_j.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Three important questions in HMM are

1. What is the probability of an observed sequence?


2. What is the most likely series of states to generate an observed sequence?
3. How can we learn the values for the HMM's parameters A and B given some data?

Probability of Observed Sequence


1. Forward Procedure
Calculate the total probability of all the observations (from t_1 ) up to time t.
𝛼_𝑖 (𝑡) = 𝑃(𝑥_1 , 𝑥_2 , … , 𝑥_𝑡, 𝑧_𝑡 = 𝑠_𝑖; 𝐴, 𝐵)
2. Backward Procedure
Similarly, calculate the probability of all the observations from time t+1 up to the final time T, given the state at time t.
𝛽_i(t) = P(x_{t+1}, x_{t+2}, …, x_T | z_t = s_i; A, B)
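A minimal sketch of the forward procedure in code, assuming illustrative values for the ice-cream HMM used in the next slides (hot/cold hidden states, 1/2/3 ice creams observed); the actual probabilities in the worked example may differ:

import numpy as np

pi0 = np.array([0.6, 0.4])                # assumed initial distribution over {hot, cold}
A = np.array([[0.7, 0.3],                 # assumed transitions A[i, j] = P(s_j | s_i)
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],            # assumed emissions B[i, k] = P(observe v_(k+1) | s_i)
              [0.5, 0.4, 0.1]])

def forward(obs):
    """alpha[t, i] = P(x_1..x_t, z_t = s_i); returns alpha and P(observed sequence)."""
    T, N = len(obs), len(pi0)
    alpha = np.zeros((T, N))
    alpha[0] = pi0 * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()

obs = [1, 2, 0, 1]                        # x = {v2, v3, v1, v2} as 0-based indices
alpha, prob = forward(obs)
print(prob)                               # total probability of the observed sequence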

https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
HMM – Forward Procedure

▪ S = {hot,cold}
▪ v = {v1=1 ice cream ,v2=2 ice
cream,v3=3 ice cream}
▪ where V is the Number of ice creams
consumed on a day.
▪ Example Sequence =
{x1=v2,x2=v3,x3=v1,x4=v2}

https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
HMM – Forward Procedure

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
HMM- Forward Procedure

▪ For first observed output x1=v2


Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
HMM-Forward Procedure

▪ 2. for observed output x2=v3

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
HMM-Forward Procedure

▪ for observed output x3 and x4


▪ Similarly for x3=v1 and x4=v2, we have to multiply the paths that lead to v1 and v2.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Hidden Markov Model Algorithm

▪ Step 1: Define the state space and observation space


▪ The state space is the set of all possible hidden states, and the
observation space is the set of all possible observations

▪ Step 2:Define the initial state distribution

▪ Step 3:Define the state transition probabilities

▪ Step 4:Define the observation likelihoods:


▪ Emission matrix - Probabilities of generating each observation from
each hidden state
▪ https://www.geeksforgeeks.org/hidden-markov-model-in-machine-learning/
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
Hidden Markov Model Algorithm

▪ Step 5:Train the model


▪ Baum-Welch algorithm, or the forward-backward algorithm used to
estimate parameters of the state transition probabilities and the
observation likelihoods
▪ Iteratively updating the parameters until convergence.
▪ Step 6:Decode the most likely sequence of hidden states
▪ Given the observed data, the Viterbi algorithm is used to compute the most
likely sequence of hidden states
▪ This can be used to predict future observations, classify sequences, or
detect patterns in sequential data.
▪ Step 7:Evaluate the model
▪ using various metrics, such as accuracy, precision, recall, or F1 score.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
Baum-Welch Algorithm
▪ Also known as the forward-backward algorithm
▪ is a dynamic programming approach and a special case of the expectation-maximization algorithm
▪ its purpose is to tune the parameters of the HMM so that the model is maximally likely given the observed data:
▪ the state transition matrix A
▪ the emission matrix B
▪ the initial state distribution π₀
▪ Includes the
1. Initial phase
2. Forward phase
3. Backward phase
4. Update phase
▪ The forward and the backward phase form the E-step of the EM algorithm
▪ they compute the expected hidden states given the observed data and the current (not yet tuned) set of
parameter matrices
▪ the update phase itself is the M-step
▪ update formulas then tune the parameter matrices to best fit the observed data and the expected
hidden states

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
Baum-Welch Algorithm

1. Initial phase
▪ parameter matrices A, B, π₀ are initialized
▪ Can be done randomly if there is no prior knowledge about them.
2. Forward phase

▪ α is the joint probability of the observed data up to time k and the state at time k

https://medium.com/analytics-vidhya/baum-welch-algorithm-for-training-a-hidden-markov-model-part-2-of-the-hmm-series-d0e393b4fb86

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Baum-Welch Algorithm

3. Backward Phase

▪ β function is defined as the conditional probability of the observed data from time k+1 given the state
at time k
▪ The second term of the R.H.S. is the state transition probability from A, while the last term is the
emission probability from B.
▪ The R.H.S. is summed over all possible states at time k +1.

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
Baum-Welch Algorithm

4. Update phase
Probability distribution of a state at time k given all observed data we have

Joint probability of two consecutive states given the data

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Baum-Welch Algorithm

4. Update phase

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
HMM – Viterbi Algorithm

To find the best set of states

https://medium.com/analytics-vidhya/baum-welch-algorithm-for-training-a-hidden-markov-model-part-2-of-the-hmm-series-d0e393b4fb86

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
HMM- Maximum Likelihood

▪ Maximum likelihood assignment


▪ For a given observed sequence of outputs x ∈ V_T, we intend to find the most likely series of states
z ∈ S_T. We can understand this with the example found below and the following sketch.
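A minimal Viterbi sketch (not the exact slide example), reusing the same kind of pi0, A and B matrices as in the forward-procedure sketch and an observation sequence of 0-based indices:

import numpy as np

def viterbi(obs, pi0, A, B):
    """Return the most likely hidden-state path and its probability."""
    T, N = len(obs), len(pi0)
    delta = np.zeros((T, N))               # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)      # back-pointers to the previous state
    delta[0] = pi0 * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A          # scores[i, j]: arrive in j from i
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))          # follow the back-pointers
    return list(reversed(path)), delta[-1].max()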

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
HMM – Viterbi Algorithm

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Automatic Speech Recognition

• ACOUSTIC MODEL
• Uses audio recordings of speech
and compiles them into statistical
representations of the sounds that
make up words.
• Language model
• which gives the probabilities
of sequences of words.
• Lexicon
• set of words with their
pronunciations broken down
into phonemes

https://jonathan-hui.medium.com/speech-recognition-gmm-hmm-8bb5eff8b196

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Building blocks of ASR

https://jonathan-hui.medium.com/speech-recognition-gmm-hmm-8bb5eff8b196

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Realization of ASR

Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Optimizers
&
Practical Methodology
DSE 3151, B.Tech Data Science & Engineering
Rohini R Rao & Abhilash Pai
Department of Data Science and Computer Applications
MIT Manipal
Backpropagation in ANN : Recap
Gradient Descent Algorithm
Backpropagation in ANN : Recap
Gradient descent: Error surface
Backpropagation in ANN : Recap
Gradient descent: Error surface
Backpropagation in ANN : Recap
Gradient descent and its variants
Vanilla (Batch) Gradient Descent

What is the issue here?


Think of how the algorithm will work for a large dataset (e.g., ImageNet with billions of data points)
Backpropagation in ANN : Recap
Gradient descent and its variants
Stochastic Gradient Descent
Mini Batch Gradient Descent Deep Learning
Optimizer
• Batch gradient descent:
• the gradient is the average of the gradients computed from ALL the samples in the dataset
• Mini Batch GD:
• a subset of the dataset is used for calculating the loss function, therefore fewer
iterations are needed (see the sketch below).
• a batch size of 32 is considered to be appropriate for almost every case.
• Yann LeCun (2018) – “Friends don’t let friends use mini batches larger than 32”
• is faster, more efficient and robust than the earlier variants of gradient
descent.
• the cost function is noisier than in batch GD but smoother than in SGD.
• Provides a good balance between speed and accuracy.
• It needs a hyperparameter, the “mini-batch size”, which needs to be
tuned to achieve the required accuracy.
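A minimal mini-batch gradient descent sketch on synthetic linear-regression data; the dataset, learning rate and batch size of 32 are illustrative choices, not values from the slides:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)

w = np.zeros(5)
lr, batch_size, epochs = 0.1, 32, 20
for epoch in range(epochs):
    idx = rng.permutation(len(X))                      # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2 / len(batch) * Xb.T @ (Xb @ w - yb)   # MSE gradient on the mini-batch only
        w -= lr * grad                                 # parameter update per mini-batch
print(w)                                               # close to the true coefficients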

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 7


Backpropagation in ANN : Recap
Gradient descent and its variants
Mini-Batch Stochastic Gradient Descent
Backpropagation in ANN : Recap
Gradient descent and its variants
Backpropagation in ANN : Recap
Gradient descent: Behaviour for different initializations

Slow in flat regions and fast in steep slopes


Backpropagation in ANN : Recap
Contour Plot: a better way of visualizing error surface in 2D
Backpropagation in ANN : Recap
Contour Plot vs 3D surface plot
Backpropagation in ANN : Recap
Contour Plot vs 3D surface plot
Backpropagation in ANN : Recap
Gradient descent: Problems
Plateaus and Flat Regions
Backpropagation in ANN : Recap
Gradient descent and its variants
Momentum-based Gradient Descent
Range : [0,1] (typically closer to 1)

Accumulated history of
weight updates

Update = An exponentially weighted average of gradients (more weightage to recent updates and less weightage to old updates)
Backpropagation in ANN : Recap
Gradient descent and its variants
Momentum-based Gradient Descent
Backpropagation in ANN : Recap
Gradient descent and its variants
Momentum-based Gradient Descent
Stochastic GD & SGD with Momentum Deep Learning Optimizers
• stochastic refers to the randomness that the
algorithm is based upon.
• Instead of taking the whole dataset for each
iteration, we randomly select batches of data
• The path taken is full of noise as compared to
the gradient descent algorithm.
• Uses a higher number of iterations to reach
the local minimum, so the overall
computation time increases.
• The computation cost is still less than that of
the gradient descent optimizer.
• If the data is enormous and computational
time is an essential factor, SGD should be
preferred over batch gradient descent
algorithm.
• Stochastic Gradient Descent with
Momentum Deep Learning Optimizer
• momentum helps in faster
convergence of the loss function.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 18


SGD with Momentum Optimizer
• GD takes small, regular steps down
the slope, so the algorithm takes more
time to reach the bottom
• adding a fraction of the previous
update to the current update will
make the process a bit faster (see the sketch below).
• Hyperparameter β - Momentum
• set between 0 (high friction) and 1 (low friction)
• Typically 0.9
• used to simulate a friction mechanism
and prevent the momentum from
becoming too large
• Also rolls past local minima
• learning rate should be decreased
with a high momentum term.
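A minimal sketch of the momentum update on a toy quadratic loss, following the Geron-style formulation cited below; β = 0.9 and the learning rate are illustrative values:

def grad(w):                       # gradient of the toy loss f(w) = w**2
    return 2 * w

w, m = 5.0, 0.0                    # start far from the minimum at 0
beta, lr = 0.9, 0.1                # beta close to 1 means low friction
for step in range(100):
    m = beta * m - lr * grad(w)    # exponentially weighted history of updates
    w = w + m                      # apply the momentum vector
print(w)                           # close to the minimum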

Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 19
SGD with Nesterov Momentum Optimization
• Proposed by Yurii Nesterov in 1983
• to measure the gradient of the cost
function not at the local position but
slightly ahead in the direction of the
momentum
• the momentum vector will be
pointing in the right direction (i.e.,
toward the optimum)
• it will be slightly more accurate to
use the gradient measured a bit
farther in that direction rather than
using the gradient at the original
position
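A one-line sketch of how Nesterov momentum is typically enabled in Keras; the learning-rate value is illustrative, the argument is named lr in older Keras versions, and the model is assumed to exist:

from tensorflow import keras

optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
# model.compile(loss="mse", optimizer=optimizer)   # model assumed to be built elsewhere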

Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 20
Adagrad (Adaptive Gradient Descent) Deep
Learning Optimizer
• Adaptive Learning Rate
• Scaling down the gradient vector along the
steepest dimension
• If the cost function is steep along the ith
dimension, then s will get larger and larger at each
iteration
• No need to modify the learning rate manually
• more reliable than gradient descent algorithms,
and it reaches convergence at a higher speed.

• Disadvantage
• it decreases the learning rate aggressively and
monotonically.
• Due to small learning rates, the model eventually
becomes unable to acquire more knowledge, and
hence the accuracy of the model is compromised.
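A minimal sketch of the Adagrad per-dimension update on a toy loss with one flat and one steep direction (all values illustrative):

import numpy as np

def grad(w):                               # gradient of f(w) = w[0]**2 + 10 * w[1]**2
    return np.array([2 * w[0], 20 * w[1]])

w = np.array([5.0, 5.0])
s = np.zeros(2)                            # accumulated squared gradients per dimension
lr, eps = 0.5, 1e-8
for step in range(200):
    g = grad(w)
    s += g ** 2                            # history only ever grows
    w -= lr * g / (np.sqrt(s) + eps)       # steep dimensions are scaled down more
print(w)                                   # updates shrink as s grows (aggressive decay)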

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 21


RMS Prop(Root Mean Square) Deep Learning
Optimizer
• The problem with the gradients is that some are small while others may be huge
• Defining a single learning rate might not be the best idea.
• accumulating only the gradients from the most recent iterations (as
opposed to all the gradients since the beginning of training).

• β= 0.9

• Works better than Adagrad; it was popular until Adam was introduced
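The same toy problem with the RMSProp update and β = 0.9 (learning rate illustrative); because only recent gradients are accumulated, the effective step size does not decay to zero:

import numpy as np

def grad(w):                               # gradient of f(w) = w[0]**2 + 10 * w[1]**2
    return np.array([2 * w[0], 20 * w[1]])

w = np.array([5.0, 5.0])
s = np.zeros(2)
lr, beta, eps = 0.05, 0.9, 1e-8
for step in range(500):
    g = grad(w)
    s = beta * s + (1 - beta) * g ** 2     # exponentially decaying average of g**2
    w -= lr * g / (np.sqrt(s) + eps)       # adaptive per-dimension step
print(w)                                   # ends up near the minimum at (0, 0)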

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 22


Adam Deep Learning Optimizer
• is derived from adaptive moment estimation.
• inherits the features of both the Adagrad and
RMSProp algorithms.
• like Momentum optimization keeps track of
an exponentially decaying average of past
gradients,
• like RMSProp it keeps track of an
exponentially decaying average of past
squared gradients
• β1 and β2 represent the decay rate of the
average of the gradients.
• Typical values: β1 = 0.9, β2 = 0.999, smoothing term ϵ = 10⁻⁷ (t represents the iteration)
• Advantages
• Is straightforward to implement
• faster running time
• low memory requirements, and requires less tuning
• Disadvantages
• Focuses on faster computation time, whereas
SGD focuses on the data points.
• Therefore SGD generalizes the data in a better
manner at the cost of low computation speed.
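A sketch of creating the Adam optimizer in Keras with the decay rates listed above; the learning rate is an illustrative choice and the model is assumed to exist:

from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9,
                                  beta_2=0.999, epsilon=1e-7)
# model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)  # model assumed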

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 23


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 24


Curve Fitting – True Function is Sinusoidal

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 25


Bias

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 26


Variance

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 27


Mean Square Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 28


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 29


Train vs Test Error

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 30


Learning Curves
• Line plot of learning (y-axis) over experience (x-axis)
• The metric used to evaluate learning could be
• Optimization Learning Curves:
• calculated on the metric by which the parameters of the model are being optimized, e.g. loss.
• Minimizing, such as loss or error
• Performance Learning Curves:
• calculated on the metric by which the model will be evaluated and selected, e.g. accuracy.
• Maximizing metric , such as classification accuracy
• Train Learning Curve:
• calculated from the training dataset that gives an idea of how well the model is learning.
• Validation Learning Curve:
• calculated from a hold-out validation dataset that gives an idea of how well the model is
generalizing.
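A minimal sketch for plotting these curves from the History object returned by model.fit(..., validation_data=...); a model compiled with metrics=["accuracy"] is assumed:

import matplotlib.pyplot as plt

def plot_learning_curves(history):
    # one optimization curve (loss) and one performance curve (accuracy),
    # each with its train and validation versions
    for metric in ["loss", "accuracy"]:
        plt.figure()
        plt.plot(history.history[metric], label="train " + metric)
        plt.plot(history.history["val_" + metric], label="validation " + metric)
        plt.xlabel("epoch (experience)")
        plt.ylabel(metric)
        plt.legend()
    plt.show()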

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 31


Underfit Learning Curves

A plot of learning curves shows underfitting if:


•The training loss remains flat regardless of training, or
•The training loss continues to decrease until the end of training (i.e., the model could still have learned more when training stopped).
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 32
Overfitting Curves
• Overfitting
• Model specialized on training data, it is not
able to generalize to new data
• Results in increase in generalization error.
• generalization error can be measured by
the performance of the model on the
validation dataset.
• A plot of learning curves shows
overfitting if:
• The plot of training loss continues to
decrease with experience.
• The plot of validation loss decreases to a
point and begins increasing again.
• The inflection point in validation loss may
be the point at which training could be
halted as experience after that point shows
the dynamics of overfitting.
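A sketch of halting training near that inflection point with the Keras EarlyStopping callback; the patience value is illustrative, and the model and data are assumed to exist:

from tensorflow import keras

early_stopping = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                               restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, callbacks=[early_stopping])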

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 33


Good Fit Learning Curves
• A good fit is the goal of the learning
algorithm and exists between an overfit
and underfit model.
• A plot of learning curves shows a good fit
if:
• Plot of training loss decreases to a point of
stability
• Plot of validation loss decreases to a point
of stability and has a small gap with the
training loss.
• Loss of the model will almost always be
lower on the training than the validation
dataset.
• We should expect some gap between the
train and validation loss learning curves.
• This gap is referred to as the
“generalization gap.”
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 34
Avoiding Overfitting


Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 35


Avoiding overfitting
• Dropout
• Proposed by Geoffrey Hinton in 2012
• at every training step, every neuron has a
probability p of being temporarily
“dropped out,”
• it will be entirely ignored during this
training step, but it may be active during
the next step
• p is called the dropout rate, usually 50%.
• After training, neurons don’t get dropped
• If p = 50%
• during testing a neuron will be connected
to twice as many input neurons as it was
(on average) during training.
• multiply each input connection weight by
the keep probability (1 – p) after training
• Alternatively, divide each neuron’s output
by the keep probability during training
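A minimal sketch of dropout in Keras (layer sizes and the 28×28 input shape are illustrative); Keras's Dropout layer is only active during training and handles the rescaling internally, so no manual weight scaling is needed:

from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=[28, 28]),
    keras.layers.Flatten(),
    keras.layers.Dropout(rate=0.5),                  # p = 50% dropout rate
    keras.layers.Dense(300, activation="relu"),
    keras.layers.Dropout(rate=0.5),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dropout(rate=0.5),
    keras.layers.Dense(10, activation="softmax"),
])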

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 36


Overfitting using Regularization
• L1, l2 regularization
• Regularization can be used to constrain the NN weights
• Lasso – (l1) least absolute shrinkage and selection operator
• adds the “absolute value of magnitude” of the coefficient as a penalty term to the loss function.
• Ridge regression (l2)
• adds the “squared magnitude” of the coefficient as the penalty term to the loss function.
• Use l1(), l2(), l1_l2() function
• which returns a regularizer that will compute the regularization loss at each step during training
• Regularization loss is added to final loss
• Max-Norm Normalization
• It constrains the weights w of the incoming connections such that ∥ w ∥2 ≤ r

• r is the max-norm hyperparameter and ∥ · ∥2 is the ℓ2 norm


• Reducing r increases the amount of regularization and helps reduce overfitting
• Can also help alleviate the unstable gradients problem if we are not using Batch normalization
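A sketch of attaching an ℓ2 regularizer and a max-norm constraint to Dense layers in Keras; the factor 0.01 and r = 1.0 are illustrative values:

from tensorflow import keras

dense_l2 = keras.layers.Dense(100, activation="relu",
                              kernel_regularizer=keras.regularizers.l2(0.01))
dense_maxnorm = keras.layers.Dense(100, activation="relu",
                                   kernel_constraint=keras.constraints.max_norm(1.0))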

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 37


Constant learning rate not ideal
• Better to start with a high learning
rate
• then reduce it once it stops
making fast progress
• can reach a good solution faster
• Learning Schedule strategies can
be applied
• Power Scheduling
• Exponential Scheduling
• Piecewise Constant Scheduling
• Performance Scheduling

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 38


Learning Rate Scheduling
• Power scheduling
• Learning rate set to a function of the iteration number ‘t’
• η(t) = η0 / (1 + t/s)^c
• The initial learning rate η0, the power c (typically set to 1), and the steps s are hyperparameters
• The learning rate drops at each step, and after s steps it is down to η0 / 2 and so on
• schedule first drops quickly, then more and more slowly
• optimizer = keras.optimizers.SGD(lr=0.01, decay=1e-4)
• The decay is the inverse of s (the number of steps it takes to divide the learning rate by one more unit),
• Keras assumes that c is equal to 1.
• Exponential scheduling
• Set the learning rate to: η(t) = η0 · 0.1^(t/s)
• learning rate will gradually drop by a factor of 10 every s steps.
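A sketch of exponential scheduling with a Keras callback; η0 = 0.01 and s = 20 are illustrative, and the model and data are assumed to exist:

from tensorflow import keras

def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)      # eta(t) = eta0 * 0.1**(t / s)

lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# model.fit(X_train, y_train, epochs=50, callbacks=[lr_scheduler])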

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 39


Learning Rate Scheduling
• Piecewise constant scheduling
• Constant learning rate for a number of epochs
• e.g., η0 = 0.1 for 5 epochs
• then a smaller learning rate for another number of epochs
• e.g., η1 = 0.001 for 50 epochs and so on
• Performance scheduling
• Measure the validation error every N steps (just like for early stopping)
• reduce the learning rate by a factor of λ when the error stops dropping
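A sketch of performance scheduling with the Keras ReduceLROnPlateau callback; the factor and patience values are illustrative, and the model and validation data are assumed to exist:

from tensorflow import keras

lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.5, patience=5)
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[lr_scheduler])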

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 40


Backpropagation in ANN : Recap
How about optimizing GD by varying learning rate ?
• If the learning rate is too small, it takes a long time to converge.
If the learning rate is too large, the gradients explode.

Some techniques for choosing learning rate:

• Line search
Backpropagation in ANN : Recap
GD with adaptive learning rate
• Motivation: Can we have a different learning rate for each parameter which takes care of
the frequency of features ?

• Intuition: Decay the learning rate for parameters in proportion to their update history.
• For sparse features, accumulated update history is small
• For dense features, accumulated update history is large

Make the learning rate inversely proportional to the update history, i.e., if the feature has
been updated fewer times, give it a larger learning rate and vice versa

Let’s consider an example …..


Backpropagation in ANN : Recap
GD with adaptive learning rate
Practical Methodology
• ML practitioner needs to know
• how to choose an algorithm for a
particular application
• how to monitor and respond to
feedback obtained from
experiments
• in order to improve a machine
learning system

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 44


Practical Design Process
• PHASE 1
✓Define the problem statement.
✓Meta data of Data set
• In real world problems
• If training data is limited, determine
the value of reducing error against
the cost of collecting more data
• In scientific problems
• Benchmarked data sets to be used
✓Exploratory Analysis
✓Preprocessing pipeline specific to data
✓Define Project Objectives

Step 1 - Define Project Objectives
• Deep or not?
• User needs to define metric-based goals

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 45


Step 1 - Problem Statement – Deep or Not?

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 46


Step – 1 Performance Metrics
• Performance Metrics
• Bayes error defines the minimum achievable error rate, even if there is infinite training data and one can
recover the true probability distribution
• For example
• Precision is the fraction of detections reported by the model that were correct
• Recall is the fraction of true events that were detected.
• Visualization such as PR curve
• ML model can quantify how confident it should be about a decision
• if a wrong decision can be harmful , a human operator can occasionally take over
• Coverage
• is the fraction of examples for which the ML system is able to produce a response
• For example
• Street View task, the goal for the project was to reach human-level transcription accuracy
while maintaining 95% coverage.
• Human-level performance on this task is 98% accuracy.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 47


Practical Design Process
Can begin with simple statistical model
like
PHASE 2 .1 ➢ logistic regression
✓ Literature review to identify models ➢ Decision Tree
to be implemented ➢ Shallow Networks
✓ Pros, Cons of each model ➢ SVM
✓ Shortlist models for implementation
✓ Define baseline model

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 48


Step 2 - Default Baseline Models
Fully Connected Baseline

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 49


Step 2 - Default Baseline Models
Convolutional Network Baseline

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 50


Step 2 - Default Baseline Models
Recurrent Network Baseline

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 51


Practical Design Process
• PHASE 2.2
✓Define working end-to-end pipeline
✓Determine your goals—what error metric to use, and target value.
✓Diagnose performance and optimization curves
• Measure train and test error
• Overfitting vs Underfitting
• Based on findings, decide whether to:
• gather new data?
• adjust hyperparameters? (learning rate, number of layers etc.)
• Hyperparameter tuning can be Manual or Automatic
• or change architecture?
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 52
Step 3- Data Driven Refinement
• Manual Hyperparameter Tuning
• Need to understand the relationship
between hyperparameters, training error,
generalization error and computational
resources (memory & runtime)
• Goal of manual Hyperparameter tuning
• to find the lowest generalization error
subject to some runtime and memory
budget
• to adjust the effective capacity of the model
to match the complexity of the task
• Effective capacity is constrained by
✓ the representational capacity of the model
✓ ability of the learning algorithm to successfully
minimize the cost function used to train the
model
✓ the degree to which the cost function and
training procedure regularize the model.

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 53


Practical Design Process
• PHASE 3
➢Deployment in app/cloud
➢Tabulation and visualization of
results in terms of performance
and accuracy, ROC/PRC etc.
➢Result analysis
• Comment on accuracy,
performance
• Reasoning about hyperparameters
➢Conclusion

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 54


References
• Ian Goodfellow, Yoshua Bengio and Aaron Courville, “Deep Learning”, MIT Press, 2016
• NPTEL notes from CS6910 Deep Learning, Mitesh Khapra
• Coursera notes from Neural Networks and Deep Learning, Andrew Ng
• Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow”, O'Reilly Publications
• Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, 3rd Edition, Morgan Kaufmann

Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 55


Generative Adversarial Networks

DSE 3151 DEEP LEARNING

Rohini Rao & Abhilash K Pai


Department of Data Science and Computer Applications
MIT Manipal

Credits: Most of the slides are adapted from CS7015 Deep Learning, Dept. of CSE, IIT Madras
Generative Models : Overview

• Most of the generative models perform maximum


likelihood estimation (MLE), where we have a density
Maximum Likelihood Estimation function (pmodel(x)) that the model describes (x is a
vector representing the input)

• Specifically, (pmodel(x| ϴ)) is a distribution controlled by


the parameters ϴ that describes exactly where the data
concentrates and where it is spread more sparsely.

• MLE aims to measure the log probability that the


above density function assigns to all the training data
points and adjust the parameters ϴ to increase this
probability.
Ian Goodfellow, NeurIPS’16

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Taxonomy of Generative Models

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Generative Adversarial Networks: Overview

• What can we use for such a complex


transformation?
• A Neural Network

• How do you train such a neural network?


• Using a two-player game.

• The basic idea of GANs is similar to Variational autoencoders (VAEs), where we sample from a simple tractable distribution and then
learn a complex transformation on it so that the o/p looks as if it came from the training distribution.

• However, GANs start with a d-dimensional vector and generate an n-dimensional vector (usually, d<n), as compared to VAEs which
start from an n-dimensional i/p and produce an o/p of the same dimension.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Generative Adversarial Networks: Overview

GANs consist of two networks that engage in an adversarial game:


1. Generator: A neural network that takes as input random noise and transforms it into a sample
from the model distribution.

2. Discriminator: A neural network that distinguishes between output data point (Fake) from the
Generator and training data samples (Real)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Generative Adversarial Networks: Overview

GANs consist of two networks that engage in an adversarial game:


1. Generator: A neural network that takes as input random noise and transforms it into a sample
from the model distribution.

2. Discriminator: A neural network that distinguishes between output data point (Fake) from the
Generator and training data samples (Real)

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Generative Adversarial Networks: Overview

GANs consist of two networks that engage in an adversarial game:


1. Generator: A neural network that takes as input random noise and transforms it into a sample
from the model distribution.

2. Discriminator: A neural network that distinguishes between output data point (Fake) from the
Generator and training data samples (Real)

The job of the discriminator is to get


better and better at distinguishing
between true images and generated
(fake) images
The job of the generator is to produce
images which look so natural that the
discriminator thinks that the images
came from the real data distribution.

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Generative Adversarial Networks: Overview

The generator mostly starts by generating a noisy image (as it takes random latent vectors as inputs).

A Friendly Introduction to Generative Adversarial Networks (GANs) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Generative Adversarial Networks: Overview

The generator output keeps improving during training.

A Friendly Introduction to Generative Adversarial Networks (GANs) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Generative Adversarial Networks: Overview
Equilibrium is reached when the generator finally succeeds to fool the discriminator.

A Friendly Introduction to Generative Adversarial Networks (GANs) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Objective function of Generator

• Let GФ be the generator and Dθ be the discriminator (Ф and
θ are the parameters of G and D, respectively), with Dθ(GФ(z)) ∈ [0, 1]

• We have a neural network-based generator which takes as
input a noise vector z ∼ N(0, I) and produces GФ(z) = X̂

• We have a neural network-based discriminator which could
take as input a real X or a generated X̂ = GФ(z) and classify
the input as real/fake

• Given an image generated by the generator as GФ(z), the
discriminator assigns a score Dθ(GФ(z)) to it

• This score will be between 0 and 1 and will tell us the


probability of the image being real or fake

• For a given z, the generator would want to maximize log


Dθ(GФ(z)) (log likelihood) or minimize log(1 - Dθ(GФ(z)))

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Objective function of Generator

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Objective function of Discriminator

• The task of the discriminator is to assign a high score to real


images and a low score to fake images

• And it should do this for all possible real images and all
possible fake images

• In other words, it should try to maximize the following


objective function:

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Generative Adversarial Networks: MiniMax formulation

Objective function:

min_{ФG} max_{θD} [ E_{x ∼ p_data} log Dθ(x) + E_{z ∼ p(z)} log(1 − Dθ(GФ(z))) ]

• The discriminator maximizes the objective function w.r.t. the discriminator network parameters θD
• The generator minimizes the objective function w.r.t. the generator network parameters ФG
• The discriminator wants to maximize the second term whereas the generator wants to minimize it
(hence it is a two-player game)

Generative Adversarial Networks (GAN) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
Training the Discriminator Network
• Before training (when discriminator is not performing optimally- cannot clearly distinguish real and fake data)

Real (1)  Fake (0)

Decision
boundary
Generative Adversarial Networks (GAN) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Training the Discriminator Network
• After training (when discriminator is performing optimally – clearly distinguishes real and fake data)

Decision
boundary

Generative Adversarial Networks (GAN) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Training the Generator Network
• Before training

The generator tries to maximize the error in D,
i.e., to make D(G(z)) come out as 1 (real)
instead of 0 (fake)

Generative Adversarial Networks (GAN) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
Training the Generator Network
• After training

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Training the Generator Network

Figure: data distribution, model distribution and discriminator response, shown for a poorly fit model, after updating D, after updating G, and at the mixed strategy equilibrium.

Generative Adversarial Networks (GAN) - YouTube

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
Training the GANs
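A minimal, self-contained GAN training sketch on toy 2-D Gaussian data (not the DCGAN of the later slides); the architectures, codings_size and hyperparameters are illustrative, and the generator step uses the "maximize log D(G(z))" form of the objective discussed earlier:

import numpy as np
import tensorflow as tf
from tensorflow import keras

codings_size = 8
generator = keras.Sequential([
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(2),                        # fake samples in data space
])
discriminator = keras.Sequential([
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # D(x): probability of "real"
])
bce = keras.losses.BinaryCrossentropy()
g_opt = keras.optimizers.Adam(1e-3)
d_opt = keras.optimizers.Adam(1e-3)

real_data = np.random.normal(loc=[2.0, -1.0], scale=0.5, size=(1000, 2)).astype("float32")
dataset = tf.data.Dataset.from_tensor_slices(real_data).shuffle(1000).batch(32)

for epoch in range(5):
    for x_real in dataset:
        batch = tf.shape(x_real)[0]
        z = tf.random.normal([batch, codings_size])
        # 1) Discriminator step: push real samples towards 1 and fakes towards 0
        with tf.GradientTape() as tape:
            x_fake = generator(z)
            d_loss = bce(tf.ones([batch, 1]), discriminator(x_real)) + \
                     bce(tf.zeros([batch, 1]), discriminator(x_fake))
        grads = tape.gradient(d_loss, discriminator.trainable_variables)
        d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
        # 2) Generator step: try to get fakes classified as real
        z = tf.random.normal([batch, codings_size])
        with tf.GradientTape() as tape:
            g_loss = bce(tf.ones([batch, 1]), discriminator(generator(z)))
        grads = tape.gradient(g_loss, generator.trainable_variables)
        g_opt.apply_gradients(zip(grads, generator.trainable_variables))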

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Deep Convolutional GANs

Generator

Radford et al. Unsupervised Representational Learning with Deep Convolutional GANs, ICLR 2016

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Deep Convolutional GANs

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
GANs: Applications

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
GANs: Applications

Philip et al. Image-to-Image Translation with Conditional Adversarial Networks, CVPR 2017

Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
