Shallow vs Deep NNs – DSE 3151 Deep Learning
(Source: softwaretestinghelp.com)
Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 35
Activation Functions – Sigmoid vs Tanh
• Sigmoid
• σ(x) = 1/(1 + e^(−x)), where e ≈ 2.718 is the base of the natural logarithm
• Output lies in the range (0, 1), so sigmoid is the right choice when predicting a probability
• It is differentiable, so we can find the slope of the sigmoid curve at any point
• The function is monotonic, but the function's derivative is not
• Saturated outputs give near-zero gradients, which can cause a neural network to get stuck during training
• Tanh
• Range is (−1, 1)
• Zero-centred output (mean ≈ 0), which helps centre the data
• Is also sigmoidal (s-shaped)
• Advantage: negative inputs are mapped strongly negative
• Disadvantage: if x is very small or very large, the slope (gradient) becomes nearly 0, which slows down gradient descent
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 36
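A minimal NumPy sketch (illustrative, not from the slides) comparing the two activations and their saturating gradients:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))                       # values in (0, 1)
print(np.tanh(x))                       # values in (-1, 1), zero-centred
# Gradients: both saturate for large |x|, which slows gradient descent
print(sigmoid(x) * (1 - sigmoid(x)))    # sigmoid'(x)
print(1 - np.tanh(x) ** 2)              # tanh'(x)
```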
Activation Functions – Sigmoid vs ReLU
(Figure: example output activations for regression vs. classification.)
Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 46
Regression Loss Functions
• Mean Squared Error (MSE)
• Predictions with a large error are penalized heavily, because the error is squared
• Is a convex function with a clearly defined global minimum
• Can be used in gradient descent optimization to set the weight values
• Very sensitive to outliers, which will significantly increase the loss
• Mean Absolute Error (MAE)
• Used in cases where the training data has a large number of outliers
• As the average error approaches 0 the gradient does not shrink, so gradient descent optimization can oscillate around the minimum instead of settling
• Huber Loss
• Based on the absolute difference between the actual and predicted value and a threshold value, 𝛿
• Is quadratic when the error is smaller than 𝛿 but linear when the error is larger than 𝛿
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 47
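A short NumPy sketch (illustrative) of the three regression losses described above; `delta` plays the role of the Huber threshold 𝛿:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def huber(y_true, y_pred, delta=1.0):
    err = y_true - y_pred
    quadratic = 0.5 * err ** 2                      # used when |err| <= delta
    linear = delta * (np.abs(err) - 0.5 * delta)    # used when |err| > delta
    return np.mean(np.where(np.abs(err) <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 10.0])   # the last value acts like an outlier
y_pred = np.array([1.1, 1.9, 3.2, 4.0])
print(mse(y_true, y_pred), mae(y_true, y_pred), huber(y_true, y_pred))
```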
Classification Loss Functions
Cross-entropy measures the difference between two probability distributions
• Binary Cross-Entropy/Log Loss
• Compares the actual value (0 or 1) with the predicted probability that the input belongs to that category
• p(i) = predicted probability that the category is 1
• 1 − p(i) = predicted probability that the category is 0
• Categorical Cross-Entropy Loss
• Used in cases where the number of classes is greater than two
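A NumPy sketch (illustrative) of both losses; the clipping constant `eps` is only there to avoid log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, probs, eps=1e-12):
    probs = np.clip(probs, eps, 1 - eps)
    return -np.mean(np.sum(y_true_onehot * np.log(probs), axis=1))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.4])
print(binary_cross_entropy(y, p))

y_onehot = np.array([[1, 0, 0], [0, 0, 1]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y_onehot, probs))
```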
Keras Metrics
• Accuracy metrics
• Accuracy
• Calculates how often predictions equal labels.
• Binary Accuracy
• Calculates how often predictions match binary labels.
• Categorical Accuracy
• Calculates how often predictions match one-hot labels
• Sparse Categorical Accuracy
• Calculates how often predictions match integer labels.
• TopK Categorical Accuracy
• Calculates the percentage of records for which the target is in the top K predictions
• Predictions are ranked in descending order of probability value; if the rank of the true class within the predictions is less than or equal to K, the prediction is considered accurate
• Sparse TopK Categorical Accuracy class
• Computes how often integer targets are in the top K predictions.
Aurélien Géron, "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow", O'Reilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 65
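A hedged usage sketch of two of the tf.keras metric classes listed above (names as in the TensorFlow Keras API; the values are toy examples):

```python
import tensorflow as tf

# Sparse categorical accuracy: integer labels vs. predicted probability vectors
m = tf.keras.metrics.SparseCategoricalAccuracy()
m.update_state([2, 1], [[0.1, 0.2, 0.7], [0.05, 0.9, 0.05]])
print(m.result().numpy())   # 1.0 -> both predictions match the integer labels

# Top-K categorical accuracy: correct if the true class is among the K largest scores
topk = tf.keras.metrics.TopKCategoricalAccuracy(k=2)
topk.update_state([[0, 0, 1], [0, 1, 0]],
                  [[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]])
print(topk.result().numpy())  # 0.5 -> only the second sample has its label in the top 2
```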
SGD with Nesterov Momentum Optimization
• Proposed by Yurii Nesterov in 1983
• The idea is to measure the gradient of the cost
function not at the local position but
slightly ahead in the direction of the
momentum
• the momentum vector will be
pointing in the right direction (i.e.,
toward the optimum)
• it will be slightly more accurate to
use the gradient measured a bit
farther in that direction rather than
using the gradient at the original
position
Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 66
Adagrad (Adaptive Gradient Descent) Deep
Learning Optimizer
• Adaptive Learning Rate
• Scales down the gradient vector along the steepest dimensions
• If the cost function is steep along the i-th dimension, the accumulated squared gradient s grows larger at each iteration, so the effective learning rate along that dimension shrinks
• No need to modify the learning rate manually
• More reliable than plain gradient descent and it reaches the vicinity of convergence at a higher speed
• Disadvantage
• It decreases the learning rate aggressively and monotonically
• Due to the ever-smaller learning rates, the model eventually becomes unable to acquire more knowledge, and hence the accuracy of the model is compromised
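A hedged NumPy sketch of the Adagrad accumulation and update; `grad_fn` is a placeholder for the gradient of the cost function:

```python
import numpy as np

def adagrad_step(theta, s, grad_fn, lr=0.01, eps=1e-8):
    """One Adagrad step with per-dimension learning rates.

    s accumulates the squared gradients, so dimensions with steep,
    persistent gradients get an ever smaller effective step size.
    """
    g = grad_fn(theta)
    s = s + g ** 2                            # grows monotonically
    theta = theta - lr * g / (np.sqrt(s) + eps)
    return theta, s
```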
▪ Example: Consider a discrete signal x_t which represents the position of a spaceship at time t, recorded by a laser sensor.
▪ Considering that the most recent measurements are more important, we would like to take a weighted average over the past values of x. The new estimate at time t is the convolution of the input x with a filter/mask/kernel w:

      s_t = Σ_{a=0}^{∞} x_{t−a} · w_{−a} = (x ∗ w)(t)

▪ Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
The Convolution Operation - 1D
▪ In practice, we would sum only over a small window.
▪ We just slide the filter over the input and compute the value of s_t based on a window around x_t.
▪ Example input:  x = 1.0, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20
▪ Sliding the filter produces one output per window position: s = 1.80, 1.96, …
▪ Use cases of 1-D convolution: audio signal processing, stock market analysis, time series analysis etc.
Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3-5
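A small NumPy sketch of this sliding-window 1-D convolution; the three-tap averaging filter below is an assumption for illustration (the slide does not give the exact filter weights):

```python
import numpy as np

x = np.array([1.0, 1.10, 1.20, 1.40, 1.70, 1.80, 1.90, 2.10, 2.20])
w = np.array([1/3, 1/3, 1/3])          # assumed 3-tap averaging filter

# Slide the filter over the input: one output per window position
s = np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])
print(s)   # weighted average of each window

# np.convolve with the reversed filter gives the same "valid" result
print(np.convolve(x, w[::-1], mode="valid"))
```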
Convolution in 2-D using Images : What is an Image?
What we see
▪ An image can be represented mathematically as a function f(x,y) which gives the intensity value at position (x,y), where f(x,y) ∈ {0,1,…,Imax−1} and x,y ∈ {0,1,…,N−1}.
▪ The larger the value of N, the greater the clarity of the picture (higher resolution), but the more data to be analyzed in the image.
▪ If the image is a grayscale (8-bit per pixel) image, it requires N² bytes of storage.
▪ If the image is colour (RGB), each pixel requires 3 bytes of storage space.
N is the resolution of the image and Imax is the number of discretized brightness levels.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Convolution in 2-D using Images : What is an Image?
(Figure: a grayscale image shown as a 14 × 12 grid of pixel intensity values in the range 0–255.)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
The Convolution Operation - 2D
▪ Images are good examples of 2-D inputs.
▪ A 2-D convolution of an image I using a filter K of size m x n is now defined as (looking at previous pixels):

      S(i, j) = (I ∗ K)(i, j) = Σ_{a=0}^{m−1} Σ_{b=0}^{n−1} I(i − a, j − b) · K(a, b)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
The Convolution Operation - 2D
▪ Another way is to consider the center pixel as the reference pixel and then look at its surrounding pixels:

      S(i, j) = Σ_{a=−m/2}^{m/2} Σ_{b=−n/2}^{n/2} I(i − a, j − b) · K(a, b)

(Figure: a 5 × 5 binary input grid with the center element marked as the "pixel of interest".)
Content adapted from : CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
The Convolution Operation - 2D
(Figures: a filter sliding over an input image, followed by the effect of a smoothing filter and a sharpening filter on the same image.)
Source: https://developers.google.com/
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14-19
The Convolution Operation – 2D : Various filters (edge detection)
Prewitt:   Sx = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]      Sy = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
Sobel:     Sx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]      Sy = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]
Laplacian: [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
Roberts:   Sx = [[0, 1], [-1, 0]]                         Sy = [[1, 0], [0, -1]]
(Figures: the input image and the results after applying the horizontal and vertical edge detection filters.)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
The Convolution Operation - 2D
Filter 1 (3 x 3):
  1 -1 -1
 -1  1 -1
 -1 -1  1

Input image (6 x 6), stride = 1:
  1 0 0 0 0 1
  0 1 0 0 1 0
  0 0 1 1 0 0
  1 0 0 0 1 0
  0 1 0 0 1 0
  0 0 1 0 1 0

The dot product of the filter with the first two 3 x 3 windows gives 3 and -1.
Note: Stride is the number of "units" the kernel is shifted per slide over rows/columns.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
The Convolution Operation - 2D
With the same Filter 1 and 6 x 6 input image but stride = 2, the first two outputs along the top row are 3 and -3.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
The Convolution Operation - 2D
Convolving the 6 x 6 input image with Filter 1 at stride = 1 produces a 4 x 4 feature map:
   3  -1  -3  -1
  -3   1   0  -3
  -3  -3   0   1
   3  -2  -2  -1
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
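A NumPy sketch (valid cross-correlation, i.e. the dot-product operation used in the slides) that reproduces the 4 x 4 feature map above:

```python
import numpy as np

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])

filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

def conv2d(x, k, stride=1):
    f = k.shape[0]
    out = (x.shape[0] - f) // stride + 1
    return np.array([[np.sum(x[i*stride:i*stride+f, j*stride:j*stride+f] * k)
                      for j in range(out)] for i in range(out)])

print(conv2d(image, filter1, stride=1))   # 4 x 4 feature map, top-left value 3
print(conv2d(image, filter1, stride=2))   # 2 x 2 feature map when stride = 2
```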
The Convolution Operation - 2D
Filter 2 (3 x 3), stride = 1:
 -1  1 -1
 -1  1 -1
 -1  1 -1
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
The Convolution Operation –RGB Images
R G B
Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
The Convolution Operation –RGB Images multiple filters
• A filter applied to an RGB image has the same depth as the image (3 channels), so Filter 1, Filter 2, …, Filter K are each 3 x 3 x 3 in this example.
• Applying K filters to the input produces K feature maps.
• Depth of the feature map = number of feature maps = number of filters (K).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
The Convolution Operation : Terminologies
(Figure: a 6 x 6 multi-channel input image convolved with a 3 x 3 x 3 filter.)
1. Depth of an Input Image = No. of channels in the Input Image = Depth of a filter
2. Assuming square filters, Spatial Extent (F) of a filter is the size of the filter
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
The Convolution Operation : Zero Padding
(Figure: a 3x3 convolution applied to a 4x4 input produces only a 2x2 output; zero padding the input lets the output keep the original spatial size.)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
The Convolution Operation : Zero Padding
Source: Intuitively Understanding Convolutions for Deep Learning | by Irhum Shafkat | Towards Data Science
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Convolutional Neural Network (CNN) : At a glance
Input image → [Convolution → Pooling] (can repeat many times) → Fully Connected Feedforward network → output (cat | dog)

Filter 1:            Filter 2:
  1 -1 -1             -1  1 -1
 -1  1 -1             -1  1 -1
 -1 -1  1             -1  1 -1

The two filters give two 4 x 4 feature maps, which are then pooled (Max Pooling or Average Pooling):
   3  -1  -3  -1        -1  -1  -1  -1
  -3   1   0  -3        -1  -1  -2   1
  -3  -3   0   1        -1  -1  -2   1
   3  -2  -2  -1        -1   0  -4   3
Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
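A short NumPy sketch of 2 x 2 max pooling (stride 2) applied to the first feature map above:

```python
import numpy as np

fmap = np.array([[ 3, -1, -3, -1],
                 [-3,  1,  0, -3],
                 [-3, -3,  0,  1],
                 [ 3, -2, -2, -1]])

def max_pool(x, size=2, stride=2):
    out = (x.shape[0] - size) // stride + 1
    return np.array([[x[i*stride:i*stride+size, j*stride:j*stride+size].max()
                      for j in range(out)] for i in range(out)])

print(max_pool(fmap))   # [[3, 0], [3, 1]] -- keeps the maximum of each 2x2 window
```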
Pooling
Stride ?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Why Pooling ?
(Figure: subsampling an image of a bird still shows a bird, so pooling reduces the size while preserving the relevant information.)
Output size of a convolutional layer (input W1 x H1 x D1, K filters of spatial extent F, stride S, zero padding P):

      W2 = (W1 − F + 2P)/S + 1
      H2 = (H1 − F + 2P)/S + 1
      D2 = K
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
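A small helper (sketch) that evaluates the formula above:

```python
def conv_output_shape(w1, h1, f, p, s, k):
    """Spatial output size of a conv layer: W2 x H2 x D2."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    return w2, h2, k   # depth of the output equals the number of filters K

print(conv_output_shape(6, 6, 3, 0, 1, 1))    # (4, 4, 1): the 6x6 example above
print(conv_output_shape(32, 32, 5, 0, 1, 6))  # (28, 28, 6): LeNet-5 first conv layer
```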
Important properties of CNN
▪ Sparse Connectivity
▪ Shared weights
▪ Equivariant representation
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Properties of CNN
With a 3 x 3 filter (Filter 1) sliding over the 6 x 6 image, each output value is connected to only 9 input pixels rather than to all 36.
Fewer parameters! Each output only connects to 9 inputs, not fully connected (Sparse Connectivity).
Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Properties of CNN
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Properties of CNN
In addition, the same 9 filter weights are reused at every position of the 6 x 6 image, rather than each output having its own set of weights.
Even fewer parameters! Shared weights.
Source: CS 898: Deep Learning and Its Applications, University of Waterloo, Canada.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Equivariance to translation
▪ A function f is equivariant to a function g if f(g(x)) = g(f(x)), i.e. the output changes in the same way as the input.
▪ Because the same weights are shared across the image, an object will be detected irrespective of its position in the image: shifting the object simply shifts its response in the feature map.
Source: Translational Invariance Vs Translational Equivariance | by Divyanshu Mishra | Towards Data Science
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
CNN vs Fully Connected NN
▪ Shared weights
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Convolutional Neural Network (CNN) : Non-linearity with activation
cat | dog
Convolution +
ReLU
Pooling
Fully Connected
Feedforward network
Convolution+
ReLu
Pooling
Layer-wise parameter counts (LeNet-5):
• C1 convolution: ((5*5*1)+1) * 6 = 156
• S2 pooling: 0
• C3 convolution: ((5*5*6)+1) * 16 = 2416
• S4 pooling: 0
• C5: (5*5*16)*120 + 120 = 48120
• F6 fully connected: 84*120 + 84 = 10164
• Output layer: 84*10 + 10 = 850
Activations: tanh in the earlier layers, sigmoid at the output.
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (1998). Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11), 2278–2324.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
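A hedged Keras sketch of LeNet-5 (average pooling and a 32 x 32 grayscale input assumed, as in the original paper); `model.summary()` reproduces the per-layer parameter counts listed above:

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Conv2D(6, 5, activation="tanh", input_shape=(32, 32, 1)),  # 156 params
    layers.AveragePooling2D(2),                                       # 0 params
    layers.Conv2D(16, 5, activation="tanh"),                          # 2,416 params
    layers.AveragePooling2D(2),                                       # 0 params
    layers.Conv2D(120, 5, activation="tanh"),                         # 48,120 params
    layers.Flatten(),
    layers.Dense(84, activation="tanh"),                              # 10,164 params
    layers.Dense(10, activation="sigmoid"),                           # 850 params
])
model.summary()
```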
LeNet-5 Architecture for handwritten number recognition
Source: http://yann.lecun.com/
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
ImageNet Dataset
ZFNet
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 45
AlexNet (2012)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46
AlexNet Architecture
Layer-wise parameter counts (convolutional layers):
• Conv1: ((11*11*3)+1) * 96 = 34944
• Max pooling: 0
• Conv2: ((5*5*96)+1) * 256 = 614656
• Max pooling: 0
• Conv3: ((3*3*256)+1) * 384 = 885120
• Conv4: ((3*3*384)+1) * 384 = 1327488
• Conv5: ((3*3*384)+1) * 256 = 884992
• Max pooling: 0
Total #Param. (including the fully connected layers) ≈ 62M
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012).
Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 47
ZFNet Architecture (2013)
Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks.
In European conference on computer vision (pp. 818-833). Springer, Cham.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 48
ZFNet
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 49
ZFNet
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 50
VGGNet Architecture (2014)
• This work reinforced the notion that convolutional neural networks need a deep stack of layers in order for this hierarchical representation of visual data to work
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition , International Conference on Learning Representations (ICLR14)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 51
GoogleNet Architecture (2014)
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., ... & Rabinovich, A. (2015). Going deeper with convolutions.
In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR’15)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 52
ResNet Architecture (2015)
Effect of increasing layers of shallow CNN when experimented over the CIFAR dataset
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 56
ResNet Architecture (2015)
ResNet-34
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 58
Sequence Modeling using RNN
▪ Music Generation
▪ Sentiment Classification
▪ Machine Translation
▪ Speech Recognition
▪ Named Entity Recognition, e.g. “Alice wants to discuss about Deep Learning with Bob”
• In feedforward and convolutional neural networks, the size of the input was always fixed.
• In many applications with sequence data, the input is not of a fixed size.
• Further, each input to the ANN/CNN was independent of the previous or future inputs; with sequence data, successive inputs may not be independent of each other.
• The model needs to look at a sequence of inputs and produce an output (or outputs).
• For this purpose, let us consider each input to correspond to one time step.
Example tasks over sequences: Auto-complete, Part-of-Speech tagging, Movie Review (sentiment), Action Recognition.
• Next, build a network for each time step/input, where each network performs the same task (e.g. auto-complete: input = character, output = character).
• Make sure that the function executed at each time step is the same, because at each time step we are doing the same task.
Introduction
• If the input sequence is of length n, we would create n networks, one per input, as seen previously.
• At time step i = 0 there are no previous inputs, so they are typically assumed to be all zeros.
• Since the output s_i at time step i is a function of all the inputs from previous time steps, we could say it has a form of memory.
• A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell).
• Unrolling the network through time = representing the network against the time axis; unrolled, it is the same representation as seen previously.
• At each time step t (also called a frame) the RNN receives the input x_t as well as the output from the previous step, y_(t−1).
Seq-to-Seq
Vector-to-Seq
Seq-to-Vector
Recurrent Neural Networks (RNN) : Example
(Figures: a worked example stepping an RNN through a short sequence of temperature values.)
Unrolling the feedback loop by making a copy of the NN for each input value.
Source: https://www.youtube.com/c/joshstarmer
Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29-38
Backpropagation in ANN : Recap
Dependency graph:  w_k, a_(k−1), b_k → z_k → a_k → C_0

where  z_k = w_k · a_(k−1) + b_k,   a_k = σ(z_k),   C_0 = (a_k − y)²

Aim is to compute ∂C_0/∂w_k. By the chain rule along the dependency graph:

      ∂C_0/∂w_k = (∂z_k/∂w_k) · (∂a_k/∂z_k) · (∂C_0/∂a_k) = a_(k−1) · σ′(z_k) · 2(a_k − y)
(Figure: a matrix of predicted probabilities for a sequence of outputs, with the probability assigned to the correct class at each step, e.g. 0.7 vs. 0.1.)
1) What is the total loss made by the model?
Ans: the sum of the individual losses at each time step.
Ordered network
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
A gist of the exploding gradient: if a factor greater than 1 (here, 2) multiplies the gradient at every time step, the gradient grows exponentially; the same reasoning gives a vanishing gradient when the factor is less than 1.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
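A tiny NumPy illustration (not from the slides) of why repeatedly multiplying by a factor above or below 1 makes gradients explode or vanish across time steps:

```python
import numpy as np

T = 50                       # number of unrolled time steps
for w in (2.0, 0.5):         # repeated factor in the backpropagated product
    grad = np.prod(np.full(T, w))
    print(f"factor {w}: gradient after {T} steps = {grad:.3e}")
# factor 2.0 -> ~1.1e+15 (explodes), factor 0.5 -> ~8.9e-16 (vanishes)
```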
LSTM and GRU : Introduction
The whiteboard analogy:
• Consider a scenario where we have to evaluate the expression “ac(bd+a) + ad” on a whiteboard, given that a = 1, b = 3, c = 5, d = 11.
• Normally, the evaluation on the whiteboard would look like:
ac = 5
bd = 33
bd + a = 34
ac(bd + a) = 170
ad = 11
ac(bd + a) + ad = 181
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
• Now, if the whiteboard has space to accommodate only 3 steps, the above evaluation cannot fit in the required space and would lead to loss of information.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Selectively write
Selectively forget
• How do we combine s_(t−1) and the newly computed candidate state to get the new state?
• But we may not want to use the whole of s_(t−1); we may want to forget some parts of it.
• To do this, a forget gate is introduced:
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
LSTM (Long Short-Term Memory)
Long-term memory
Short-term memory
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
LSTM (Long Short-Term Memory)
• LSTM has many variants which include different number of gates and also different arrangement of gates.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
LSTM Cell
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
LSTM Cell
Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
LSTM computations
The four layers/gates compute (⊗ denotes element-wise multiplication):

      i(t) = σ(Wxi x(t) + Whi h(t−1) + bi)
      f(t) = σ(Wxf x(t) + Whf h(t−1) + bf)
      o(t) = σ(Wxo x(t) + Who h(t−1) + bo)
      g(t) = tanh(Wxg x(t) + Whg h(t−1) + bg)
      c(t) = f(t) ⊗ c(t−1) + i(t) ⊗ g(t)
      h(t) = y(t) = o(t) ⊗ tanh(c(t))

Wxi, Wxf, Wxo, Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).
Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t−1).
bi, bf, bo, and bg are the bias terms for each of the four layers.
Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
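A hedged NumPy sketch of a single LSTM cell step implementing the equations above (the shapes and random initialization are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_x, W_h, b):
    """One LSTM time step.

    W_x: dict of input weights (i, f, o, g), W_h: dict of recurrent weights,
    b: dict of biases. h is the short-term state, c is the long-term state.
    """
    i = sigmoid(W_x["i"] @ x_t + W_h["i"] @ h_prev + b["i"])
    f = sigmoid(W_x["f"] @ x_t + W_h["f"] @ h_prev + b["f"])
    o = sigmoid(W_x["o"] @ x_t + W_h["o"] @ h_prev + b["o"])
    g = np.tanh(W_x["g"] @ x_t + W_h["g"] @ h_prev + b["g"])
    c = f * c_prev + i * g          # update the long-term memory
    h = o * np.tanh(c)              # new short-term state (also the output y_t)
    return h, c

# Toy usage: input size 4, hidden size 3
rng = np.random.default_rng(0)
W_x = {k: rng.normal(size=(3, 4)) for k in "ifog"}
W_h = {k: rng.normal(size=(3, 3)) for k in "ifog"}
b   = {k: np.zeros(3) for k in "ifog"}
h, c = lstm_step(rng.normal(size=4), np.zeros(3), np.zeros(3), W_x, W_h, b)
print(h, c)
```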
Gated Recurrent Unit (GRU)
Gates: States:
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Gated Recurrent Unit (GRU)
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Gates: States:
No explicit forget gate (the forget gate and input gates are tied)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Gated Recurrent Unit (GRU)
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Gates: States:
The gates depend directly on st-1 and not the intermediate ht-1 as in the case of LSTMs
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
Gated Recurrent Unit CELL (Kyunghyun Cho et al, 2014)
The main simplifications of LSTM are:
• Both state vectors (short and long term) are
merged into a single vector h(t).
Source: Aurélien Géron. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems. " O'Reilly Media, Inc.", 2019.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
LSTM vs GRU computation
• They prevent any irrelevant information from being written to the state.
• It is easy to see that during backward pass the gradients will get multiplied by the gate.
• If the state at time t-1 did not contribute much to the state at time t then during backpropagation the gradients
flowing into st-1 will vanish
• The key difference from vanilla RNNs is that the flow of information and gradients is controlled by the gates which
ensure that the gradients vanish only when they should.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Different RNNs
Vanilla RNNs
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Deep RNNs
Source: Deep Recurrent Neural Networks — Dive into Deep Learning 1.0.0-alpha1.post0 documentation (d2l.ai)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Bi-Directional RNNs: Intuition
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Bi-Directional RNNs: Intuition
• The o/p at the third time step (where input is the string “apple”) depends on only previous two i/ps
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
Bi-Directional RNNs
• Adding an additional backward layer with connections as shown above makes the o/p at a time step depend on both
previous as well as future i/ps.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Bi-directional RNNs
▪ I am ___ hungry.
Source: Bidirectional Recurrent Neural Networks — Dive into Deep Learning 1.0.0-alpha1.post0 documentation (d2l.ai)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
Bi-directional RNN computation
where,
A = activation function,
W = weight matrix
b = bias
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
Generating Shakespearan Text using a Character RNN
▪ Stateless RNN
▪ At each training iteration the model starts with a hidden state full of 0s
▪ Updates this state at each time step
▪ Discards the final state when moving on to the next training batch
▪ Stateful RNN
▪ Uses sequential, non-overlapping input sequences
▪ Preserves the final state after processing one training batch and uses it as the initial state for the next training batch
▪ The model can thus learn long-term patterns despite only backpropagating through short sequences
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
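A hedged Keras sketch of a stateful character RNN along the lines of Géron's example (layer size, batch size and vocabulary size `n_chars` are illustrative assumptions; uses the Keras 2-style `batch_input_shape` argument):

```python
import tensorflow as tf

batch_size, n_chars = 32, 39   # assumed values for illustration

model = tf.keras.Sequential([
    # stateful=True: the final state of each batch becomes the initial
    # state of the next batch, so sequences must be fed in order.
    tf.keras.layers.GRU(128, return_sequences=True, stateful=True,
                        batch_input_shape=[batch_size, None, n_chars]),
    tf.keras.layers.Dense(n_chars, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.reset_states() would be called at the end of each epoch (sketch).
```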
Encoder Decoder Models
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Encoder Decoder Models : Introduction
▪ Language Modeling:
Given the first t−1 words, predict the t-th word
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3-10
Applications of Encoder Decoder Models: Image Captioning
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Applications of Encoder Decoder Models: Textual Entailment
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Applications of Encoder Decoder Models: Textual Entailment
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Applications of Encoder Decoder Models: Transliteration
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Applications of Encoder Decoder Models: Image Question Answering
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
Beam Search
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
Attention Mechanism (Bahdanau et al , 2014)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 45
Attention Mechanism (Bahdanau et al , 2014)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46
Attention Mechanism: Introduction
• In a typical encoder decoder network, each time step of the decoder uses only the information obtained from the last time step of the encoder.
• While predicting each word in the output we would like our model to mimic humans and focus on specific words in the input.
• Essentially, at each time step, a distribution over the input words must be introduced.
• This distribution tells the model how much attention to pay to each input word at each time step.
Source: CS7015 Deep Learning, Dept. of CSE, IIT Madras
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 54
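A hedged NumPy sketch of additive (Bahdanau-style) attention: a score is computed for every encoder state h_j given the decoder state s_t, softmax turns the scores into the attention distribution α, and the context vector is the weighted sum of the h_j. The matrix names U, W, v and all shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(H, s_t, U, W, v):
    """H: (T, d) encoder states h_j; s_t: (d,) current decoder state."""
    scores = np.array([v @ np.tanh(U @ h_j + W @ s_t) for h_j in H])  # e_jt
    alpha = softmax(scores)                # attention weights over the input words
    context = alpha @ H                    # weighted sum of the encoder states
    return alpha, context

rng = np.random.default_rng(0)
T, d = 5, 8
H, s_t = rng.normal(size=(T, d)), rng.normal(size=d)
U, W, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
alpha, context = additive_attention(H, s_t, U, W, v)
print(alpha.round(3), alpha.sum())   # a distribution over the 5 input positions
```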
Encoder Decoder with Attention Mechanism
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 55
Encoder Decoder with Attention Mechanism
St
hj
St
hj
hj
St
hj
St
hj
hj
hj
hj
hj
St
hj
St
hj
hj
St
hj
hj
hj
hj
hj
hj
hj
hj
hj
St
St
hj
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 74
Encoder Decoder with Attention Mechanism: Visualization
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 75
Attention over Images
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 79
Evaluation of Machine Translation: BLEU score
• Bilingual Evaluation Understudy (BLEU) is a score for comparing a candidate translation of text to one or more reference translations.
• Scores are calculated for individual translated segments by comparing them with a set of good quality reference translations.
• Scores are then averaged over the whole corpus to reach an estimate of the translation's overall quality.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 80
Evaluation of Machine Translation: BLEU score
Precision (uni-gram):
      p1 = (1 + 1 + 1 + 1 + 1 + 1) / 6 = 1
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 81
Evaluation of Machine Translation: BLEU score
      p1 = (1 + 0 + 0 + 0 + 0 + 0) / 6 ≈ 0.16
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 82
Evaluation of Machine Translation: BLEU score
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 83
Evaluation of Machine Translation: BLEU score
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 84
Evaluation of Machine Translation: BLEU score
Brevity Penalty
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 85
Evaluation of Machine Translation: BLEU score
Brevity Penalty
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 86
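A simplified Python sketch of two ingredients of BLEU (clipped unigram precision and the brevity penalty; full BLEU combines modified n-gram precisions up to 4-grams via a geometric mean):

```python
import math
from collections import Counter

def unigram_precision(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    ref_counts = Counter(ref)
    # clipped count: a candidate word is credited at most as often as it appears in the reference
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return matched / len(cand)

def brevity_penalty(candidate, reference):
    c, r = len(candidate.split()), len(reference.split())
    return 1.0 if c > r else math.exp(1 - r / c)

ref = "the cat is on the mat"
print(unigram_precision("the cat is on the mat", ref))    # 1.0
print(unigram_precision("the the the the the the", ref))  # 2/6 with clipping
print(brevity_penalty("the cat", ref))                    # penalizes short candidates
```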
Evaluation of machine translation
NIST
▪ n-gram precision
▪ counts how many n-grams (n = 1,…,4) match their n-gram counterparts in the reference translations.
▪ BLEU simply calculates n-gram precision, giving equal weight to each segment.
▪ NIST also calculates how informative a particular n-gram is:
▪ when a correct n-gram is found, the rarer that n-gram is, the more weight it is given.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 87
Evaluation of machine translation
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 88
Hierarchical Attention : Introduction to Hierarchical Models
Encoding of utterances
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 95
Hierarchical Attention Networks
Yang et al. Hierarchical Attention Networks for Document Classification, Proceedings of NAACL-HLT 2016
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 96
Hierarchical Attention Networks (Yang et al. 2016)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 97
Transformers & Recursive Networks
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Transformers
• Key components:
• Self attention
• Multi-head attention
• Positional encoding
• Encoder-Decoder architecture
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Transformers
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Transformers
The encoding component is a stack of encoders and the decoding component is a stack of
decoders of the same number (in the paper, this number = 6)
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Transformers
• The encoders are all identical in structure (yet they do not share weights).
• Each one is broken down into two sub-layers
• The encoder’s inputs first flow through a self-attention layer – a layer that helps the encoder look at other
words in the input sentence as it encodes a specific word – to the feed forward neural network.
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Transformers
The decoder has both those layers, but between them is an attention layer that helps the decoder
focus on relevant parts of the input sentence (similar what attention does in seq2seq models).
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Transformers
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Transformers : Self attention
S1: Animal didn’t cross the street because it was too tired
S2: Animal didn’t cross the street because it was too wide
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Transformers : Self attention
• Note: In the paper by Vaswani et al., the q, k and v vectors have a dimension of 64 and the input vector x has a dimension of 512.
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Transformers : Self attention
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Transformers : Self attention
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Transformers : Self attention
• In the actual implementation, however, Step 1 to Step 4 is done in matrix form for faster processing.
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
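A compact NumPy sketch of scaled dot-product self-attention in matrix form, as described above (the random inputs and weight shapes are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). Returns the output of one attention head."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # attention weights (softmax over keys)
    return weights @ V                        # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 512, 64            # dimensions as in the paper
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 64)
```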
Transformers : Multi-head Attention
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Transformers : Multi-head Attention
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
Transformers : Multi-head Attention
• However, the feed-forward layer is not expecting eight matrices – it’s expecting a single matrix (a
vector for each word).
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Transformers : Multi-head Attention
To tackle the issue, the eight matrices are concatenated and multiplied with additional
weight matrix WO
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Transformers : Multi-head Attention
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
Transformers : Multi-head Attention
• As we encode the word "it", one attention head is focusing most on "the animal", while another is focusing on "tired" –
in a sense, the model's representation of the word "it" bakes in some of the representation of both "animal" and "tired".
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Transformers : Positional Encoding
• Order of the sequence conveys important information for machine translation tasks and language
modeling.
• Position encoding is a way to account for the order of words in the input sequence.
• The positional information of the input token in the sequence is added to the input embedding vectors (both of dimension 512 in the paper).
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
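A NumPy sketch of the sinusoidal positional encoding used in the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)            # (50, 512): added element-wise to the input embeddings
```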
Transformers : Encoder
Layer Norm
statistics are calculated across all features
and all elements (words), for each
instance(sentence) independently.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Transformers : Decoder
• The Encoder
• processes the input sequence.
• The output of the top encoder is transformed
into a set of attention vectors K and V.
• Vectors K & V are used by each decoder in its
“encoder-decoder attention” layer which helps the
decoder focus on appropriate places in the input
sequence
• The Decoder
• has masked multi-head attention layer to
prevent the positions from seeing subsequent
positions
• The “Encoder-Decoder Attention” layer creates
its Queries matrix from the layer below it, and
takes the Keys and Values matrix from the
output of the encoder stack
Attention? Attention! | Lil'Log (lilianweng.github.io)
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Transformers: Encoders & Decoders
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Transformers : Decoder
The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time. (jalammar.github.io)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
Massive Deep Learning Language models
▪ Language models
▪ estimate the probability of words appearing in a sentence, or of the sentence itself existing.
▪ building blocks in a lot of NLP applications
▪ Massive deep learning language models
▪ pretrained using an enormous amount of unannotated data to provide a general-purpose deep
learning model.
▪ Downstream users can create task-specific models with smaller annotated training datasets (transfer
learning)
▪ Tasks executed with BERT and GPT models:
• Natural language inference
• enables models to determine whether a statement is true, false or undetermined based on a
premise.
• For example, if the premise is “tomatoes are sweet” and the statement is “tomatoes are fruit” it
might be labelled as undetermined.
• Question answering
• model receives a question regarding text content and returns the answer in text, specifically
marking the beginning and end of each answer.
• Text classification
• is used for sentiment analysis, spam filtering, news categorization
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
GPT (Generative Pre-Training) by OpenAI
1. Unsupervised pretraining: maximise the standard language-modeling likelihood

      L1(U) = Σ_i log P(u_i | u_(i−k), …, u_(i−1); Θ)

   where i) k is the size of the context window, and ii) the conditional probability P is modeled with the help of a neural network (NN) with parameters Θ
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
GPT (Generative Pre-Training) by Open AI
2. Supervised Fine-Tuning: maximising the likelihood of observing label y, given features or tokens
x_1,…,x_n.
3. Auxiliary learning objective for supervised fine-tuning to get better generalisation and faster
convergence.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
GPT-n series created by OpenAI (2018 onwards)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
GPT-3 (2020)
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
GPT-3
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
GPT-3 Task Agnostic Model
▪ LIMITATIONS OF GPT-3
▪ GPT-3 can perform a wide variety of operations such as
compose prose, write code and fiction, business articles
▪ Does not have any internal representation of what these
words even mean; it lacks a semantically grounded
model of the topics it discusses.
▪ If the model is faced with data that is unlike, or absent
from, the Internet corpus of existing text used in the
training phase, the quality of the generated language
degrades.
▪ Expensive and complex inferencing due to hefty
architecture, less interpretability of the language, and
uncertainty around what helps the model achieve its
few-shot learning behavior.
▪ The text generated carries bias of the language it is
initially trained on.
▪ The articles, blogs, memos generated by GPT-3 may face
gender, ethnicity, race, or religious bias.
▪ Although the model is capable of producing high-quality text, it
sometimes loses coherence while generating
long passages and may repeat sequences of text
again and again in a paragraph.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
BERT (Bidirectional Encoder Representations from Transformers) by google
▪ “BERT stands for Bidirectional Encoder Representations from Transformers. It is designed to pre-train
deep bidirectional representations from unlabeled text by jointly conditioning on both left and right
context. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer
to create state-of-the-art models for a wide range of NLP tasks.”
• Two variants
• BERT Base: 12 layers (transformer blocks), 12 attention heads, and 110 million parameters
• BERT Large: 24 layers (transformer blocks), 16 attention heads and, 340 million parameters
▪ BERT is pre-trained on two NLP tasks:
• Masked Language Modeling
• replace 15% of the input sequence with [MASK] and model learns to detect the masked word
• Next Sentence Prediction
▪ two sentences A and B are separated with the special token [SEP] and are formed in such a
way that 50% of the time B is the actual next sentence and 50% of the time is a random
sentence.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
BERT (Bidirectional Encoder Representations from Transformers) by google
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Transformers in Computer Vision
https://github.com/facebookresearch/detr
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Springer, Cham, 2020.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Transformers in Computer Vision
https://github.com/facebookresearch/detr
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Springer, Cham, 2020.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Transformers in Computer Vision
DEtection TRansformer (DETR): Results on COCO 2017 dataset (AP = Average Precision)
https://github.com/facebookresearch/detr
Carion, Nicolas, et al. "End-to-end object detection with transformers." European conference on computer vision. Springer, Cham, 2020.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Recursive Neural Networks
Eg: A syntactic tree structure representing a sentence. Eg: A tree representation of different segments in an image
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Recursive Neural Networks: Introduction
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Recursive Neural Networks: Introduction
• The meaning of a sentence is determined by meaning of its words and the rules that combine them.
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
Recursive Neural Networks
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Recursive Neural Networks vs Recurrent Neural Networks
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
Recursive Neural Networks
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
Recursive Neural Networks
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
Recursive Neural Networks
• Approximate the
best tree by locally
maximizing each
subtree
Source: Socher et al. Parsing Natural Scenes and Natural Language with Recursive Neural Networks, ICML’11
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
Autoencoders
Credits: Most of the slides are adapted from CS7015 Deep Learning, Dept. of CSE, IIT Madras
Unsupervised Learning
▪ Compress data while maintaining the structure and complexity of the original dataset
(dimensionality reduction).
▪ Leverage the availability of unlabeled data (which can be used for unsupervised pre-training).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Autoencoders: Introduction
▪ Autoencoders are a special type of feed-forward neural network which is trained to reproduce its
input at the output layer.
Output Layer:
Reconstructed i/p
Input Layer
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Autoencoders: Introduction
▪ It consists of an ENCODER that encodes its input X into a hidden representation h, and a
DECODER which decodes the input again from this hidden representation.
Output Layer:
Reconstructed i/p
Decoder
Encoder
Input Layer
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Autoencoders: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Autoencoders: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Autoencoders: Introduction
▪ The autoencoder model is trained to minimize a loss function which ensures that the reconstruction x̂
is close to the input x.
• As the hidden layer could produce a reduced representation of the input, autoencoders (as the one
shown above) can be used for dimensionality reduction.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Autoencoders: Introduction
Input Layer
• Also, the latent representation can be used as a new feature representation of the input X.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Autoencoders: Applications
▪ Feature Learning and Dimensionality reduction
▪ Generate Images
▪ Anomaly Detection
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Choice of Loss and Activation functions
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Choice of Loss and Activation functions
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Autoencoders and PCA
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Undercomplete and Overcomplete Autoencoders
Undercomplete AE: the dimension of h is less than the dimension of x.
Overcomplete AE: the dimension of h is greater than the dimension of x.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Regularization in Autoencoders
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Regularization in Autoencoders
• Another trick is to tie the weights of the encoder and decoder (W* = W^T). This effectively reduces the
number of parameters and acts as a regularizer.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
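A minimal NumPy sketch of weight tying, assuming a single sigmoid hidden layer (the sizes and initialization are illustrative): the decoder reuses the transpose of the encoder weight matrix, so only W and the two bias vectors are learned.

```python
import numpy as np

n_in, n_hidden = 784, 32                            # assumed layer sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(n_in, n_hidden))   # encoder weights
b, c = np.zeros(n_hidden), np.zeros(n_in)           # encoder / decoder biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    h = sigmoid(x @ W + b)          # encoder
    x_hat = sigmoid(h @ W.T + c)    # decoder reuses W transposed (W* = W^T)
    return h, x_hat
```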
Regularized Autoencoders
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Denoising Autoencoders
• A denoising autoencoder simply corrupts the input data using a probabilistic process
before feeding it to the network.
Corrupted inputs
• In other words, with probability q the input is flipped to 0 and with probability (1 - q) it is retained
as it is.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Denoising Autoencoders
• Instead, the model will now have to capture the characteristics of the data correctly.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
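A small NumPy sketch of the corruption step described above, assuming binary masking noise (each component is set to 0 with probability q and kept with probability 1 − q); the training call in the comments assumes an `autoencoder` model like the earlier sketch.

```python
import numpy as np

def corrupt(x, q, seed=0):
    """With probability q, flip each input component to 0; keep it with probability (1 - q)."""
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape) >= q       # True -> keep, False -> set to 0
    return x * mask

# x_noisy = corrupt(x_train, q=0.3)
# The model is trained to reconstruct the *clean* input from the corrupted one:
# autoencoder.fit(x_noisy, x_train, epochs=10, batch_size=128)
```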
Sparse Autoencoders
• A hidden neuron with sigmoid activation will have values between 0 and 1.
• We say that the neuron is activated when its output is close to 1 and not activated when its
output is close to 0.
• A sparse autoencoder tries to ensure that each neuron is inactive most of the time.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Sparse Autoencoders
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
Sparse Autoencoders
• Now, the equation for the loss function will look like:
• When will this term (Ω (ϴ)) reach its minimum value and
what is the minimum value?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Sparse Autoencoders
• Now, the equation for the loss function will look like:
• When will this term (Ω (ϴ)) reach its minimum value and
what is the minimum value?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
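Assuming Ω(θ) is the usual KL-divergence sparsity penalty (the slides' exact formula is not reproduced here), a NumPy sketch looks like this; ρ is the desired average activation and ρ̂_j is the observed average activation of hidden unit j over a batch.

```python
import numpy as np

def sparsity_penalty(H, rho=0.05):
    """Omega(theta) = sum_j KL(rho || rho_hat_j), for sigmoid activations H of shape (batch, n_hidden)."""
    rho_hat = H.mean(axis=0)              # average activation of each hidden unit over the batch
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# total_loss = reconstruction_error + beta * sparsity_penalty(H)   # beta weights the penalty
```

Under this assumption, Ω(θ) reaches its minimum value of 0 exactly when every ρ̂_j equals ρ, which answers the question posed on the slide.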
Contractive Autoencoders
Frobenius Norm
Variation in the output of the 2nd neuron in the hidden layer with a small variation in the 1st input.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
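For a sigmoid encoder h = σ(Wx + b), the Jacobian entries are ∂h_j/∂x_i = h_j(1 − h_j) W_ji, so the squared Frobenius norm factorizes; the sketch below assumes this single-layer sigmoid encoder (shapes are illustrative).

```python
import numpy as np

def contractive_penalty(h, W):
    """||J||_F^2 for h = sigmoid(W x + b), with h of shape (n_hidden,) and W of shape (n_hidden, n_in)."""
    dh = (h * (1.0 - h)) ** 2                     # squared derivative of the sigmoid for each hidden unit
    return np.sum(dh * np.sum(W ** 2, axis=1))    # sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2

# total_loss = reconstruction_error + lam * contractive_penalty(h, W)
```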
Contractive Autoencoders
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
Generative Models: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
Generative Models: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
Generative Models: Introduction
• In principle, yes! But in practice there is a problem with this
approach.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Generative Models: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
Generative Models: Introduction
• Given many such reviews written by the reviewer, we could learn the joint probability distribution.
• Each of the 5 words in his review can be treated as a random variable which takes one of the 50 values.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Generative Models: Introduction
M5: Waste of time and money
M6: Best Lame Historical Movie Ever
• Let us consider one such factor: P(Xi = time | Xi-2 = waste, Xi-1 = of)
• This could be estimated as:
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
Generative Models: Introduction
Joint distribution
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
Generative Models: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
Generative Models: Introduction
M7: More realistic than real life
• What can we do with this joint distribution?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
Generative Models: Introduction
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Generative Models: Introduction
• How to do this?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 37
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 38
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 39
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
• And so on …
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 40
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
• And so on …
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 41
Generative Models: Introduction
• How does the reviewer start his reviews (what is the first
word that he chooses)?
• And so on …
• But, if we select the most likely word at each time step, then it will give us the same review again and again.
• We should instead sample from this distribution!
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 42
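A tiny NumPy illustration of why we sample instead of taking the argmax; the 10-word vocabulary and the probabilities for the first word below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "movie", "was", "waste", "of", "time", "and", "money", "best", "ever"]  # assumed vocabulary
p_first = np.array([0.30, 0.05, 0.05, 0.15, 0.05, 0.05, 0.05, 0.05, 0.20, 0.05])        # assumed P(X1)

print(vocab[int(np.argmax(p_first))])   # argmax: always the same first word, hence identical reviews
print(rng.choice(vocab, p=p_first))     # sampling: different, yet plausible, first words on different runs
```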
Generative Models: Introduction
• Suppose there are 10 words in our vocabulary, and we have
computed the joint probability distribution over all the
random variables.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 43
Generative Models: Introduction
• Suppose there are 10 words in our vocabulary, and we have
computed the joint probability distribution over all the
random variables.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 44
Generative Models: Introduction
• Suppose there are 10 words in our vocabulary, and we have
computed the joint probability distribution over all the
random variables.
• For 32x32 images we want to learn: P(V1, V2, …… V1024) where Vi is a random variable corresponding to
each pixel, which could possibly have values from 0-255.
• We could factorize this joint distribution by assuming that each pixel is dependent on its neighboring pixels.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 46
Generative Models: Introduction
• Apart from classifying and generating (as discussed for the language modelling example in the previous slides), we
can also correct noisy inputs (here, images) or help in completing incomplete images.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 47
Generative Models: The concept of latent variables
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 48
Generative Models: Introduction
• Based on this notion, we would now like to learn the joint distribution
P(V, H), where V = {V1, V2, .., V#pixels} are the observed variables and H =
{H1, H2, .., H#latent features} are the hidden variables.
• That is, given an image, we can find the most likely latent
configuration (H = h) that generated this image, where h captures a
latent (abstract) representation (the important properties of the image).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 49
Generative Models: Introduction
• Under this abstraction, all these images would look very similar (i.e.,
they would have very similar latent configurations h)
• Once again, assume that we are able to learn the joint distribution
P(V,H)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 50
Variational Autoencoders (VAE)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 51
Variational Autoencoders (VAE)
• Encoder Goal: Learn a distribution over the latent variables (Q(z | X))
• Decoder Goal: Learn a distribution over the visible variables (P(X | z))
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 52
Variational Autoencoders: Encoder
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 53
Variational Autoencoders: Decoder
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 54
Variational Autoencoders
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 55
Variational Autoencoders
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 56
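A minimal functional-API sketch of a VAE in the style of the classic Keras example, assuming TensorFlow 2.x, a flattened 784-dimensional input and a 2-dimensional latent space: the encoder outputs the parameters of Q(z | X), the reparameterization trick makes sampling differentiable, and the loss adds a KL term to the reconstruction term.

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim, input_dim = 2, 784                      # assumed sizes

enc_in = tf.keras.Input(shape=(input_dim,))
h = layers.Dense(256, activation="relu")(enc_in)    # encoder hidden layer
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

def sample_z(args):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

dec_h = layers.Dense(256, activation="relu")(z)              # decoder for P(X | z)
x_hat = layers.Dense(input_dim, activation="sigmoid")(dec_h)

vae = tf.keras.Model(enc_in, x_hat)
recon = input_dim * tf.reduce_mean(tf.keras.losses.binary_crossentropy(enc_in, x_hat))
kl = -0.5 * tf.reduce_mean(tf.reduce_sum(
    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1))
vae.add_loss(recon + kl)
vae.compile(optimizer="adam")
# vae.fit(x_train, epochs=10, batch_size=128)   # x_train: pixel values scaled to [0, 1]
```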
Markov Models
https://towardsdatascience.com/markov-models-and-markov-chains-explained-in-real-life-probabilistic-workout-routine-65e47b5c9a73
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Application of Markov Models
▪ Parking lots have a fixed number of spots available, but how many of these are available at any given
point in time can be described as a combination of multiple factors or variables:
▪ Day of the week,
▪ Time of the day,
▪ Parking fee,
▪ Proximity to transit,
▪ Proximity to businesses,
▪ Number of free parking spots in the vicinity,
▪ Number of available spots in the parking lot itself (a full parking lot may deter some people from parking
there)
▪ Some of these factors may be independent of each other, while others are not
▪ For instance, Parking Fee typically depends on Time of day and Day of week.
https://towardsdatascience.com/markov-models-and-markov-chains-explained-in-real-life-probabilistic-workout-routine-65e47b5c9a73
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Applications of Markov Models
▪ Since Markov models describe behavior over time, they can be used to answer questions about the
future state of the system:
▪ How it evolves over time
▪ In what state is the system going to be in N time steps?
▪ Tracing possible sequences in the process:
▪ When the system goes from State A to State B in N steps, what is the likelihood that it follows a
specific path p?
▪ Parking Lot Example
▪ What is the occupancy rate of the parking lot 3 hours from now?
▪ How likely is the parking lot to be at 50% capacity and then at 25% capacity in 5 hours?
▪ Markov assumption
▪ It assumes the transition probability between each state only depends on the current state you are
in
▪ A Markov chain has short-term memory: it only remembers where you are now and where you
want to go next.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Markov Chains
▪ A Markov chain is a stochastic model describing a sequence of possible events in which the probability
of each event depends only on the state attained in the previous event.
▪ Ques 1: calculate the percentage of instances it's sunny on days directly following rainy days.
▪ Ques 2: calculate the percentage of instances it's rainy on days directly following sunny days.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Markov Chains
▪ The previous 3 days are [rainy, sunny, rainy]. What’s the probability of rainy weather tomorrow?
▪ The previous 2 days are [rainy, rainy], What’s the probability of rainy weather tomorrow?
▪ The previous 3 days are [sunny, rainy, sunny]. What’s the probability of rainy weather tomorrow?
▪ The best starting point is a simple model that performs better than a random guess
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
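A small NumPy sketch of these questions; the transition probabilities below are assumed for illustration. Because of the Markov assumption, only the most recent day matters, so all three "previous days" questions above reduce to a single row lookup.

```python
import numpy as np

states = ["sunny", "rainy"]
# Assumed transition matrix: P[i, j] = P(tomorrow = states[j] | today = states[i])
P = np.array([[0.8, 0.2],
              [0.4, 0.6]])

today = states.index("rainy")
print("P(rainy tomorrow | rainy today) =", P[today, states.index("rainy")])

# "In what state is the system going to be in N time steps?" -> multiply the distribution by P, N times
pi_today = np.array([0.0, 1.0])                     # we are in "rainy" today
print("distribution 3 steps ahead:", pi_today @ np.linalg.matrix_power(P, 3))
```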
Example : Modeling my Workout
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Example: Modeling my workout – Transitional Matrix
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Ex: WORKOUT – Markov Chain
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Ex: WORKOUT – Markov Chain
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Ex: WORKOUT – Markov Chain
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 11
Ex: WORKOUT – Markov Chain
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Summary of Markov Model
▪ Stochastic Model
▪ a discrete-time process indexed at times 1, 2, 3, … that takes values called states, which are observed
▪ Example states (S) ={hot , cold }
▪ State series over time => z∈ S_T
▪ Weather for 4 days can be a sequence => {z1=hot, z2 =cold, z3 =cold, z4 =hot}
▪ A Markov model is engineered to handle data which can be represented as a 'sequence' of observations
over time.
▪ Markov Assumptions
1. Limited Horizon assumption: the probability of being in a state at time t depends only on the state at
time (t-1).
▪ This means the current state provides enough of a summary of the past to reasonably predict the future.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Summary of Markov Model
▪ Stationary Process Assumption: Conditional (probability) distribution over the next state, given the
current state, doesn't change over time.
https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
Hidden Markov Model
▪ HMM
▪ is a probabilistic model to infer unobserved
information from observed data
▪ We cannot observe the states themselves, but only
the result of some probability
function (the observation) of the states.
▪ HMM is a statistical Markov model in
which the system being modeled is
assumed to be a Markov process with
unobserved (hidden) states.
▪ Markov Model: Series of (hidden) states
z={z_1,z_2………….} drawn from state alphabet
S ={s_1,s_2,…….𝑠_|𝑆|} where z_i belongs to S.
▪ Hidden Markov Model: Series of observed outputs x = {x_1, x_2, ...} drawn from an
output alphabet V = {v_1, v_2, .., v_|V|}, where x_i belongs to V.
▪ Example: Set of states (S) = {Happy, Grumpy}; Set of hidden states (Q) = {Sunny, Rainy};
State series over time = z ∈ S_T;
Observed states for 4 days = {z1 = Happy, z2 = Grumpy, z3 = Grumpy, z4 = Happy}
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Assumptions of HMM
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Three important questions in HMM are
https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
HMM – Forward Procedure
▪ S = {hot, cold}
▪ V = {v1 = 1 ice cream, v2 = 2 ice creams, v3 = 3 ice creams}, where V is the number of ice creams
consumed on a day.
▪ Example sequence = {x1 = v2, x2 = v3, x3 = v1, x4 = v2}
https://towardsdatascience.com/markov-and-hidden-markov-model-3eec42298d75
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
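A NumPy sketch of the forward procedure for this ice-cream example; the initial distribution, transition matrix A and emission matrix B below are assumed values for illustration, not the numbers from the slide.

```python
import numpy as np

states = ["hot", "cold"]
pi = np.array([0.6, 0.4])          # assumed initial state distribution
A = np.array([[0.7, 0.3],          # A[i, j] = P(next state = j | current state = i)
              [0.4, 0.6]])
B = np.array([[0.2, 0.4, 0.4],     # B[i, k] = P(observe k+1 ice creams | state i)
              [0.5, 0.4, 0.1]])

obs = [1, 2, 0, 1]                 # x = {v2, v3, v1, v2} as 0-based observation indices

alpha = pi * B[:, obs[0]]          # alpha[t, i] = P(x_1..x_t, state_t = i)
for t in range(1, len(obs)):
    alpha = (alpha @ A) * B[:, obs[t]]
print("P(observation sequence) =", alpha.sum())
```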
HMM – Forward Procedure
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
HMM- Forward Procedure
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
HMM-Forward Procedure
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
HMM-Forward Procedure
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
Hidden Markov Model Algorithm
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25
Baum-Welch Algorithm
▪ Also known as the forward-backward algorithm
▪ It is a dynamic programming approach and a special case of the expectation-maximization (EM) algorithm.
▪ Its purpose is to tune the parameters of the HMM, namely:
▪ the state transition matrix A
▪ the emission matrix B
▪ the initial state distribution π₀
such that the model is maximally like the observed data.
▪ Includes the
1. Initial phase
2. Forward phase
3. Backward phase
4. Update phase (the full parameter set is commonly denoted λ = (A, B, π₀))
▪ The forward and the backward phases form the E-step of the EM algorithm
▪ they compute the expected hidden states given the observed data and the current (not yet updated)
parameter matrices
▪ the update phase is the M-step
▪ its update formulas tune the parameter matrices to best fit the observed data and the expected
hidden states
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 26
Baum-Welch Algorithm
1. Initial phase
▪ parameter matrices A, B, π₀ are initialized
▪ Can be done randomly if there is no prior knowledge about them.
2. Forward phase
▪ α is the joint probability of the observed data up to time k and the state at time k
https://medium.com/analytics-vidhya/baum-welch-algorithm-for-training-a-hidden-markov-model-part-2-of-the-hmm-series-d0e393b4fb86
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 27
Baum-Welch Algorithm
3. Backward phase
▪ β function is defined as the conditional probability of the observed data from time k+1 given the state
at time k
▪ The second term of the R.H.S. is the state transition probability from A, while the last term is the
emission probability from B.
▪ The R.H.S. is summed over all possible states at time k +1.
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 28
Baum-Welch Algorithm
4. Update phase
The probability distribution of the state at time k, given all the observed data we have
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 29
Baum-Welch Algorithm
4. Update phase
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 30
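A sketch of the quantity described above, assuming the forward (α) and backward (β) tables have already been computed as (T × N) arrays; γ[t, i] is the posterior probability of being in state i at time t given all the observations.

```python
import numpy as np

def gamma_from_alpha_beta(alpha, beta):
    """gamma[t, i] = alpha[t, i] * beta[t, i] / sum_j alpha[t, j] * beta[t, j]."""
    g = alpha * beta
    return g / g.sum(axis=1, keepdims=True)

# A typical M-step update for the transition matrix then uses the pairwise posteriors xi:
#   A[i, j] <- sum_t xi[t, i, j] / sum_t gamma[t, i]
```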
HMM – Viterbi Algorithm
https://medium.com/analytics-vidhya/baum-welch-algorithm-for-training-a-hidden-markov-model-part-2-of-the-hmm-series-d0e393b4fb86
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 31
HMM- Maximum Likelihood
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 32
HMM – Viterbi Algorithm
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 33
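A compact NumPy sketch of the Viterbi algorithm for finding the most likely hidden state sequence; it reuses π, A and B in the same format as the forward-procedure sketch above (assumed example values).

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Return the most likely hidden state sequence for observation indices `obs`."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # backpointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # scores[i, j]: best path into i, then transition i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# viterbi(pi, A, B, [1, 2, 0, 1])   # e.g. with the matrices from the forward-procedure sketch
```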
Automatic Speech Recognition
• ACOUSTIC MODEL
Uses audio recordings of speech
& compiles to statistical
representations of the sounds for
words.
• Language model
• which gives the probabilities
of sequences of words.
• Lexicon
• set of words with their
pronunciations broken down
into phonemes
https://jonathan-hui.medium.com/speech-recognition-gmm-hmm-8bb5eff8b196
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 34
Building blocks of ASR
https://jonathan-hui.medium.com/speech-recognition-gmm-hmm-8bb5eff8b196
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 35
Realization of ASR
Rohini R Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 36
Optimizers
&
Practical Methodology
DSE 3151, B.Tech Data Science & Engineering
Rohini R Rao & Abhilash Pai
Department of Data Science and Computer Applications
MIT Manipal
Backpropagation in ANN : Recap
Gradient Descent Algorithm
Backpropagation in ANN : Recap
Gradient descent: Error surface
Backpropagation in ANN : Recap
Gradient descent: Error surface
Backpropagation in ANN : Recap
Gradient descent and its variants
Vanilla (Batch) Gradient Descent
Accumulated history of weight updates
Update = an exponentially weighted average of past gradients (more weight to recent updates and less weight to old updates)
Backpropagation in ANN : Recap
Gradient descent and its variants
Momentum-based Gradient Descent
Backpropagation in ANN : Recap
Gradient descent and its variants
Momentum-based Gradient Descent
Stochastic GD & SGD with Momentum DL
• "Stochastic" refers to the randomness on which the
algorithm is based.
• Instead of taking the whole dataset for each
iteration, randomly select the batches of data
• The path taken is full of noise as compared to
the gradient descent algorithm.
• It uses a higher number of iterations to reach
the (local) minimum, so the overall
computation time increases.
• The computation cost is still less than that of
the gradient descent optimizer.
• If the data is enormous and computational
time is an essential factor, SGD should be
preferred over batch gradient descent
algorithm.
• Stochastic Gradient Descent with
Momentum Deep Learning Optimizer
• momentum helps in faster
convergence of the loss function.
Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 19
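In Keras, the difference between plain SGD and SGD with momentum is a single argument; the learning-rate and momentum values below are typical defaults, assumed for illustration.

```python
import tensorflow as tf

sgd = tf.keras.optimizers.SGD(learning_rate=0.01)                        # plain stochastic GD
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9) # SGD with momentum

# model.compile(optimizer=sgd_momentum, loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# Mini-batches are drawn randomly each epoch, which is what makes the descent "stochastic":
# model.fit(x_train, y_train, batch_size=32, epochs=10)
```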
SGD with Nesterov Momentum Optimization
• Yurii Nesterov in 1983
• to measure the gradient of the cost
function not at the local position but
slightly ahead in the direction of the
momentum
• the momentum vector will be
pointing in the right direction (i.e.,
toward the optimum)
• it will be slightly more accurate to
use the gradient measured a bit
farther in that direction rather than
using the gradient at the original
position
Aurelien Geron, “Hands-On Machine Learning with Scikit-Learn , Keras & Tensorflow, OReilly Publications
Rohini R Rao & Abhilash Pai, Dept of Data Science and CA 20
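A NumPy sketch of one Nesterov momentum step: the gradient is evaluated at the look-ahead point rather than at the current position (the learning rate and β below are assumed typical values).

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """m <- beta*m - lr*grad(w + beta*m); w <- w + m (gradient measured slightly ahead)."""
    lookahead_grad = grad_fn(w + beta * v)
    v = beta * v - lr * lookahead_grad
    return w + v, v

# The Keras equivalent is:
# tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
```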
Adagrad (Adaptive Gradient Descent) Deep
Learning Optimizer
• Adaptive Learning Rate
• Scaling down the gradient vector along the
steepest dimension
• If the cost function is steep along the ith
dimension, then s will get larger and larger at each
iteration
• No need to modify the learning rate manually
• more reliable than gradient descent algorithms,
and it reaches convergence at a higher speed.
• Disadvantage
• it decreases the learning rate aggressively and
monotonically.
• Due to small learning rates, the model eventually
becomes unable to acquire more knowledge, and
hence the accuracy of the model is compromised.
• β= 0.9
• Linear Search
Backpropagation in ANN : Recap
GD with adaptive learning rate
• Motivation: Can we have a different learning rate for each parameter which takes care of
the frequency of features ?
• Intuition: Decay the learning rate for parameters in proportion to their update history.
• For sparse features, accumulated update history is small
• For dense features, accumulated update history is large
Make the learning rate inversely proportional to the update history, i.e., if the feature has
been updated fewer times, give it a larger learning rate, and vice versa.
Credits: Most of the slides are adapted from CS7015 Deep Learning, Dept. of CSE, IIT Madras
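A NumPy sketch of the Adagrad idea described above: the accumulated squared-gradient history s makes the effective learning rate of frequently updated parameters shrink (ε avoids division by zero; the hyperparameter values are assumed).

```python
import numpy as np

def adagrad_step(w, s, grad, lr=0.01, eps=1e-8):
    """Accumulate squared gradients in s and scale each parameter's step by 1/sqrt(s + eps)."""
    s = s + grad ** 2                      # per-parameter update history
    w = w - lr * grad / np.sqrt(s + eps)   # larger history -> smaller effective learning rate
    return w, s

# Keras equivalent: tf.keras.optimizers.Adagrad(learning_rate=0.01)
```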
Generative Models : Overview
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 2
Taxonomy of Generative Models
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 3
Generative Adversarial Networks: Overview
• The basic idea of GANs is similar to that of Variational Autoencoders (VAEs): we sample from a simple tractable distribution and then
learn a complex transformation of it so that the output looks as if it came from the training distribution.
• However, GANs start with a d-dimensional vector and generate an n-dimensional vector (usually d < n), whereas VAEs
start from an n-dimensional input and produce an output of the same dimension.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 4
Generative Adversarial Networks: Overview
2. Discriminator: A neural network that distinguishes between output data point (Fake) from the
Generator and training data samples (Real)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 5
Generative Adversarial Networks: Overview
2. Discriminator: A neural network that distinguishes between output data point (Fake) from the
Generator and training data samples (Real)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 6
Generative Adversarial Networks: Overview
2. Discriminator: A neural network that distinguishes between output data point (Fake) from the
Generator and training data samples (Real)
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 7
Generative Adversarial Networks: Overview
Initially, the generator mostly produces noisy images (since it takes random latent vectors as input).
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 8
Generative Adversarial Networks: Overview
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 9
Generative Adversarial Networks: Overview
Equilibrium is reached when the generator finally succeeds in fooling the discriminator.
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 10
Objective function of Generator
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 12
Objective function of Generator
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 13
Objective function of Discriminator
• And it should do this for all possible real images and all
possible fake images
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 14
Generative Adversarial Networks: MiniMax formulation
Objective function:
min over ФG, max over θD of:  E_{x ~ p_data}[ log D_θD(x) ] + E_{z ~ p(z)}[ log (1 − D_θD(G_ФG(z))) ]
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 15
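In practice the minimax objective is implemented with binary cross-entropy on the discriminator's outputs; the sketch below assumes TensorFlow/Keras with logit outputs, and uses the common non-saturating generator loss rather than the literal minimax form.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # maximize log D(x) + log(1 - D(G(z)))  <=>  minimize this cross-entropy
    real_loss = bce(tf.ones_like(real_logits), real_logits)    # real samples labelled 1
    fake_loss = bce(tf.zeros_like(fake_logits), fake_logits)   # fake samples labelled 0
    return real_loss + fake_loss

def generator_loss(fake_logits):
    # the generator tries to make the discriminator label its fakes as real
    return bce(tf.ones_like(fake_logits), fake_logits)
```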
Training the Discriminator Network
• Before training (when the discriminator is not performing optimally: it cannot clearly distinguish real and fake data)
Real (1) vs. Fake (0), separated by the decision boundary
Generative Adversarial Networks (GAN) - YouTube
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 16
Training the Discriminator Network
• After training (when discriminator is performing optimally – clearly distinguishes real and fake data)
Decision boundary
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 17
Training the Generator Network
• Before training
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 18
Training the Generator Network
• After training
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 19
Training the Generator Network
Data Distribution
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 20
Training the GANs
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 21
Deep Convolutional GANs
Generator
Radford et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, ICLR 2016
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 22
Deep Convolutional GANs
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 23
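A Keras sketch of a DCGAN-style generator that maps a latent vector to a 64x64x3 image with strided transposed convolutions, batch normalization and a tanh output; the exact layer sizes are assumed for illustration and only loosely follow Radford et al.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_dcgan_generator(latent_dim=100):
    """Latent vector -> 4x4 feature map -> four 2x upsampling steps -> 64x64x3 image."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(latent_dim,)),
        layers.Dense(4 * 4 * 512, use_bias=False),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Reshape((4, 4, 512)),
        layers.Conv2DTranspose(256, 5, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(128, 5, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(64, 5, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="tanh"),
    ])
```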
GANs: Applications
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 24
GANs: Applications
Isola et al. Image-to-Image Translation with Conditional Adversarial Networks, CVPR 2017
Rohini Rao & Abhilash K Pai, Dept. of DSCA DSE 3151 Deep Learning 25