
Unit-3

ConvNet Architectures

ConvNet: In deep learning, a convolutional neural network (CNN) is a class of deep neural
networks, most commonly applied to analyzing visual imagery.
ConvNet architectures are basically made of 3 elements:-
1. Convolution Layers
2. Pooling Layers
3. Fully Connected Layers

Convolution- The term convolution refers to the mathematical combination of two functions
to produce a third function.

Pooling- The objective of Pooling is to down-sample an input representation (image, hidden-layer output matrix, etc.), reducing its dimensions and allowing for assumptions to be made about features contained in the sub-regions created.

Fully Connected Layers-

 Fully connected layers (FCL) in a neural network are those layers where all the inputs from one layer are connected to every activation unit of the next layer.
 ConvNet architectures follow a general rule of successively applying Convolutional Layers to the input, periodically down-sampling the spatial dimensions with Pooling Layers while increasing the number of feature maps.
Feature Maps- A feature map is the output of one filter applied to the previous layer; i.e., at each layer, the feature maps are the outputs of that layer.
The architectures discussed below serve as general design guidelines that practitioners adapt to implement feature extraction, which is then used for image classification, object detection, image captioning, image segmentation and much more.
Some common architectures:-
1. AlexNet
2. VGG
3. Inception
4. ResNet
AlexNet:
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a landmark
model that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. It
introduced several innovative ideas that shaped the future of CNNs.

AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It
uses traditional stacked convolutional layers with max-pooling in between. Its deep network
structure allows for the extraction of complex features from images.

 The architecture employs overlapping pooling layers to reduce spatial dimensions


while retaining the spatial relationships among neighbouring features.
 Activation function: AlexNet uses the ReLU activation function and dropout
regularization, which enhance the model’s ability to capture non-linear relationships
within the data.
The key features of AlexNet are as follows:-

 AlexNet was created to be more computationally efficient than earlier CNN


topologies. It introduced parallel computing by utilising two GPUs during training.
 AlexNet is a relatively shallow network compared to GoogLeNet. It has eight layers,
which makes it simpler to train and less prone to overfitting on smaller datasets.
 In 2012, AlexNet produced ground-breaking results in the ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). It outperformed prior CNN architectures greatly and
set the path for the rebirth of deep learning in computer vision.
 ‘ReLU’ is used as an activation function rather than ‘tanh’.

The output size of a convolutional operation can be calculated using the following
formula:

O = (W − K + 2P) / S + 1

Where: O is the output size
W is the input size (width or height)
K is the kernel size
P is the padding
S is the stride

This formula helps to determine the dimensions of the output feature map, which is
essential for designing and understanding the architecture of a CNN.
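For example, applying the formula to AlexNet's first convolutional layer (227×227 input, 11×11 kernel, no padding, stride 4) gives a 55×55 feature map. A small Python sketch (the helper name conv_output_size is ours, chosen only for illustration):

# Output size of a convolution: O = (W - K + 2P) / S + 1
def conv_output_size(W, K, P, S):
    return (W - K + 2 * P) // S + 1

# AlexNet's first layer: 227x227 input, 11x11 kernel, no padding, stride 4
print(conv_output_size(W=227, K=11, P=0, S=4))  # 55 -> feature map is 55x55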

AlexNet Summary: the full layer-by-layer table can be reproduced with alexnet.summary() in the code below.

Code: AlexNet

# Import utilities
import tensorflow as tf
from tensorflow.keras import layers

# AlexNet model
alexnet = tf.keras.models.Sequential([
    # Convolutional feature extractor
    layers.Conv2D(filters=96, kernel_size=(11, 11), strides=4, activation='relu',
                  input_shape=(227, 227, 3)),
    layers.MaxPooling2D(pool_size=(3, 3), strides=2, padding='same'),
    layers.Conv2D(filters=256, kernel_size=(5, 5), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    layers.Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu'),
    layers.Conv2D(filters=384, kernel_size=(3, 3), padding='same', activation='relu'),
    layers.Conv2D(filters=256, kernel_size=(3, 3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3, 3), strides=2),
    layers.Flatten(),

    # Fully connected layers with dropout
    layers.Dense(units=4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(units=4096, activation='relu'),
    layers.Dropout(0.5),

    # Units in the last layer are 1000, one per ImageNet class
    layers.Dense(units=1000, activation='softmax')
])

# Summary of AlexNet
alexnet.summary()

ReLU Non-Linearity

 AlexNet demonstrated that non-saturating activation functions such as ReLU allow deep CNNs to be trained much more quickly than saturating functions like tanh or sigmoid.
 The figure below (from the original paper) shows that a CNN with ReLUs (solid curve) reaches a 25% training error rate six times faster than an equivalent network using tanh (dotted curve). This was evaluated on the CIFAR-10 dataset.

Data Augmentation
Overfitting can be reduced by showing the neural network many varied versions of the same image. This effectively generates more data and forces the network to learn the key features rather than memorising incidental details.
Augmentation by Mirroring: Suppose our training set contains a picture of a cat. Its mirror image is also a valid picture of a cat, so by simply flipping images about the vertical axis we can double the size of the training dataset.

Augmentation by Random Cropping of Images

 Randomly cropping the original image will also produce additional data that is simply
the original data shifted.
 For the network’s inputs, the creators of AlexNet selected random crops with
dimensions of 227 by 227 from within the 256 by 256 image boundary. They multiplied
the size of the data by 2048 using this technique.
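A minimal TensorFlow sketch of these two augmentations (this does not reproduce AlexNet's exact preprocessing pipeline; the function name and shapes are illustrative):

import tensorflow as tf

def augment(image):
    # Mirroring: random flip about the vertical axis
    image = tf.image.random_flip_left_right(image)
    # Random cropping: take a 227x227 crop from within the 256x256 image
    image = tf.image.random_crop(image, size=(227, 227, 3))
    return image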

Dropout

 During dropout, each neuron is dropped from the network with a probability of 0.5. A dropped neuron makes no contribution to either the forward or the backward pass.
 As seen in the graphic below, each training example is therefore processed by a different "thinned" network architecture. The learned weight parameters are consequently more robust and less prone to overfitting.

Residual Networks

 Recent years have seen tremendous progress in the field of Image Processing and
Recognition. Deep Neural Networks are becoming deeper and more complex.
 It has been proved that adding more layers to a Neural Network can make it more
robust for image-related tasks.
 But adding layers can also cause accuracy to drop. That's where Residual Networks come into play.
 Deep learning practitioners add many layers in order to extract increasingly abstract features from complex images.
 So the first layers may detect edges, and the later layers may detect recognizable shapes, like the tires of a car.
 But if we add too many layers (beyond roughly 30), the network's performance suffers and it attains a low accuracy.
 This is contrary to the intuition that adding layers should make a neural network better. It is not caused by overfitting, because in that case dropout and regularization techniques could solve the issue.
 It is mainly caused by the well-known vanishing gradient problem.

Residual Block

 The core idea of ResNet is the residual block, which consists of two or more
convolutional layers, a batch normalization layer, and skip connections.
 The skip connections add the original input to the output of the residual block,
allowing the network to learn residual functions instead of trying to approximate the
desired output directly.

y = F(x) + x

 The output of the previous layer is added to the output of the layer after it in the
residual block.
 The hop or skip could be 1, 2 or even 3 layers. When adding, the dimensions of x may be different from those of F(x), because the convolutions may have reduced the spatial dimensions or changed the number of channels.
 Thus, we add an additional 1 x 1 convolution layer on the shortcut to change the dimensions of x accordingly.

 A residual block has a 3 x 3 convolution layer followed by a batch normalization layer


and a ReLU activation function.
 This is again continued by a 3 x 3 convolution layer and a batch normalization layer.

 The skip connection basically skips both these layers and adds directly before the
ReLU activation function. Such residual blocks are repeated to form a residual
network.
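A minimal Keras sketch of such a residual block, assuming the input already has the same number of channels as 'filters' so the identity shortcut can be added directly (the function name is illustrative):

from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x
    # 3x3 conv -> batch norm -> ReLU
    y = layers.Conv2D(filters, (3, 3), padding='same')(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    # 3x3 conv -> batch norm
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    # Skip connection: add the input just before the final ReLU
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)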

The residual block used in ResNet-50 is called the Bottleneck Residual Block. This block has
the following architecture:

Here's a breakdown of the components within the residual block:


ReLU Activation: The ReLU (Rectified Linear Unit) activation function is applied after each
convolutional layer and the batch normalization layers. ReLU allows only positive values to
pass through, introducing non-linearity into the network, which is essential for the network
to learn complex patterns in the data.
Bottleneck Convolution Layers: The block consists of three convolutional layers with batch
normalization and ReLU activation after each.

 The first convolutional layer uses a filter size of 1x1 and reduces the number of channels in the input data. This dimensionality reduction helps to compress the data and improve computational efficiency without sacrificing too much information.
 The second convolutional layer uses a filter size of 3x3 to extract spatial features from the data.
 The third convolutional layer again uses a filter size of 1x1 to restore the original number of channels before the output is added to the shortcut connection.
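A sketch of the 1x1-3x3-1x1 bottleneck block described above, with an optional 1x1 projection on the shortcut when the channel counts differ (a simplified illustration, not the exact ResNet-50 code; names and the expansion factor are assumptions):

from tensorflow.keras import layers

def bottleneck_block(x, filters, expansion=4):
    shortcut = x
    # 1x1 conv reduces the number of channels
    y = layers.Conv2D(filters, (1, 1))(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    # 3x3 conv extracts spatial features
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.BatchNormalization()(y)
    y = layers.Activation('relu')(y)
    # 1x1 conv restores (expands) the channel count
    y = layers.Conv2D(filters * expansion, (1, 1))(y)
    y = layers.BatchNormalization()(y)
    # 1x1 projection so the shortcut matches the output dimensions
    if shortcut.shape[-1] != filters * expansion:
        shortcut = layers.Conv2D(filters * expansion, (1, 1))(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    y = layers.Add()([y, shortcut])
    return layers.Activation('relu')(y)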

Training a Convnet

Activation Functions
Sigmoid: σ(x) = 1 / (1 + e^(−x)), squashes numbers into the range [0, 1].

Tanh: tanh(x), squashes numbers into the range [−1, 1] and is zero-centred.

ReLU: f(x) = max(0, x), does not saturate for positive inputs and is cheap to compute.

Weight Initialization
Q: What happens when W = constant is used for initialization? Every neuron in a layer computes the same output and receives the same gradient, so the neurons never learn different features: the symmetry is never broken.

 First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation).

This works okay for small networks, but causes problems with deeper networks.

Forward pass for a 6-layer net with hidden size 4096: with std = 0.01 the activations shrink towards zero in the deeper layers, so the gradients also collapse and learning stalls.

Increasing the std of the initial weights from 0.01 to 0.05 instead pushes the (tanh) activations into saturation at ±1, which again kills the gradients.

Weight Initialization: “Xavier” Initialization

“Xavier” initialization: std = 1 / sqrt(Din)

“Just right”: activations are nicely scaled for all layers!



 For conv layers, Din is filter_size² × input_channels.

Let y = x₁w₁ + x₂w₂ + ... + x_Din·w_Din. For the variance of y to match the variance of each xᵢ, each of the Din independent terms must contribute Var(wᵢ) = 1/Din, which gives std = 1/sqrt(Din).

For ReLU: when the activation is changed from tanh to ReLU, Xavier initialization no longer keeps the activations well scaled, because ReLU zeroes out half of its inputs and halves the variance. The correction is to use std = sqrt(2 / Din) (He initialization).
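A small NumPy experiment in the spirit of these slides: pass random data through a 6-layer tanh net with hidden size 4096 and compare the spread of the activations under small-random versus Xavier initialization (the function and its defaults are our own illustrative choices, not code from the slides):

import numpy as np

def forward_stats(init_std_fn, num_layers=6, din=4096, n=100):
    x = np.random.randn(n, din)
    for i in range(num_layers):
        W = np.random.randn(din, din) * init_std_fn(din)
        x = np.tanh(x @ W)
        print(f"layer {i + 1}: mean={x.mean():.4f} std={x.std():.4f}")

# Small random numbers: activations collapse toward zero in deep layers
forward_stats(lambda din: 0.01)
# Xavier: std = 1/sqrt(Din) keeps the activations nicely scaled
forward_stats(lambda din: 1.0 / np.sqrt(din))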



VGG Architecture

 VGG stands for Visual Geometry Group; it is a standard deep Convolutional


Neural Network (CNN) architecture with multiple layers.
 The “deep” refers to the number of weight layers: VGG-16 and VGG-19 consist of 16 and 19 weight layers (convolutional plus fully connected) respectively.
 The VGG architecture is the basis of ground-breaking object recognition
models. Developed as a deep neural network, the VGGNet also surpasses
baselines on many tasks and datasets beyond ImageNet. Moreover, it is now
still one of the most popular image recognition architectures.

VGG Convolutional Network Architecture

 VGGNets are based on the most essential features of convolutional neural networks
(CNN). The following graphic shows the basic concept of how a CNN works:

The VGG network is constructed with very small convolutional filters. The VGG-16 consists
of 13 convolutional layers and three fully connected layers.
Let’s take a brief look at the architecture of VGG:

 Input: The VGGNet takes in an image input size of 224×224. For the ImageNet
competition, the creators of the model cropped out the center 224×224 patch in each
image to keep the input size of the image consistent.
 Convolutional Layers: VGG’s convolutional layers leverage a minimal receptive field,
i.e., 3×3, the smallest possible size that still captures up/down and left/right.
Moreover, there are also 1×1 convolution filters acting as a linear transformation of
the input. This is followed by a ReLU unit, which is a huge innovation from AlexNet
that reduces training time. ReLU stands for rectified linear unit activation function; it
is a piecewise linear function that will output the input if positive; otherwise, the
output is zero. The convolution stride is fixed at 1 pixel to keep the spatial resolution
preserved after convolution (stride is the number of pixel shifts over the input matrix).
 Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not
usually leverage Local Response Normalization (LRN) as it increases memory
consumption and training time. Moreover, it makes no improvements to overall
accuracy.
 Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the
three layers, the first two have 4096 channels each, and the third has 1000 channels,
1 for each class.

Fully Connected layer



VGG16 Architecture

 The number 16 in VGG16 refers to the fact that it is a 16-layer deep neural network (VGGNet). This makes VGG16 a fairly large network, with a total of around 138 million parameters.
 Even by modern standards, it is a huge network. However, the simplicity of the VGGNet16 architecture is what makes it appealing: just by looking at its architecture, it can be seen that it is quite uniform.
 Each block has a few convolution layers followed by a pooling layer that halves the height and the width. The number of filters starts at 64 in the first block and doubles to 128 and then 256, with 512 filters used in the last blocks (see the sketch below).
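The uniform structure makes VGG easy to express in code. A condensed Keras sketch of the VGG16 layout (13 conv + 3 fully connected layers; this is an illustration, not the pretrained keras.applications model, and the helper name vgg_block is ours):

from tensorflow.keras import layers, models

def vgg_block(x, filters, convs):
    # 'convs' 3x3 convolutions with stride 1, then a 2x2 max pool halving H and W
    for _ in range(convs):
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    return layers.MaxPooling2D((2, 2), strides=2)(x)

inputs = layers.Input(shape=(224, 224, 3))
x = inputs
for filters, convs in [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]:
    x = vgg_block(x, filters, convs)
x = layers.Flatten()(x)
x = layers.Dense(4096, activation='relu')(x)
x = layers.Dense(4096, activation='relu')(x)
outputs = layers.Dense(1000, activation='softmax')(x)
vgg16 = models.Model(inputs, outputs)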

Complexity and Challenges of VGG

 The number of filters doubles with each successive stack of convolution layers. This is a major principle used to design the VGG16 architecture. One of the crucial downsides of the VGG16 network is that it is a huge network, which means that it takes more time to train its parameters.
 Because of its depth and number of fully connected layers, the VGG16 model weights file is more than 533 MB. This makes deploying a VGG network a time-consuming task.

Inception-Net
 Inception-Net is a convolutional neural network (CNN) architecture that Google developed to improve upon the performance of previous CNNs.
 It uses "inception modules" that apply a combination of 1x1, 3x3, and 5x5
convolutions on the input data and utilizes auxiliary classifiers to improve
performance.
 InceptionNet is a convolutional neural network architecture developed by Google in
2014. It is known for using inception modules, blocks of layers that learn a
combination of local and global features from the input data.
 InceptionNet was designed to be more efficient and faster to train than other deep
convolutional neural networks.
 It has been used in image classification, object detection, and face recognition and has
been the basis for popular neural network architectures such as Inception-v4 and
Inception-ResNet.

What is InceptionNet?

 InceptionNet is a convolutional neural network (CNN) designed for image classification


tasks and developed for the ImageNet Large Scale Visual Recognition Challenge.
 InceptionNet is known for using inception modules, blocks of layers designed to learn
a combination of local and global features from the input data.
 These modules are composed of smaller convolutional and pooling layers, which are
combined to allow the network to learn spatial and temporal features from the input
data.
 InceptionNet was designed to train more efficiently and faster than other deep CNNs.
It has been widely used in various applications, including image classification, object
detection, and face recognition.

Inception Blocks

 Conventional convolutional neural networks typically use convolutional and pooling


layers to extract features from the input data. However, these networks are limited in
capturing local and global features, as they typically focus on either one or the other.
 The inception blocks in the InceptionNet architecture are intended to solve the
problem of learning a combination of local and global features from the input data.
 Inception blocks address this problem using a modular design that allows the network
to learn a variety of feature maps at different scales.
 These feature maps are then concatenated together to form a more comprehensive
representation of the input data. This allows the network to capture a wide range of
features, including both low-level and high-level features, which can be useful for
tasks such as image classification.
 By using inception blocks, the InceptionNet architecture can learn a more
comprehensive set of features from the input data, which can improve the network's
performance on tasks such as image classification.

Inception Modules

 Inception modules are a key feature of the InceptionNet convolutional neural


network architecture. They are blocks of layers designed to learn a combination of
local and global features from the input data.
 Inception modules comprise a series of smaller convolutional and pooling layers,
which are combined to allow the network to learn spatial and temporal features
from the input data.
 The idea behind the inception module is to learn a variety of feature maps at
different scales and then concatenate them together to form a more
comprehensive representation of the input data.
 This allows the network to capture a wide range of low-level and high-level
features, which can be useful for tasks such as image classification.
 Inception modules can be added to the network at various points, depending on
the desired complexity level and the input data size. We can also modify them by
changing the number and size of the convolutional and pooling layers and the type
of nonlinear activation function used.

Working flow of Inception Module

 An Inception Module is a building block used in the Inception network


architecture for CNNs.
 It improves performance by allowing multiple parallel convolutional filters to
be applied to the input data.
 The basic structure of an Inception Module is a combination of multiple
convolutional filters of different sizes applied in parallel to the input data.
 The filters may have different kernel sizes (e.g. 3x3, 5x5) and/or different
strides (e.g. 1x1, 2x2).
 Output of each filter is concatenated together to form a single output feature
map.
 Inception Module also includes a max pooling layer, which takes the maximum
value from a set of non-overlapping regions of the input data.
 This reduces the spatial dimensionality of the data and allows for translation
invariance.
 The use of multiple parallel filters and max pooling layers allows the Inception
Module to extract features at different scales and resolutions, improving the
network's ability to recognize patterns in the input data.
 In summary, the Inception module improves feature extraction and thereby the network's overall performance; a minimal code sketch of such a module is shown below.
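A minimal Keras sketch of an inception module with four parallel branches and channel-wise concatenation (filter counts are arbitrary example parameters; GoogLeNet additionally uses 1x1 reductions before the 3x3 and 5x5 convolutions, which are included here):

from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    # Branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, (1, 1), padding='same', activation='relu')(x)
    # Branch 2: 1x1 reduction followed by 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, (1, 1), padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3, (3, 3), padding='same', activation='relu')(b2)
    # Branch 3: 1x1 reduction followed by 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, (1, 1), padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f5, (5, 5), padding='same', activation='relu')(b3)
    # Branch 4: 3x3 max pooling followed by a 1x1 projection
    b4 = layers.MaxPooling2D((3, 3), strides=1, padding='same')(x)
    b4 = layers.Conv2D(pool_proj, (1, 1), padding='same', activation='relu')(b4)
    # The outputs of all branches are concatenated along the channel axis
    return layers.Concatenate()([b1, b2, b3, b4])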

Different Inception Versions


Inception v1

 Inception v1, also known as GoogLeNet, was the first version of the Inception network
architecture.
 It was introduced in 2014 by Google and designed to improve the performance of
CNNs on the ImageNet dataset.
 It uses a modular architecture, where the network comprises multiple Inception
Modules stacked together.
 Each module contains multiple parallel convolutional filters of different sizes, which
are applied to the input data and concatenated to form a single output feature map.
 Inception v1 includes a total of 9 Inception Modules, with max-pooling layers at
different scales.
 It includes a global average pooling layer and a fully connected layer for classification.
 It achieved state-of-the-art performance on the ImageNet dataset at the time of its
release.
 It was a very deep and complex network. It introduced the idea of using multiple
parallel convolutional filters and showed how to reduce the computational cost using
1x1 convolution.

Inception v2

 Inception v2 is an improved version of the Inception network architecture introduced


in 2015 by Google.
 It builds upon the original Inception v1 architecture and aims to improve the
performance of CNNs further.
 Inception v2 uses a similar modular architecture, where the network comprises
multiple Inception Modules stacked together.

 It uses a new Inception Module, called the Inception-ResNet Module, which combines
the benefits of both Inception and Residual networks.
 These Inception-ResNet Modules allow for a deeper network with fewer parameters
and better performance.
 Inception v2 also uses a batch normalization layer after each convolutional layer,
which helps improve the network's stability and performance.
 Inception v2 achieved state-of-the-art performance on several image classification
benchmarks, and its architecture has been used as a basis for many subsequent CNNs.
 Inception v2 improved the Inception architecture by introducing Inception-ResNet
modules, which allow for deeper networks with fewer parameters, and batch
normalization layers which improved the stability and performance of the network.

Inception v3

 Inception v3 is the third version of the Inception network architecture, which was
introduced in 2015 by Google.
 It builds upon the original Inception v1 and v2 architectures and aims to improve
the performance of CNNs further.
 Inception v3 uses a similar modular architecture, where the network comprises
multiple Inception Modules stacked together.
 It uses a new type of Inception Module, called the Factorization-Based Inception
Module, which uses factorization techniques to reduce the number of parameters
in the network and improve performance.
 Inception v3 also introduces batch normalization layers after each convolutional
layer, which helps improve the network's stability and performance.
 Inception v3 achieved state-of-the-art performance on the ImageNet dataset, and
its architecture has been used as a basis for many subsequent CNNs.
 Inception v3 improved the Inception architecture by introducing Factorization-
Based Inception Modules, which allows for a deeper network with fewer

parameters, and batch normalization layers which improved the stability and
performance of the network.

Inception v4

 Inception v4 is the fourth version of the Inception network architecture, which was
introduced in 2016 by Google.
 It builds upon the original Inception v1, v2, and v3 architectures and aims to improve
the performance of CNNs further.
 Inception v4 uses a similar modular architecture, where the network comprises
multiple Inception Modules stacked together.
 It uses a new type of Inception Module, called the Inception-Auxiliary Module, which
provides auxiliary classifiers to improve the network's performance.
 Inception v4 also introduces the use of Stem layers, which reduce the spatial
resolution of the input data before it is passed to the Inception Modules.
 Inception v4 achieved state-of-the-art performance on several image classification
benchmarks, and its architecture has been used as a basis for many subsequent CNNs.
 Inception v4 improved the Inception architecture by introducing Inception-Auxiliary
Modules, which provide auxiliary classifiers to improve the network's performance,
and Stem layers, which reduce the spatial resolution of the input data before it is
passed to the Inception Modules.

Deep Learning Optimization Algorithms


 Optimization algorithms play a crucial role in training deep learning models. They
control how a neural network is incrementally changed to model the complex
relationships encoded in the training data.
 With an array of optimization algorithms available, the challenge often lies in selecting
the most suitable one for your specific project. Whether you’re working on improving
accuracy, reducing training time, or managing computational resources,
understanding the strengths and applications of each algorithm is fundamental.

What is a model-optimization algorithm?

 A deep learning model comprises multiple layers of interconnected neurons organized


into layers. Each neuron computes an activation function on the incoming data and
passes the result to the next layer. The activation functions introduce non-linearity,
allowing for complex mappings between inputs and outputs.
 The connection strength between neurons and their activations are parametrized by
weights and biases.
 These parameters are iteratively adjusted during training to minimize the discrepancy
between the model’s output and the desired output given by the training data. The
discrepancy is quantified by a loss function.

 Schematic visualization of the deep learning model training process. In each iteration
of the training cycle, the neural network produces predictions on a batch of training
samples. This predicted output is compared to the ground truth using a loss function.
 The gradient of the loss function with respect to the neural network’s weights
uncovers how these weights have to be updated to bring the model’s outputs closer
to the ground truth.
 This adjustment is governed by an optimization algorithm. Optimizers utilize gradients
computed by backpropagation to determine the direction and magnitude of
parameter updates, aiming to navigate the model’s high-dimensional parameter
space efficiently.
 Optimizers employ various strategies to balance exploration and exploitation, seeking
to escape local minima and converge to optimal or near-optimal solutions.

Gradient Descent

 Gradient Descent is an algorithm designed to minimize a function by iteratively


moving towards the minimum value of the function.

 The gradient descent optimization algorithm applied to a cost function. The cost
function is convex, with a unique minimum. The gradient descent algorithm starts

with a randomly selected initial weight. The gradient vector indicates the direction of
the steepest ascent. The optimization process is illustrated by arrows representing
incremental steps taken in the opposite direction of the gradient, moving the initial
weight toward the minimum cost point on the curve.
All deep learning model optimization algorithms widely used today are based on Gradient
Descent. Hence, having a good grasp of the technical and mathematical details is essential.
So let’s take a look:
Objective: Gradient Descent aims to find a function’s parameters (weights) that minimize the
cost function. In the case of a deep learning model, the cost function is the average of the loss
for all training samples as given by the loss function. While the loss function is a function of
the model’s output and the ground truth, the cost function is a function of the model’s
weights and biases.
Working steps:

 Initialization: Start with random values for the model’s weights.


 Gradient computation: Calculate the gradient of the cost function with respect to
each parameter. The gradient is a vector that points in the direction of the steepest
increase of the function.
 Update parameters: Adjust the model’s parameters in the direction opposite to the gradient. This step is done by subtracting a fraction of the gradient from the current values of the parameters. The size of this step is determined by the learning rate, a hyperparameter that controls how fast or slowly we move toward the optimal weights.
 Mathematical representation: the update rule for each parameter w can be mathematically represented as

w := w − α · ∇w J(w)

Where w represents the model’s parameters (weights), α is the learning rate, and ∇w J(w) is the gradient of the cost function J(w) with respect to w.

 The learning rate is a crucial hyperparameter that needs to be chosen carefully. If it’s
too small, the algorithm will converge very slowly. If it’s too large, the algorithm might
overshoot the minimum and fail to converge.

 An illustration of how different learning rate configurations can affect the


convergence of the algorithm. If the learning rate is too small, convergence to the
optimal value is slow. If the learning rate is too high, the optimization overshoots the
minimum.
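The working steps above can be written as a toy NumPy sketch on a simple made-up cost function J(w) = (w − 3)², whose minimum at w = 3 is known in advance:

import numpy as np

def grad_J(w):
    # Gradient of J(w) = (w - 3)^2
    return 2 * (w - 3)

w = np.random.randn()      # initialization: random starting weight
alpha = 0.1                # learning rate
for step in range(100):
    g = grad_J(w)          # gradient computation
    w = w - alpha * g      # update in the direction opposite to the gradient
print(w)                   # converges close to the minimum at w = 3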
Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a variant of the traditional Gradient Descent
optimization algorithm that introduces randomness into the optimization process to improve
convergence speed and potentially escape local minima.
This approach can lead to a quicker descent but might involve more meandering. Let’s take a
closer look at the specifics of Stochastic Gradient Descent:
Objective: like Gradient Descent, the primary goal of SGD is to minimize the cost function of
a model by iteratively adjusting its parameters (weights). However, SGD aims to achieve this
goal more efficiently by using only a single training example at a time to inform the update of
the model’s parameters.
Working:
Initialization: Start with a random set of parameters for the model.
Gradient computation: Instead of calculating the gradient of the cost function over the entire
training data, SGD computes the gradient based on a single randomly selected training
example.
Update parameters: Update the model’s parameters using this computed gradient. The
parameters are adjusted in the direction opposite to the gradient, similar to basic Gradient
Descent.
Mathematical representation:
The parameter update rule in SGD is similar to that of Gradient Descent but applies to a single training example i:

w := w − α · ∇w Ji(w)

Here, w represents the model’s parameters (weights), α is the learning rate, and ∇w Ji(w) is the gradient of the cost function Ji(w) for the ith training example with respect to w.

Adam (Adaptive Moment Estimation)


Adam seeks to optimize the model’s parameters to minimize the cost function, utilizing
adaptive learning rates for each parameter. It uniquely combines momentum (keeping track
of past gradients) and scaling the learning rate based on the second moments of the
gradients, making it effective for a wide range of problems.

Working:
Initialization: Start with random initial parameter values and initialize a first moment vector
(m) and a second moment vector (v). Each “moment vector” stores aggregated information
about the gradients of the cost function with respect to the model’s parameters:

 The first moment vector accumulates the means (or the first moments) of the
gradients, acting like a momentum by averaging past gradients to determine the
direction to update the parameters.
 The second moment vector accumulates the variances (or second moments) of the
gradients, helping adjust the size of the updates by considering the variability of past
gradients.
Both moment vectors are initialized to zero at the start of the optimization. Their size is
identical to the size of the model’s parameters (i.e., if a model has N parameters, both vectors
will be vectors of size N).
Adam also introduces a bias correction mechanism to account for these vectors being
initialized as zeros. The vectors’ initial state leads to a bias towards zero, especially in the early
stages of training, because they haven’t yet accumulated enough gradient information. To
correct this bias, Adam adjusts the calculations of the adaptive learning rate by applying a
correction factor to both moment vectors. This factor grows smaller over time and
asymptotically approaches 1, ensuring that the influence of the initial bias diminishes as
training progresses.

 Compute gradient: For each mini-batch, compute the gradients of the cost function
with respect to the parameters.
 Update moments: Update the first moment vector (m) with the bias-corrected
moving average of the gradients. Similarly, update the second moment vector (v) with
the bias-corrected moving average of the squared gradients.
 Adjust learning rate: Calculate the adaptive learning rate for each parameter using
the updated first and second moment vectors, ensuring effective parameter updates.
 Update parameters: Use the adaptive learning rates to update the model’s
parameters.
Mathematical representation: The parameter update rule for Adam can be expressed as

w := w − α · mₜ / (√vₜ + ε)

Where w represents the parameters, α is the learning rate, mₜ and vₜ are the bias-corrected estimates of the first and second moments of the gradients, respectively, and ε is a small constant added for numerical stability.
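A single Adam update step written out in NumPy, following the steps above (β1, β2 and ε use the commonly cited defaults; this is an illustrative sketch, not a full training loop, and the function name is ours):

import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Bias correction (counteracts the zero initialization of m and v)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Parameter update with the adaptive, per-parameter step size
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Usage: m and v start as zero vectors the same size as w; t counts steps from 1
w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
w, m, v = adam_step(w, grad=np.array([0.1, -0.2, 0.3]), m=m, v=v, t=1)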

Recurrent Neural Networks


 Recurrent Neural Networks (RNNs) work a bit differently from regular neural networks. In a feedforward neural network, information flows in one direction, from input to output.
 In an RNN, however, information is fed back into the system after each step. Think of reading a sentence: when you try to predict the next word, you don't just look at the current word, you also need to remember the words that came before to make an accurate guess.
 RNNs allow the network to “remember” past information by feeding the output from
one step into next step.
 This helps the network understand the context of what has already happened and
make better predictions based on that.
 For example when predicting the next word in a sentence the RNN uses the previous
words to help decide what word is most likely to come next.

The figure shows the basic architecture of an RNN and the feedback loop mechanism where the output is passed back as input for the next time step.
RNN Differs from Feedforward Neural Networks

 Feedforward Neural Networks (FNNs) process data in one direction from input to
output without retaining information from previous inputs. This makes them suitable
for tasks with independent inputs like image classification. However, FNNs struggle
with sequential data since they lack memory.
 Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow
information from previous steps to be fed back into the network. This feedback
enables RNNs to remember prior inputs making them ideal for tasks where context is
important.

Key Components of RNNs


1. Recurrent Neurons: The fundamental processing unit in RNN is a Recurrent Unit.
Recurrent units hold a hidden state that maintains information about previous inputs
in a sequence. Recurrent units can “remember” information from prior steps by
feeding back their hidden state, allowing them to capture dependencies across time.

Recurrent Neuron

2. RNN Unfolding:
 RNN unfolding or unrolling is the process of expanding the recurrent structure
over time steps. During unfolding each step of the sequence is represented as
a separate layer in a series illustrating how information flows across each time
step.
 This unrolling enables backpropagation through time (BPTT) a learning process
where errors are propagated across time steps to adjust the network’s weights
enhancing the RNN’s ability to learn dependencies within sequential data.

Recurrent Neural Network Architecture

 RNNs share similarities in input and output structures with other deep learning
architectures but differ significantly in how information flows from input to output.
 Unlike traditional deep neural networks, where each dense layer has distinct weight
matrices, RNNs use shared weights across time steps, allowing them to remember
information over sequences.
 In RNNs, the hidden state 𝐻𝑖 is calculated for every input 𝑋𝑖 to retain sequential
dependencies. The computations follow these core formulas:

Hidden State Calculation:

h_t = σ(U · x_t + W · h_t-1 + B)

Here, h_t represents the current hidden state, h_t-1 the previous hidden state, U and W are weight matrices, and B is the bias.

Output Calculation:

Y = O(V · h_t + C)

The output Y is calculated by applying O, an activation function, to the weighted hidden state, where V and C represent the weights and bias.

Overall Function:

The complete RNN operation can therefore be written as a single recurrent function applied over the whole sequence, where the state matrix S holds each element sᵢ representing the network's state at each time step i.

Working

At each time step RNNs process units with a fixed activation function. These units have
an internal hidden state that acts as memory that retains information from previous
time steps. This memory allows the network to store past knowledge and adapt based
on new inputs.

Updating the Hidden State in RNNs

The current hidden state h_t depends on the previous state h_t-1 and the current input x_t, and is calculated using the following relations:

 State Update:

h_t = f(h_t-1, x_t)

Where h_t is the current hidden state, h_t-1 is the previous state and x_t is the current input.

 Activation Function:

h_t = tanh(W_hh · h_t-1 + W_xh · x_t)

Here, 𝑊ℎℎ is the weight matrix for the recurrent neuron, and 𝑊𝑥ℎ is the weight matrix for the
input neuron.

 Output Calculation:

y_t = W_hy · h_t

Where y_t is the output and W_hy is the weight matrix at the output layer.

These parameters are updated using backpropagation. However, since RNNs work on sequential data, an adapted form of backpropagation is used, known as backpropagation through time (BPTT); a minimal forward-pass sketch is shown below.
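A NumPy sketch of the forward pass described by these equations (the weight shapes, toy dimensions and random inputs are purely illustrative):

import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy):
    h = np.zeros(W_hh.shape[0])            # initial hidden state
    outputs = []
    for x in xs:                           # one step per element of the sequence
        h = np.tanh(W_hh @ h + W_xh @ x)   # hidden state update
        outputs.append(W_hy @ h)           # output at this time step
    return outputs, h

# Toy dimensions: input size 4, hidden size 8, output size 3, sequence length 5
rng = np.random.default_rng(0)
W_xh, W_hh, W_hy = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(3, 8))
outputs, h = rnn_forward([rng.normal(size=4) for _ in range(5)], W_xh, W_hh, W_hy)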
Types of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:

 One-to-One RNN: This is the simplest type of neural network architecture where there
is a single input and a single output. It is used for straightforward classification tasks
such as binary classification where no sequential data is involved.

One to One RNN

 One-to-Many RNN: In a One-to-Many RNN the network processes a single input to


produce multiple outputs over time. This is useful in tasks where one input triggers a
sequence of predictions (outputs). For example in image captioning a single image can
be used as input to generate a sequence of words as a caption.

 Many-to-One RNN: The Many-to-One RNN receives a sequence of inputs and


generates a single output. This type is useful when the overall context of the input
sequence is needed to make one prediction. In sentiment analysis the model receives
a sequence of words (like a sentence) and produces a single output like positive,
negative or neutral.

 Many-to-Many RNN: The Many-to-Many RNN type processes a sequence of inputs


and generates a sequence of outputs. In language translation task a sequence of
words in one language is given as input, and a corresponding sequence in another
language is generated as output.

LSTM – Long Short Term Memory


 Long Short-Term Memory (LSTM) is an enhanced version of the Recurrent Neural Network (RNN).
 LSTMs can capture long-term dependencies in sequential data, making them ideal for tasks like language translation, speech recognition and time series forecasting.
 Unlike traditional RNNs, which use a single hidden state passed through time, LSTMs introduce a memory cell that holds information over extended periods, addressing the challenge of learning long-term dependencies.
Problem with Long-Term Dependencies in RNN

 Recurrent Neural Networks (RNNs) are designed to handle sequential data by


maintaining a hidden state that captures information from previous time steps.
 However, they often face challenges in learning long-term dependencies, where information from distant time steps becomes crucial for making accurate predictions about the current state.
 This problem is known as the vanishing gradient or exploding gradient problem.

Vanishing Gradient: When training a model over time, the gradients (which help the
model learn) can shrink as they pass through many steps. This makes it hard for the
model to learn long-term patterns since earlier information becomes almost
irrelevant.

Exploding Gradient: Sometimes, gradients can grow too large, causing instability. This
makes it difficult for the model to learn properly, as the updates to the model become
erratic and unpredictable.
 Both of these issues make it challenging for standard RNNs to effectively capture long-
term dependencies in sequential data.

LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the forget gate and the output gate. These gates decide what information to add to, remove from and output from the memory cell.
Input gate: Controls what information is added to the memory cell.
Forget gate: Determines what information is removed from the memory cell.
Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the
network which allows them to learn long-term dependencies. The network has a hidden state
which is like its short-term memory. This memory is updated using the current input, the
previous hidden state and the current state of the memory cell.

Working of LSTM
LSTM architecture has a chain structure that contains four neural networks and different
memory blocks called cells.

Information is retained by the cells and the memory manipulations are done by the gates.
There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two inputs, x_t (the input at the current time step) and h_t-1 (the previous cell output), are fed to the gate and multiplied with weight matrices, followed by the addition of a bias. The result is passed through a sigmoid activation function, which gives an output between 0 and 1. If the output for a particular cell state element is close to 0, that piece of information is forgotten; if it is close to 1, the information is retained for future use.
The equation for the forget gate is:

f_t = σ(W_f · [h_t-1, x_t] + b_f)

Where:
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the current input and the previous hidden state.
b_f is the bias with the forget gate.
σ is the sigmoid activation function.

Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered (similar to the forget gate) using the inputs h_t-1 and x_t. Then, a vector of candidate values is created using the tanh function, which gives outputs from −1 to +1, covering all possible values from h_t-1 and x_t. Finally, the values of the candidate vector and the regulated (sigmoid) values are multiplied to obtain the useful information. The equations for the input gate are:

i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)

To update the cell state, we multiply the previous state C_t-1 by f_t, discarding the information we had previously chosen to forget. Next, we add i_t ⊙ Ĉ_t, the candidate values scaled by how much we decided to update each state value:

C_t = f_t ⊙ C_t-1 + i_t ⊙ Ĉ_t

Where ⊙ denotes element-wise multiplication and tanh is the activation function.



Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be output using the inputs h_t-1 and x_t. Finally, the values of the vector and the regulated values are multiplied and sent as the output of the cell and as the hidden state input to the next cell. The equations for the output gate are:

o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

Bidirectional LSTM Model

 Bidirectional LSTM (Bi-LSTM/BLSTM) is a variation of the normal LSTM which processes sequential data in both the forward and backward directions. This allows a Bi-LSTM to learn longer-range dependencies in sequential data than a traditional LSTM, which can only process sequential data in one direction.
 Bi-LSTMs are made up of two LSTM networks one that processes the input sequence
in the forward direction and one that processes the input sequence in the backward
direction.
 The outputs of the two LSTM networks are then combined to produce the final output.
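A minimal Keras sketch contrasting a plain LSTM with a bidirectional LSTM for a many-to-one task such as sentiment classification (the vocabulary size, sequence length and layer sizes are placeholder values chosen for illustration):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                  # sequences of 100 token ids (placeholder length)
    layers.Embedding(input_dim=10000, output_dim=64),  # token ids -> dense vectors
    # For a unidirectional model, this line would simply be layers.LSTM(64)
    layers.Bidirectional(layers.LSTM(64)),         # forward + backward LSTM, outputs concatenated
    layers.Dense(1, activation='sigmoid')          # single output, e.g. positive/negative sentiment
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()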
