Unit-3
ConvNet Architectures
ConvNet: In deep learning, a convolutional neural network (CNN) is a class of deep neural
networks, most commonly applied to analyzing visual imagery.
ConvNet architectures are basically made of 3 elements:-
1. Convolution Layers
2. Pooling Layers
3. Fully Connected Layers
Convolution- The term convolution refers to the mathematical combination of two functions
to produce a third function.
Fully Connected Layers (FCL) in a neural network are those layers where all the inputs from one layer are connected to every activation unit of the next layer.
ConvNet architectures follow a general rule of successively applying Convolutional Layers to the input, periodically down-sampling the spatial dimensions using the Pooling Layers while increasing the number of feature maps.
Feature Maps- A feature map is the output of one filter applied to the previous layer; i.e., at each layer, the feature maps are the outputs of that layer.
The architectures discussed below serve as general design guidelines that modern practitioners adapt to implement feature extraction, which is further used for image classification, object detection, image captioning, image segmentation and much more.
Some common architectures:-
1. AlexNet
2. VGG
3. Inception
4. ResNet
AlexNet:
AlexNet, developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, is a landmark
model that won the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) in 2012. It
introduced several innovative ideas that shaped the future of CNNs.
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers. It
uses traditional stacked convolutional layers with max-pooling in between. Its deep network
structure allows for the extraction of complex features from images.
The output size of a convolutional operation can be calculated using the following formula:
O = (W − K + 2P) / S + 1
where W is the input size, K is the kernel (filter) size, P is the padding and S is the stride.
This formula helps to determine the dimensions of the output feature map, which is essential for designing and understanding the architecture of a CNN.
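A minimal Python sketch of this calculation (the helper name conv_output_size is an illustrative assumption):

# Minimal sketch: spatial output size of a convolution, assuming square inputs and filters
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # O = (W - K + 2P) / S + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

# Example: AlexNet's first layer, an 11x11 filter with stride 4 on a 227x227 input
print(conv_output_size(227, 11, stride=4, padding=0))  # 55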
AlexNet Summary
Code: AlexNet
#Import Utilities
import tensorflow as tf
from tensorflow.keras import layers, utils

#AlexNet Model: 5 convolutional layers followed by 3 fully connected layers
alexnet = tf.keras.models.Sequential([
    layers.Conv2D(filters=96, kernel_size=(11,11), strides=4, activation='relu',
                  input_shape=(227,227,3)),
    layers.MaxPooling2D(pool_size=(3,3), strides=2, padding='same'),
    layers.Conv2D(filters=256, kernel_size=(5,5), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3,3), strides=2),
    layers.Conv2D(filters=384, kernel_size=(3,3), padding='same', activation='relu'),
    layers.Conv2D(filters=384, kernel_size=(3,3), padding='same', activation='relu'),
    layers.Conv2D(filters=256, kernel_size=(3,3), padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=(3,3), strides=2),
    layers.Flatten(),
    #Fully connected head (3 FC layers, as described above), with dropout between them
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(4096, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(1000, activation='softmax')
])

#Summary of AlexNet
alexnet.summary()
ReLU Non-Linearity
AlexNet demonstrated that the non-saturating ReLU activation function allows deep CNNs to be trained much more quickly than saturating activation functions like Tanh or Sigmoid.
In the original paper's experiment on the CIFAR-10 dataset, the network using ReLUs (solid curve) reaches a 25% training error rate about six times faster than an equivalent network using tanh (dotted curve).
Data Augmentation
Overfitting can be reduced by showing the Neural Net various transformed versions of the same image.
Additionally, this produces more training data and compels the Neural Net to learn the important characteristics of the images rather than memorising specific examples.
Augmentation by Mirroring: Consider that our training set contains a picture of a cat. Its mirror image is also a valid picture of a cat. This means that we can double the size of the training dataset by simply flipping each image about the vertical axis.
Randomly cropping the original image will also produce additional data that is simply
the original data shifted.
For the network’s inputs, the creators of AlexNet selected random crops with
dimensions of 227 by 227 from within the 256 by 256 image boundary. They multiplied
the size of the data by 2048 using this technique.
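A minimal Keras sketch of these two augmentations, assuming 256x256 inputs cropped to 227x227 as described above:

import tensorflow as tf
from tensorflow.keras import layers

# Mirroring and random cropping as Keras preprocessing layers
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),   # mirror the image about the vertical axis
    layers.RandomCrop(227, 227),       # take a random 227x227 crop out of the 256x256 image
])

images = tf.random.uniform((8, 256, 256, 3))   # a dummy batch of images
augmented = augment(images, training=True)     # augmentation is applied only in training mode
print(augmented.shape)                         # (8, 227, 227, 3)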
Dropout
A neuron is removed from the neural network during dropout with a probability of
0.5. A neuron that is dropped does not make any contribution to either forward or
backward propagation.
In effect, each training example is processed by a different Neural Network architecture. The learned weight parameters are therefore more robust and less prone to overfitting.
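A minimal Keras sketch showing that dropout is only active during training:

import tensorflow as tf
from tensorflow.keras import layers

drop = layers.Dropout(0.5)       # each unit is dropped with probability 0.5
x = tf.ones((1, 10))
print(drop(x, training=True))    # about half the units are zeroed, the surviving units are scaled by 2
print(drop(x, training=False))   # at inference time the input passes through unchanged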
Residual Networks
Recent years have seen tremendous progress in the field of Image Processing and
Recognition. Deep Neural Networks are becoming deeper and more complex.
Adding more layers to a Neural Network can make it more powerful for image-related tasks.
But it can also cause it to lose accuracy. That's where Residual Networks come into play.
Deep learning practitioners tend to add many layers in order to extract increasingly complex features from images.
So, the first layers may detect edges, and the subsequent layers towards the end may detect recognizable shapes, like the tires of a car.
But if we keep adding layers (say, beyond about 30), the network's performance suffers and it attains a lower accuracy.
This is contrary to the intuition that adding layers should make a neural network better. It is not due to overfitting, because in that case one could use dropout and regularization techniques to solve the issue.
It is mainly caused by the well-known vanishing gradient problem.
Residual Block
The core idea of ResNet is the residual block, which consists of two or more
convolutional layers, a batch normalization layer, and skip connections.
The skip connections add the original input to the output of the residual block,
allowing the network to learn residual functions instead of trying to approximate the
desired output directly.
y = F(x) + x
In a residual block, the input of the block (the output of the previous layer) is added to the output of the stacked layers inside the block.
The hop or skip can span 1, 2 or even 3 layers. When adding, the dimensions of x may be different from those of F(x), because the convolutions inside the block can reduce its dimensions.
Thus, we add an additional 1 x 1 convolution layer on the skip path to change the dimensions of x.
The skip connection bypasses these layers and its output is added directly before the ReLU activation function. Such residual blocks are repeated to form a residual network.
The residual block used in ResNet-50 is called the Bottleneck Residual Block. This block has the following architecture:
The first convolutional layer uses a filter size of 1x1 and reduces the number of channels in the input data. This dimensionality reduction helps to compress the data and improve computational efficiency without sacrificing too much information.
The second convolutional layer uses a filter size of 3x3 to extract spatial features from the data.
The third convolutional layer again uses a filter size of 1x1 to restore the original
number of channels before the output is added to the shortcut connection.
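A minimal Keras sketch of such a bottleneck block; the helper name bottleneck_block, the filter counts and the exact placement of batch normalization are illustrative assumptions rather than the official ResNet-50 configuration:

import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 1, activation='relu')(x)                   # 1x1: reduce channels
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(filters, 3, padding='same', activation='relu')(y)   # 3x3: spatial features
    y = layers.BatchNormalization()(y)
    y = layers.Conv2D(4 * filters, 1)(y)                                  # 1x1: restore channels
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != 4 * filters:
        # 1x1 convolution on the skip path so the dimensions match before adding
        shortcut = layers.Conv2D(4 * filters, 1)(shortcut)
    out = layers.Add()([y, shortcut])     # y = F(x) + x
    return layers.ReLU()(out)             # ReLU applied after the addition

inputs = tf.keras.Input(shape=(56, 56, 64))
outputs = bottleneck_block(inputs, filters=64)
print(outputs.shape)                      # (None, 56, 56, 256)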
Training a Convnet
Activation Functions
Sigmoid: σ(x) = 1 / (1 + e^(−x)). Squashes its input into the range (0, 1). It saturates for large positive or negative inputs, which kills the gradient, and its outputs are not zero-centered.
Tanh: tanh(x). Squashes its input into the range (−1, 1). It is zero-centered, but it still saturates.
ReLU: max(0, x). It does not saturate for positive inputs, is cheap to compute, and in practice converges much faster than sigmoid or tanh.
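A minimal NumPy sketch of these three activation functions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes values into (0, 1); saturates for large |x|

def tanh(x):
    return np.tanh(x)                  # zero-centered, squashes values into (-1, 1); also saturates

def relu(x):
    return np.maximum(0, x)            # passes positive values through, zeroes out negatives

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))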
Weights initialization
Q: What happens when a constant initialization (W = constant) is used? Every neuron in a layer computes the same output and receives the same gradient, so all neurons learn identical weights and the symmetry is never broken.
First idea: small random numbers (Gaussian with zero mean and 1e-2 standard deviation).
This works okay for small networks, but causes problems with deeper networks, where the activations progressively shrink towards zero and the gradients vanish.
For ReLU networks, He initialization (scaling the random weights by sqrt(2/n), where n is the number of inputs to the layer) keeps the variance of the activations roughly constant across layers.
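A minimal Keras sketch of these initialization choices (the layer sizes are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

# Small random Gaussian initialization (works for shallow networks)
small_random = layers.Dense(256, activation='tanh',
                            kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))

# He initialization, the usual choice for ReLU layers
he_init = layers.Dense(256, activation='relu', kernel_initializer='he_normal')

# Xavier/Glorot initialization, the Keras default, commonly used with tanh
xavier_init = layers.Dense(256, activation='tanh', kernel_initializer='glorot_uniform')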
VGG Architecture
VGGNets are built from the most essential building blocks of convolutional neural networks (CNNs): stacked convolution layers followed by pooling layers and fully connected layers.
The VGG network is constructed with very small convolutional filters. The VGG-16 consists
of 13 convolutional layers and three fully connected layers.
Let’s take a brief look at the architecture of VGG:
Input: The VGGNet takes in an image input size of 224×224. For the ImageNet
competition, the creators of the model cropped out the center 224×224 patch in each
image to keep the input size of the image consistent.
Convolutional Layers: VGG’s convolutional layers leverage a minimal receptive field,
i.e., 3×3, the smallest possible size that still captures up/down and left/right.
Moreover, there are also 1×1 convolution filters acting as a linear transformation of
the input. This is followed by a ReLU unit, which is a huge innovation from AlexNet
that reduces training time. ReLU stands for rectified linear unit activation function; it
is a piecewise linear function that will output the input if positive; otherwise, the
output is zero. The convolution stride is fixed at 1 pixel to keep the spatial resolution
preserved after convolution (stride is the number of pixel shifts over the input matrix).
Hidden Layers: All the hidden layers in the VGG network use ReLU. VGG does not
usually leverage Local Response Normalization (LRN) as it increases memory
consumption and training time. Moreover, it makes no improvements to overall
accuracy.
Fully-Connected Layers: The VGGNet has three fully connected layers. Out of the
three layers, the first two have 4096 channels each, and the third has 1000 channels,
1 for each class.
VGG16 Architecture
The number 16 in the name VGG16 refers to the fact that it is a 16-layer deep neural network. This means that VGG16 is a pretty extensive network, with a total of around 138 million parameters.
Even according to modern standards, it is a huge network. However, VGGNet16
architecture’s simplicity is what makes the network more appealing. Just by looking at
its architecture, it can be said that it is quite uniform.
Each block contains a few convolution layers followed by a pooling layer that reduces the height and the width. The number of filters starts at 64 in the first block and doubles to 128, then 256, and finally reaches 512 filters in the last blocks.
The number of filters doubles with every stack of convolution layers. This is a major principle used to design the architecture of the VGG16 network. One of the crucial downsides of the VGG16 network is that it is a huge network, which means that it takes more time to train its parameters.
Because of its depth and number of fully connected layers, the VGG16 model is more
than 533MB. This makes implementing a VGG network a time-consuming task.
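For reference, the pre-trained VGG16 network described above ships with Keras and can be loaded directly (this downloads the ImageNet weights, which account for the large model size mentioned above):

import tensorflow as tf

vgg16 = tf.keras.applications.VGG16(weights='imagenet')
vgg16.summary()   # 13 convolutional layers + 3 fully connected layers, about 138 million parameters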
Inception-Net
Inception-Net is a convolutional neural network (CNN) architecture that Google developed to improve upon the performance of previous CNNs.
It uses "inception modules" that apply a combination of 1x1, 3x3, and 5x5 convolutions to the input data, and utilizes auxiliary classifiers to improve performance.
InceptionNet is a convolutional neural network architecture developed by Google in
2014. It is known for using inception modules, blocks of layers that learn a
combination of local and global features from the input data.
InceptionNet was designed to be more efficient and faster to train than other deep
convolutional neural networks.
It has been used in image classification, object detection, and face recognition and has
been the basis for popular neural network architectures such as Inception-v4 and
Inception-ResNet.
Inception v1 (GoogLeNet)
Inception v1, also known as GoogLeNet, was the first version of the Inception network
architecture.
It was introduced in 2014 by Google and designed to improve the performance of
CNNs on the ImageNet dataset.
It uses a modular architecture, where the network comprises multiple Inception
Modules stacked together.
Each module contains multiple parallel convolutional filters of different sizes, which
are applied to the input data and concatenated to form a single output feature map.
Inception v1 includes a total of 9 Inception Modules, with max-pooling layers at
different scales.
It includes a global average pooling layer and a fully connected layer for classification.
It achieved state-of-the-art performance on the ImageNet dataset at the time of its
release.
It was a very deep and complex network. It introduced the idea of using multiple
parallel convolutional filters and showed how to reduce the computational cost using
1x1 convolution.
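A minimal sketch of such an Inception-style module using the Keras functional API; the helper name inception_module and the filter counts are illustrative assumptions, not the exact GoogLeNet configuration:

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)           # 1x1 branch
    b2 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)    # 1x1 reduces channels
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)          # 3x3 branch
    b3 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)          # 5x5 branch
    b4 = layers.MaxPooling2D(3, strides=1, padding='same')(x)                 # pooling branch
    b4 = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(b4)
    return layers.Concatenate()([b1, b2, b3, b4])   # concatenate along the channel axis

inputs = tf.keras.Input(shape=(28, 28, 192))
outputs = inception_module(inputs, 64, 96, 128, 16, 32, 32)
print(outputs.shape)   # (None, 28, 28, 256)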
Inception v2
Inception v2 refines the original Inception Module, most notably by adding a batch normalization layer after each convolutional layer, which helps improve the network's stability and performance.
The refined modules allow for a deeper network with fewer parameters and better performance.
Inception v2 also begins factorizing larger convolutions (for example, replacing a 5x5 convolution with two stacked 3x3 convolutions) to reduce computation.
Inception v2 achieved state-of-the-art performance on several image classification benchmarks, and its architecture has been used as a basis for many subsequent CNNs.
In summary, Inception v2 improved the Inception architecture through refined modules that allow for deeper networks with fewer parameters, and batch normalization layers, which improved the stability and performance of the network.
Inception v3
Inception v3 is the third version of the Inception network architecture, which was
introduced in 2015 by Google.
It builds upon the original Inception v1 and v2 architectures and aims to improve
the performance of CNNs further.
Inception v3 uses a similar modular architecture, where the network comprises
multiple Inception Modules stacked together.
It uses a new type of Inception Module, called the Factorization-Based Inception
Module, which uses factorization techniques to reduce the number of parameters
in the network and improve performance.
Inception v3 also introduces batch normalization layers after each convolutional
layer, which helps improve the network's stability and performance.
Inception v3 achieved state-of-the-art performance on the ImageNet dataset, and
its architecture has been used as a basis for many subsequent CNNs.
In summary, Inception v3 improved the Inception architecture by introducing Factorization-Based Inception Modules, which allow for a deeper network with fewer parameters, and batch normalization layers, which improved the stability and performance of the network.
Inception v4
Inception v4 is the fourth version of the Inception network architecture, which was
introduced in 2016 by Google.
It builds upon the original Inception v1, v2, and v3 architectures and aims to improve
the performance of CNNs further.
Inception v4 uses a similar modular architecture, where the network comprises
multiple Inception Modules stacked together.
It uses a new type of Inception Module, called the Inception-Auxiliary Module, which
provides auxiliary classifiers to improve the network's performance.
Inception v4 also introduces the use of Stem layers, which reduce the spatial
resolution of the input data before it is passed to the Inception Modules.
Inception v4 achieved state-of-the-art performance on several image classification
benchmarks, and its architecture has been used as a basis for many subsequent CNNs.
Inception v4 improved the Inception architecture by introducing Inception-Auxiliary
Modules, which provide auxiliary classifiers to improve the network's performance,
and Stem layers, which reduce the spatial resolution of the input data before it is
passed to the Inception Modules.
Optimizers
In each iteration of the training cycle, the neural network produces predictions on a batch of training samples. This predicted output is compared to the ground truth using a loss function. The gradient of the loss function with respect to the neural network’s weights indicates how these weights have to be updated to bring the model’s outputs closer to the ground truth.
This adjustment is governed by an optimization algorithm. Optimizers utilize gradients
computed by backpropagation to determine the direction and magnitude of
parameter updates, aiming to navigate the model’s high-dimensional parameter
space efficiently.
Optimizers employ various strategies to balance exploration and exploitation, seeking
to escape local minima and converge to optimal or near-optimal solutions.
Gradient Descent
Consider the gradient descent algorithm applied to a convex cost function with a unique minimum. The algorithm starts with a randomly selected initial weight. The gradient vector indicates the direction of the steepest ascent, so the optimization takes incremental steps in the opposite direction of the gradient, moving the initial weight toward the minimum cost point on the curve.
All deep learning model optimization algorithms widely used today are based on Gradient
Descent. Hence, having a good grasp of the technical and mathematical details is essential.
So let’s take a look:
Objective: Gradient Descent aims to find a function’s parameters (weights) that minimize the
cost function. In the case of a deep learning model, the cost function is the average of the loss
for all training samples as given by the loss function. While the loss function is a function of
the model’s output and the ground truth, the cost function is a function of the model’s
weights and biases.
Working steps:
1. Initialize the weights w with random values.
2. Compute the gradient of the cost function with respect to the weights.
3. Update the weights by stepping in the opposite direction of the gradient:
w ← w − 𝛼 ∇w 𝐽(w)
4. Repeat steps 2–3 until the cost stops decreasing (convergence).
Here, w represents the model’s parameters (weights), 𝛼 is the learning rate, and ∇w 𝐽(w) is the gradient of the cost function 𝐽(w) with respect to w.
The learning rate is a crucial hyperparameter that needs to be chosen carefully. If it’s
too small, the algorithm will converge very slowly. If it’s too large, the algorithm might
overshoot the minimum and fail to converge.
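A minimal Python sketch of this update rule on a simple convex cost (the cost function and values here are made up for illustration):

# Gradient descent on J(w) = (w - 3)^2, whose gradient is 2 * (w - 3) and whose minimum is at w = 3
w = 0.0        # arbitrary initial weight
alpha = 0.1    # learning rate

for step in range(100):
    grad = 2 * (w - 3)      # gradient of the cost with respect to w
    w = w - alpha * grad    # step in the opposite direction of the gradient

print(w)  # close to 3.0, the minimum of the cost function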
In stochastic gradient descent, the update is instead computed from a single training example at a time:
w ← w − 𝛼 ∇w 𝐽i(w)
Here, w represents the model’s parameters (weights), 𝛼 is the learning rate, and ∇w 𝐽i(w) is the gradient of the cost function 𝐽i(w) for the ith training example with respect to w.
Adam Optimizer
Working:
Initialization: Start with random initial parameter values and initialize a first moment vector
(m) and a second moment vector (v). Each “moment vector” stores aggregated information
about the gradients of the cost function with respect to the model’s parameters:
The first moment vector accumulates the means (or the first moments) of the
gradients, acting like a momentum by averaging past gradients to determine the
direction to update the parameters.
The second moment vector accumulates the variances (or second moments) of the
gradients, helping adjust the size of the updates by considering the variability of past
gradients.
Both moment vectors are initialized to zero at the start of the optimization. Their size is
identical to the size of the model’s parameters (i.e., if a model has N parameters, both vectors
will be vectors of size N).
Adam also introduces a bias correction mechanism to account for these vectors being
initialized as zeros. The vectors’ initial state leads to a bias towards zero, especially in the early
stages of training, because they haven’t yet accumulated enough gradient information. To
correct this bias, Adam adjusts the calculations of the adaptive learning rate by applying a
correction factor to both moment vectors. This factor grows smaller over time and
asymptotically approaches 1, ensuring that the influence of the initial bias diminishes as
training progresses.
Compute gradient: For each mini-batch, compute the gradients of the cost function
with respect to the parameters.
Update moments: Update the first moment vector (m) with the bias-corrected
moving average of the gradients. Similarly, update the second moment vector (v) with
the bias-corrected moving average of the squared gradients.
Adjust learning rate: Calculate the adaptive learning rate for each parameter using
the updated first and second moment vectors, ensuring effective parameter updates.
Update parameters: Use the adaptive learning rates to update the model’s
parameters.
Mathematical representation: The parameter update rule for Adam can be expressed as
w ← w − α · 𝘮ₜ / (√𝘷ₜ + ε)
Where 𝘸 represents the parameters, α is the learning rate, 𝘮ₜ and 𝘷ₜ are bias-corrected estimates of the first and second moments of the gradients, respectively, and ε is a small constant added for numerical stability.
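A minimal Python sketch of this update for a single scalar parameter, using the commonly cited defaults for beta1, beta2 and epsilon and a larger learning rate so the toy example converges quickly (the toy cost function is an illustrative assumption):

import numpy as np

alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
w, m, v = 0.0, 0.0, 0.0          # parameter and its two moment accumulators

for t in range(1, 101):
    grad = 2 * (w - 3)                       # gradient of an example cost J(w) = (w - 3)^2
    m = beta1 * m + (1 - beta1) * grad       # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1**t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)   # adaptive parameter update

print(w)  # moves from 0.0 toward the minimum at 3.0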
Recurrent Neural Networks (RNN)
The basic architecture of an RNN includes a feedback loop mechanism in which the output of one time step is passed back as input for the next time step.
RNN Differs from Feedforward Neural Networks
Feedforward Neural Networks (FNNs) process data in one direction from input to
output without retaining information from previous inputs. This makes them suitable
for tasks with independent inputs like image classification. However, FNNs struggle
with sequential data since they lack memory.
Recurrent Neural Networks (RNNs) solve this by incorporating loops that allow
information from previous steps to be fed back into the network. This feedback
enables RNNs to remember prior inputs making them ideal for tasks where context is
important.
1. Recurrent Neuron: The fundamental processing unit of an RNN is the recurrent neuron, which maintains a hidden state that is fed back into the unit at the next time step, giving the network its memory.
2. RNN Unfolding:
RNN unfolding, or unrolling, is the process of expanding the recurrent structure over time steps. During unfolding, each step of the sequence is represented as a separate layer in a series, illustrating how information flows across each time step.
This unrolling enables backpropagation through time (BPTT), a learning process where errors are propagated across time steps to adjust the network’s weights, enhancing the RNN’s ability to learn dependencies within sequential data.
RNNs share similarities in input and output structures with other deep learning
architectures but differ significantly in how information flows from input to output.
Unlike traditional deep neural networks, where each dense layer has distinct weight
matrices, RNNs use shared weights across time steps, allowing them to remember
information over sequences.
In RNNs, the hidden state 𝐻𝑖 is calculated for every input 𝑋𝑖 to retain sequential
dependencies. The computations follow these core formulas:
Hidden State Calculation:
h_i = σ(U·x_i + W·h_(i−1) + b)
Here, h represents the current hidden state, U and W are weight matrices (applied to the current input and the previous hidden state, respectively), and b is the bias.
Output Calculation:
y_i = V·h_i + c
where V is the output weight matrix and c is the output bias.
Overall Function:
s_i = f(s_(i−1), x_i)
This function defines the entire RNN operation, where the state matrix S holds each element 𝑠𝑖 representing the network’s state at each time step i.
Working
At each time step RNNs process units with a fixed activation function. These units have
an internal hidden state that acts as memory that retains information from previous
time steps. This memory allows the network to store past knowledge and adapt based
on new inputs.
The current hidden state ℎ𝑡 depends on the previous state ℎ𝑡−1 and the current input 𝑥𝑡, and is calculated using the following relations:
State Update:
ℎ𝑡 = f(ℎ𝑡−1, 𝑥𝑡)
Where: ℎ𝑡 is the current hidden state, ℎ𝑡−1 is the previous state and 𝑥𝑡 is the current input.
Activation Function:
ℎ𝑡 = tanh(𝑊ℎℎ·ℎ𝑡−1 + 𝑊𝑥ℎ·𝑥𝑡)
Here, 𝑊ℎℎ is the weight matrix for the recurrent neuron, and 𝑊𝑥ℎ is the weight matrix for the input neuron.
Output Calculation:
𝑦𝑡 = 𝑊ℎ𝑦·ℎ𝑡
Where, 𝑦𝑡 is the output and 𝑊ℎ𝑦 is the weight at the output layer.
These parameters are updated using backpropagation. However, since RNNs work on sequential data, we use a modified version of backpropagation known as backpropagation through time (BPTT).
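To make the recurrence concrete, here is a minimal NumPy sketch of the forward pass using the formulas above (the dimensions and random weights are illustrative assumptions):

import numpy as np

hidden_size, input_size, output_size, T = 4, 3, 2, 5
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # recurrent weights
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))    # input weights
W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))   # output weights

h = np.zeros(hidden_size)                 # initial hidden state (the network's memory)
xs = rng.normal(size=(T, input_size))     # a dummy input sequence of length T

for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t)    # the same weights are reused at every time step
    y_t = W_hy @ h                        # output at this time step
    print(y_t)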
Types of Recurrent Neural Networks
There are four types of RNNs based on the number of inputs and outputs in the network:
One-to-One RNN: This is the simplest type of neural network architecture, where there is a single input and a single output. It is used for straightforward classification tasks such as binary classification, where no sequential data is involved.
One-to-Many RNN: A single input produces a sequence of outputs, as in image captioning, where one image yields a sequence of words.
Many-to-One RNN: A sequence of inputs produces a single output, as in sentiment analysis, where a sentence is classified as positive or negative.
Many-to-Many RNN: A sequence of inputs produces a sequence of outputs, as in machine translation.
Limitations of Standard RNNs
Vanishing Gradient: When training a model over time, the gradients (which help the model learn) can shrink as they pass through many steps. This makes it hard for the model to learn long-term patterns, since earlier information becomes almost irrelevant.
Exploding Gradient: Sometimes, gradients can grow too large, causing instability. This
makes it difficult for the model to learn properly, as the updates to the model become
erratic and unpredictable.
Both of these issues make it challenging for standard RNNs to effectively capture long-
term dependencies in sequential data.
LSTM Architecture
The LSTM architecture involves a memory cell which is controlled by three gates: the input gate, the forget gate and the output gate. These gates decide what information to add to, remove from, and output from the memory cell.
Input gate: Controls what information is added to the memory cell.
Forget gate: Determines what information is removed from the memory cell.
Output gate: Controls what information is output from the memory cell.
This allows LSTM networks to selectively retain or discard information as it flows through the network, which enables them to learn long-term dependencies. The network also has a hidden state, which acts as its short-term memory. This memory is updated using the current input, the previous hidden state and the current state of the memory cell.
Working of LSTM
The LSTM architecture has a chain structure that contains four interacting neural network layers and memory blocks called cells.
Information is retained by the cells and the memory manipulations are done by the gates.
There are three gates –
Forget Gate
The information that is no longer useful in the cell state is removed with the forget gate. Two
inputs xt (input at the particular time) and ht-1 (previous cell output) are fed to the gate and
multiplied with weight matrices, followed by the addition of a bias. The resultant is passed through the sigmoid activation function, which gives an output between 0 and 1. If for a particular cell state the output is close to 0, the piece of information is forgotten, and for an output close to 1, the information is retained for future use.
The equation for the forget gate is:
f_t = σ(W_f · [h_t-1, x_t] + b_f)
Where:
W_f represents the weight matrix associated with the forget gate.
[h_t-1, x_t] denotes the concatenation of the current input and the previous hidden state.
b_f is the bias with the forget gate.
σ is the sigmoid activation function.
Input gate
The addition of useful information to the cell state is done by the input gate. First, the information is regulated using the sigmoid function, which filters the values to be remembered, similar to the forget gate, using the inputs h_t-1 and x_t. Then, a vector is created using the tanh function, which gives an output from -1 to +1 and contains all the possible values from h_t-1 and x_t. At last, the values of the vector and the regulated values are multiplied to obtain the useful information. The equations for the input gate are:
i_t = σ(W_i · [h_t-1, x_t] + b_i)
Ĉ_t = tanh(W_c · [h_t-1, x_t] + b_c)
The cell state is then updated: multiply the previous cell state C_t-1 by f_t, disregarding the information we had previously chosen to ignore, and then add i_t ∗ Ĉ_t, the candidate values scaled by how much we chose to update each state value:
C_t = f_t ∗ C_t-1 + i_t ∗ Ĉ_t
Output gate
The task of extracting useful information from the current cell state to be presented as output is done by the output gate. First, a vector is generated by applying the tanh function to the cell state. Then, the information is regulated using the sigmoid function, which filters the values to be remembered using the inputs ℎ𝑡−1 and 𝑥𝑡. At last, the values of the vector and the regulated values are multiplied and sent as the output of this cell and as input to the next cell. The equations for the output gate are:
o_t = σ(W_o · [h_t-1, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)
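Putting the three gates together, here is a minimal NumPy sketch of a single LSTM step based on the equations above (the dimensions, random weights and the helper name lstm_step are illustrative assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_t-1, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # input gate
    C_tilde = np.tanh(W_c @ z + b_c)         # candidate values
    C_t = f_t * C_prev + i_t * C_tilde       # updated cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # updated hidden state
    return h_t, C_t

# Dummy dimensions: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = lambda: rng.normal(scale=0.1, size=(n_h, n_h + n_in))
b = lambda: np.zeros(n_h)
h, C = np.zeros(n_h), np.zeros(n_h)
h, C = lstm_step(rng.normal(size=n_in), h, C, W(), W(), W(), W(), b(), b(), b(), b())
print(h, C)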
Bidirectional LSTM
Bidirectional LSTM (Bi-LSTM / BLSTM) is a variation of the normal LSTM which processes
sequential data in both forward and backward directions. This allows Bi-LSTM to learn
longer-range dependencies in sequential data than traditional LSTMs which can only
process sequential data in one direction.
Bi-LSTMs are made up of two LSTM networks one that processes the input sequence
in the forward direction and one that processes the input sequence in the backward
direction.
The outputs of the two LSTM networks are then combined to produce the final output.
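A minimal Keras sketch of this idea, assuming a toy sequence shape and a binary classification head (both are illustrative assumptions):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 8)),            # sequences of length 10 with 8 features each
    layers.Bidirectional(layers.LSTM(16)),    # 16 units per direction, outputs are concatenated
    layers.Dense(1, activation='sigmoid'),    # e.g. a binary classification head
])
model.summary()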