Deep Learning UNIT 5
Deep Learning
Layers,
Activation Functions,
Optimizers,
Build CNN Models From Scratch,
Pretrained Models: AlexNet, VGG-16, VGG-19, GoogLeNet,
ResNet (Residual Network),
Transfer Learning,
Practical Implementation Of CNN And Transfer Learning.
SIFT (Scale Invariant Feature Transform)
Deep Learning vs Machine Learning (metric-wise comparison)

Data requirement
Deep Learning: relies heavily on a huge amount of data, so we must feed a significant amount of data in order to get decent efficiency.
Machine Learning: requires a large quantity of data, but it may also operate with lesser amounts of data.

Training and execution time
Deep Learning: requires a long implementation time for training the models, yet a short time to examine the algorithm.
Machine Learning: takes a shorter time for training the models than deep learning, but it takes a considerable time to verify the hypothesis.

Feature extraction
Deep Learning: an improved type of machine learning that does not require feature extraction for each issue; rather, it attempts to learn high-level features from raw data by itself.
Machine Learning: algorithms require a stage of extracting features by an expert before proceeding.

Outcome interpretation
Deep Learning: we can obtain a more accurate outcome for a specific issue than when working with a machine learning model, but we are unable to determine why this specific result evolved or the logic behind it.
Machine Learning: we can readily comprehend the results, which means we can understand why this outcome occurred and what the approach was.

Applications
Deep Learning: methods are ideal for tackling difficult challenges. Eg: self-driving cars.
Machine Learning: algorithms can be used to solve simple to somewhat complicated issues. Eg: sentiment analysis.

Type of data
Deep Learning: because they depend on the layers of the artificial neural network, deep learning models can operate with both structured and unstructured data.
Machine Learning: the majority of machine learning techniques require structured data.

Strategy
Deep Learning: a deep learning algorithm accepts the data for a particular issue and produces the final result; it takes an end-to-end strategy.
Machine Learning: the standard ML paradigm divides an issue into sub-parts, solves each portion separately, and then produces the final solution.

Hardware requirements
Deep Learning: to perform successfully, the model requires a large quantity of data, which necessitates the use of GPUs and thus a high-end system.
Machine Learning: algorithms can run on low-end devices because they do not require a huge volume of input data.

Human intervention
Deep Learning: more difficult to establish initially, but it takes less intervention after that.
Machine Learning: needs greater continuing human involvement to produce outcomes.
Deep Learning
The term 'Deep Learning' is coined because the neural networks have various layers that enable learning,
unlearning, and relearning.
In terms of the technical field, it uses neural networks that are influenced by the human brain, as these
networks are interconnected nodes of layers which helps in processing information.
Big Data advancements enable larger neural networks. Computers now quickly understand and respond
to complex events, aiding tasks like language translation and image categorization.
Deep learning solves pattern recognition issues independently, without any human assistance.
Deep Learning is a branch of Artificial Intelligence (AI) that enables machines to learn from large
amounts of data.
It uses neural networks with many layers to automatically find patterns and make predictions.
It is very useful for tasks like image recognition, language translation, and speech processing.
Deep learning models learn directly from data, without the need for manual feature extraction.
Why Is Deep Learning Crucial?
Deep learning is crucial because it enables machines to learn complex patterns and make decisions with
minimal human intervention. Its applications drive advancements in fields like healthcare, autonomous
systems, and natural language processing.
2. High Accuracy
In computer vision, audio processing, and natural language processing (NLP), deep learning models yield the
most accurate results.
3. Pattern Recognition
While deep learning models are capable of autonomously detecting a wide range of patterns, most models
require the assistance of a machine learning engineer.
4. Representation Learning
Deep learning models are highly proficient in acquiring hierarchical data representations, automatically
deriving relevant features from unprocessed input.
Popular applications of Deep Learning include self-driving cars, chatbots, medical image
analysis, and recommendation systems.
Deep learning is neural networks with a large number of parameters and layers in one of four
fundamental network architectures:
Unsupervised pre-trained networks
Convolutional neural networks
Recurrent neural networks
Recursive neural networks
Applications of Deep learning
Core components
Parameters
Layers
Activation functions
Loss functions
Optimization methods
Hyper parameters
Parameters
relate to the x parameter vector in the equation Ax = b in basic machine learning.
Parameters in neural networks relate directly to the weights on the connections in the network
use methods of optimization such as gradient descent to find good values for the parameter vector
to minimize loss across our training dataset.
In deep networks, we still have a parameter vector representing the connection in the network
model we’re trying to optimize.
The biggest change in deep networks with respect to parameters is how the layers are
connected in the different architectures.
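As a small illustration of the optimization idea above (a sketch of my own, not from the source material), the following NumPy code uses gradient descent to fit the parameter vector x so that Ax approximates b:

```python
import numpy as np

# Toy system Ax = b; x is the parameter vector we optimize.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.zeros(2)                      # initial parameter values
lr = 0.05                            # learning rate (step size)

for step in range(500):
    grad = 2 * A.T @ (A @ x - b)     # gradient of ||Ax - b||^2 with respect to x
    x -= lr * grad                   # gradient-descent update

print(x)                             # approaches the solution of Ax = b
```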
Layers
In deep learning, "layers" refer to the different stages of processing within a neural network, where data
is transformed through a series of computations.
Layers also can be represented by subnetworks in certain architectures, as well.
Layers are a fundamental architectural unit in deep networks.
Input, hidden, and output layers define feedforward neural networks.
Input Layer: Receives the raw input data.
Hidden Layers: Multiple layers between the input and output layer where most of the computation
happens.
Output Layer: Produces the final prediction based on the processed information from the hidden layers.
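As a concrete illustration (a minimal sketch with arbitrary layer sizes, not prescribed by the source), such a feedforward network can be written in Keras as:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(784,)),              # input layer: raw input features
    layers.Dense(128, activation="relu"),    # hidden layer 1
    layers.Dense(64, activation="relu"),     # hidden layer 2
    layers.Dense(10, activation="softmax"),  # output layer: class probabilities
])
model.summary()
```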
Activation Functions
Activation functions are mathematical functions applied within each layer to introduce non-linearity
and enable the network to learn complex patterns in data by controlling the flow of information
between layers; essentially, activation functions determine which signals should be passed on to the
next layer based on their input values
Activation functions are used in specific architectures to drive feature extraction.
The higher-order features learned from the data in deep networks are a nonlinear transform
applied to the output of the previous layer.
This allows the network to learn patterns in the data within a constrained space.
We group these design decisions for network architecture into two main
areas across all architectures:
Hidden layers
Output layers
Hidden layers are concerned with extracting progressively higher-order features from the
raw data.
Depending on the architecture we’re working with, we tend to use certain subsets of layer
activation functions.
Hidden layer activation functions
Commonly used functions include:
Sigmoid
Tanh
Hard tanh
Rectified linear unit (ReLU) (and its variants)
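A small illustrative NumPy sketch of these activations (hard tanh is approximated here by clipping values to [-1, 1]):

```python
import numpy as np

x = np.linspace(-3, 3, 7)

sigmoid   = 1 / (1 + np.exp(-x))       # squashes values to (0, 1)
tanh      = np.tanh(x)                 # squashes values to (-1, 1)
hard_tanh = np.clip(x, -1.0, 1.0)      # linear in [-1, 1], clipped outside
relu      = np.maximum(0.0, x)         # zero for negatives, identity for positives
```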
Output layer for regression
• If we want to output a single real-valued number from our model, use a linear activation function.
A CNN combines two main building blocks:
A convolution tool that splits the various features of the image for analysis.
A fully connected layer that uses the output of the convolution layer to predict the best
description for the image.
Basic Convolutional Neural Network Architecture
CNN architecture is inspired by the organization and functionality of the visual cortex (the part
of the brain that processes visual information from the eyes; it is located in the occipital lobe at
the back of the brain) and is designed to mimic the connectivity pattern of neurons within the
human brain.
The neurons within a CNN are split into a three-dimensional structure, with each set of neurons
analyzing a small region or feature of the image.
In other words, each group of neurons specializes in identifying one part of the image.
CNNs use the predictions from the layers to produce a final output that presents a vector of
probability scores to represent the likelihood that a specific feature belongs to a certain class.
CNN layers
The input layer accepts three-dimensional input, generally of the spatial size (width ×
height) of the image, with a depth representing the color channels (generally three for RGB color
channels).
The hidden layers carry out feature extraction by performing different calculations and manipulations.
There are multiple hidden layers like the convolution layer, the ReLU layer, and pooling layer, that
perform feature extraction from the image.
Finally, there’s a fully connected layer that identifies the object in the image.
3 major groups:
1. Input layer
2. Feature-extraction (learning) layers
3. Classification layers
The feature-extraction layers have a general repeating pattern of the sequence:
1. Convolution layer
We express the Rectified Linear Unit (ReLU) activation function as a layer in the diagram here to
match up to other literature.
2. Pooling layer
These layers find a number of features in the images and progressively construct higher-order
features.
This corresponds directly to the ongoing theme in deep learning by which features are
automatically learned as opposed to traditionally hand engineered.
Classification layers, in which we have one or more fully connected layers, take the higher-order features
and produce class probabilities or scores.
These layers are fully connected to all of the neurons in the previous layer, as their name implies.
The output of these layers typically has dimensions [1 × 1 × N], where N is the number of classes being evaluated.
Input Layers
Input layers are where we load and store the raw input data of the image for processing in the
network.
This input data specifies the width, height, and number of channels. Typically, the number of
channels is three, for the RGB values for each pixel.
The input layer holds the raw input of the image.
The input layer represents the pixel matrix of the image.
The input layer allows the CNN to receive and make sense of information.
The input layer consists of one or more artificial neurons.
Each neuron receives input from the outside world.
Convolutional layers
Convolutional layers are considered the core building blocks of CNN architectures.
As Figure illustrates, convolutional layers transform the input data by using a patch of locally
connecting neurons from the previous layer.
The layer will compute a dot product between the region of the neurons in the input layer and the
weights to which they are locally connected in the output layer.
In convolutional neural networks (CNNs), the primary components are convolutional layers.
These layers typically involve input vectors, such as an image, filters (or feature detectors), and
output vectors, often referred to as feature maps.
As the input, such as an image, traverses through a convolutional layer, it undergoes abstraction into a
feature map, also known as an activation map. This process involves the convolution operation,
which enables the detection of more complex features within the image.
Additionally, Rectified linear units (ReLU) commonly serve as activation functions within these
layers to introduce non-linearity into the network.
Furthermore, CNNs often employ pooling operations to reduce the spatial dimensions of the feature
maps, leading to a more manageable output volume.
Overall, convolutional layers play a crucial role in extracting meaningful features from the input data,
making them fundamental in tasks such as image classification and natural language processing,
among others, within the realm of machine learning models.
A convolution is a grouping function in mathematics. Convolution occurs in CNNs when two matrices
(rectangular arrays of numbers arranged in columns and rows) combine to generate a third matrix.
In the convolutional layers of a CNN, these convolutions filter input data to extract information.
Position the kernel’s center element above the source pixel. Then, replace the source pixel with a
weighted sum of itself and its neighboring pixels.
Parameter sharing and local connectivity are two principles used in CNNs.
In a feature map, all neurons share weights, which defines parameter sharing. Local connection means
each neuron connects only to a part of the input image, unlike a fully connected neural network where
all neurons connect to every input. This reduces the number of parameters in the system and speeds up
the calculation
Steps in a Convolution Layer
Initialize Filters:
Randomly initialize a set of filters with learnable parameters.
Convolve:
Slide each filter across the width and height of the input and compute the dot product between the
filter and the local input region, producing a feature map.
Apply Activation:
Pass the feature map through a non-linear activation function such as ReLU.
Pooling (Optional):
Often followed by a pooling layer (like max pooling) to reduce the spatial dimensions of the
feature map and retain the most important information.
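The loop below is a minimal NumPy sketch (with a hypothetical 5×5 input and a 3×3 filter) of the sliding-window dot product performed in the convolve step, followed by a ReLU:

```python
import numpy as np

image  = np.random.rand(5, 5)          # single-channel input region
kernel = np.random.rand(3, 3)          # one (randomly initialized) learnable filter

out_h, out_w = image.shape[0] - 2, image.shape[1] - 2   # valid convolution, stride 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]            # local receptive field
        feature_map[i, j] = np.sum(patch * kernel) # dot product of patch and filter

feature_map = np.maximum(0.0, feature_map)         # ReLU non-linearity
```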
Key Components of a CNN
The convolutional neural network is made of four main parts.
They help the CNNs mimic how the human brain operates to recognize patterns and features in
images:
Convolutional layers
Rectified Linear Unit (ReLU for short)
Pooling layers
Fully connected layers
This section dives into the definition of each of these components through the example of
classifying a handwritten digit.
Convolutional layers
The convolution operation is known as the feature detector of a CNN.
The input to a convolution can be raw data or a feature map output from another convolution.
It is often interpreted as a filter in which the kernel filters input data for certain kinds of
information; for example, an edge kernel lets pass through only information from the edge of an
image.
Convolutional layers have parameters for the layer and additional hyper parameters.
Gradient descent is used to train the parameters in this layer such that the class scores are
consistent with the labels in the training set.
Following are the major components of convolutional layers:
Filters
Activation maps
Parameter sharing
Layer-specific hyper parameters
Filters
The parameters for a convolutional layer configure the layer’s set of filters.
A filter is a function with a width and height smaller than the width and height of the input
volume.
Filters (e.g., convolutions) are applied across the width and height of the input volume in a sliding-window
manner, as demonstrated in the figure.
Filters are also applied across the full depth of the input volume. We compute the output of the filter by
taking the dot product of the filter and the input region.
ReLU activation functions as layers
With CNNs, we often see ReLU layers used.
The ReLU layer will apply an element-wise activation function over the input data, thresholding at zero (for example,
max(0, x)), giving us output of the same dimensions as the input to the layer.
Running this function over the input volume will change the pixel values but will not change the spatial dimensions of
the input data in the output.
ReLU layers do not have parameters nor additional hyperparameters.
• ReLU (Rectified Linear Unit): This is the most commonly used activation function in CNNs. It returns 0 if it
receives any negative input, but for any positive value x, it returns that value back. Hence, it can be written as f(x) =
max(0, x). The function is non-linear, which means the output is not proportional to the input. It helps to alleviate the
vanishing gradient problem.
• Leaky ReLU: Leaky ReLU is a variant of ReLU. Instead of being 0 when x < 0, a leaky ReLU allows a small, non-
zero, constant gradient α (Normally, α=0.01). Hence, the function could be written as f(x)=max(αx,x). It mitigates the
dying ReLU problem which refers to the problem when the ReLU neurons become inactive and only output 0 for any
input.
• ReLUs also prevent the emergence of the so-called “vanishing gradient” problem, which is common when using
sigmoidal functions. This problem refers to the tendency for the gradient of a neuron to approach zero for high values
of the input.
• While sigmoidal functions have derivatives that tend to 0 as their input grows large, the derivative of ReLU remains a
constant 1 for positive inputs. This allows backpropagation of the error and learning to continue, even for high values of the
input to the activation function.
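A small illustrative sketch of ReLU and Leaky ReLU as defined above:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # f(x) = max(alpha * x, x): small slope alpha for negative inputs
    return np.maximum(alpha * x, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))         # [0.  0.  0.  1.5]
print(leaky_relu(x))   # [-0.02  -0.005  0.  1.5]
```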
How ReLU Helps:
Non-Saturating: ReLU does not saturate for positive inputs, meaning its gradient remains 1, whereas
functions like sigmoid and tanh have small gradients in their saturated regions.
Avoiding Exponential Decay: The constant gradient of 1 for positive inputs ensures that the gradients do
not diminish exponentially as they propagate backward through the network.
Faster Convergence: ReLU's simple and efficient gradient calculation allows for faster training and
convergence of neural networks compared to saturating activation functions.
Sparsity: ReLU creates sparse activations, which can help in learning more informative representations,
making it suitable for many types of problems.
Limitations of ReLU:
"Dead Neurons": If a neuron receives a constant negative input, its activation and gradient become 0,
causing the neuron to become inactive, or "dead," which can be problematic.
Leaky ReLU and Alternatives: To address the "dead neuron" issue, variations like Leaky ReLU (which
allows a small gradient for negative inputs) or other activation functions can be used.
Convolutional layer hyper parameters
In Convolutional Neural Networks (CNNs), hyperparameters are settings that control the
training process and network structure, like kernel size, stride, pooling size, and the number of
layers, which need to be set before training begins and directly impact performance.
The following hyperparameters dictate the spatial arrangement and size of the output
volume from a convolutional layer:
Number of layers / depth of the layers
Filter (or kernel) size (field size)
Stride
Padding
Number of Layers
The number of layers in a CNN is a critical hyperparameter that determines the depth of the network.
A deeper network can learn more complex features and patterns from the data, but it is also more prone to overfitting.
Therefore, it is important to strike a balance between the number of layers and the complexity of the problem. A good
starting point is to use a small number of layers and gradually increase their depth until the desired performance is
achieved.
Filter Size
The filter size is another important hyperparameter that determines the receptive field of each
convolutional layer.
A larger filter size can capture more information from the input image, but it also increases the number of
parameters in the network.
A smaller filter size can reduce the number of parameters, but it may not be able to capture all the
relevant features in the image.
Therefore, it is important to experiment with different filter sizes and choose the one that gives the best
performance.
Stride
The stride is a hyperparameter that determines the number of pixels by which the filter moves across
the input image.
A larger stride can reduce the size of the output feature maps, but it can also lead to information loss.
A smaller stride can preserve more information, but it also increases the computation time and
memory requirements.
Therefore, it is important to choose an appropriate stride that balances the trade-off between
information loss and computational efficiency. The default stride in a CNN is 1.
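A brief Keras sketch (with an illustrative input size) showing how the number of filters, kernel size, stride, and padding are specified for a convolutional layer, and how they affect the output shape:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(224, 224, 3))         # width x height x RGB channels
x = layers.Conv2D(filters=64, kernel_size=(3, 3),
                  strides=1, padding="same",         # "same" padding keeps 224 x 224
                  activation="relu")(inputs)
x = layers.Conv2D(filters=64, kernel_size=(3, 3),
                  strides=2, padding="valid",        # stride 2, no padding shrinks the map
                  activation="relu")(x)
print(x.shape)   # (None, 111, 111, 64)
```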
Pooling Layers
• Pooling layers are commonly inserted between successive convolutional layers.
• We want to follow convolutional layers with pooling layers to progressively reduce the spatial size
(width and height) of the data representation.
• Pooling layers reduce the data representation progressively over the network and help control overfitting.
• The pooling layer operates independently on every depth slice of the input.
• Pooling layer is used in CNNs to reduce the spatial dimensions (width and
height) of the input feature maps while retaining the most important
information. It involves sliding a two-dimensional filter over each channel
of a feature map and summarizing the features within the region covered
by the filter.
Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to
send to the output array. As an aside, this approach tends to be used more often compared to
average pooling.
Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
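A short illustrative sketch of both pooling types in Keras, each halving the spatial dimensions of a feature map:

```python
import tensorflow as tf
from tensorflow.keras import layers

feature_maps = tf.random.normal((1, 28, 28, 16))     # batch x height x width x channels

max_pooled = layers.MaxPooling2D(pool_size=(2, 2), strides=2)(feature_maps)
avg_pooled = layers.AveragePooling2D(pool_size=(2, 2), strides=2)(feature_maps)

print(max_pooled.shape)   # (1, 14, 14, 16)
print(avg_pooled.shape)   # (1, 14, 14, 16)
```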
Padding
Padding is a technique used to preserve the spatial dimensions of the input image while applying
convolutional layers.
It involves adding zeros around the border of the input image to create a padded image that can be
convolved with the filter.
Padding can help preserve the information at the edges of the image and prevent the loss of spatial
resolution.
However, it also increases the memory requirements and computation time of the network.
Therefore, it is important to experiment with different padding techniques and choose the one that
gives the best performance.
Learning Rate
The learning rate is a hyperparameter that determines the step size at which the network updates its
parameters during training.
A large learning rate can lead to rapid convergence but may result in unstable and oscillating training.
A small learning rate can ensure stable and smooth training but may result in slower convergence.
Therefore, it is important to experiment with different learning rates and choose the one that gives the
best trade-off between training speed and stability.
Fully Connected Layers
• We use this layer to compute class scores that we’ll use as output of the network (e.g., the output layer at
the end of the network).
• The dimensions of the output volume are [1 × 1 × N], where N is the number of output classes we're
evaluating.
• Fully connected layers have the normal parameters for the layer and hyperparameters.
• Fully connected layers perform transformations on the input data volume that are a function of the
activations in the input volume and the parameters (weights and biases of the neurons).
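A minimal sketch (with illustrative sizes) of such a classification head: the feature maps are flattened and passed through fully connected layers ending in N class scores:

```python
import tensorflow as tf
from tensorflow.keras import layers

num_classes = 10                                  # N output classes (assumed)
features = tf.random.normal((1, 7, 7, 512))       # feature maps from the conv base

x = layers.Flatten()(features)                    # flatten to a 1-D feature vector
x = layers.Dense(256, activation="relu")(x)       # fully connected layer
scores = layers.Dense(num_classes, activation="softmax")(x)  # 1 x N class scores
print(scores.shape)   # (1, 10)
```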
Batch Size
The batch size is a hyperparameter that determines the number of samples that are processed by the
network in each training iteration. A larger batch size can reduce the variance of the gradient
estimates and improve the stability of the training. However, it also increases the memory
requirements and may lead to slower convergence. A smaller batch size can reduce the memory
requirements and improve the convergence speed but may lead to noisy gradient estimates.
Therefore, it is important to experiment with different batch sizes and choose the one that gives the
best trade-off between stability and speed.
Other Applications of CNNs
Beyond normal two-dimensional image data, we also see CNNs applied to three dimensional datasets.
Here are some examples of these alternative uses:
MRI data
3D shape data
Graph data
NLP applications
The position-invariant nature of CNNs has proven useful in these domains because we’re not limited
to hand-coding our features to appear in certain “spots” in the feature vector.
Convolution Layer
This is the first layer in the CNN architecture, used to extract features from the image.
Convolving an image with various filters can perform operations such as sharpening and edge
detection.
Pooling Layer
This layer is used to reduce the number of parameters when the image is too large.
There are three types: Max pooling takes the largest element from a feature map;
Average pooling takes the average of all the elements of a feature map; and Sum pooling
takes the sum of all elements of a feature map.
ReLU Layer
ReLU stands for Rectified Linear Unit. It is an activation function applied to the outputs of CNN neurons.
The mathematical formula for Rectified Linear Unit is defined as y = max (0, x).
Fully Connected Layer
The fully connected (FC) layer takes a flattened feature vector as input. Its role is to combine the
high-level features learned by the convolutional layers and merge them into a representation used for
prediction. It passes the flattened output that is generated to the output layer.
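Putting these layers together, a small CNN for the handwritten-digit example used earlier in this section could be built from scratch in Keras roughly as follows (layer sizes are illustrative, not prescribed by the source):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),                       # input layer: grayscale digit image
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                      # fully connected part starts here
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),                # 10 digit classes
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5, batch_size=64)     # given a prepared dataset
```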
Pretrained models:
1. AlexNet
2. VGG-16
3. VGG-19
4. GoogLeNet
5. ResNet (Residual Network)
AlexNet
The convolutional neural network (CNN) architecture known as AlexNet was created
by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who served as Krizhevsky's
PhD advisor.
AlexNet was created to be more computationally efficient than earlier CNN topologies.
It has eight layers, which makes it simpler to train and less prone to overfitting on smaller datasets.
Several architectural improvements were introduced by AlexNet, including the use of rectified linear
units (ReLU) as activation functions, overlapping pooling, and dropout regularisation.
AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers.
It uses traditional stacked convolutional layers with max-pooling in between. Its deep network structure
allows for the extraction of complex features from images.
The architecture employs overlapping pooling layers to reduce spatial dimensions while retaining the
spatial relationships among neighbouring features.
Activation function: AlexNet uses the ReLU activation function and dropout regularization, which
enhance the model’s ability to capture non-linear relationships within the data.
Visualization of Deep Neural Networks
Observations
Decreasing Filter Size: The filter size reduces in each layer, from
larger filters at the beginning to smaller ones deeper in the
architecture, resulting in a smaller feature map shape.
Dropout
A neuron is removed from the neural network during dropout with a probability of 0.5.
A neuron that is dropped does not make any contribution to either forward or backward propagation.
As seen in the graphic below, each input is processed by a separate Neural Network design.
The acquired weight parameters are therefore more reliable and less prone to overfitting.
After this, we have our first dropout layer. The drop-out rate is set to be 0.5.
Then we have the first fully connected layer with a relu activation function.
The size of the output is 4096. Next comes another dropout layer with the dropout rate
fixed at 0.5.
This followed by a second fully connected layer with 4096 neurons and relu activation.
Finally, we have the last fully connected layer, or output layer, with 1000 neurons as we
have 1000 classes in the dataset.
This is the architecture of the Alexnet model. It has a total of 62.3 million learnable
parameters
In the AlexNet architecture, an input image is passed through a convolutional layer and max-pooling
layer twice.
Then, pass it through a series of three convolutional layers followed by a single max-pooling
layer.
After this step, there are three hidden layers followed by the output.
In AlexNet, the overall computation in the final stage would result in a 4096-D vector for every
image that contains the activations of the hidden layer immediately before the classifier.
While most of the layers would utilize the ReLU activation function, the final layer makes use of
the SoftMax activation
summary:
• It has 8 layers with learnable parameters.
• The input to the Model is RGB images.
• It has 5 convolution layers with a combination of max-pooling layers.
• Then it has 3 fully connected layers.
• The activation function used in all layers is Relu.
• It used two Dropout layers.
• The activation function used in the output layer is Softmax.
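A condensed Keras sketch of the AlexNet stack summarized above (an approximation of the published network, omitting details such as local response normalization and the original two-GPU split):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Input(shape=(227, 227, 3)),                          # RGB input image
    layers.Conv2D(96, (11, 11), strides=4, activation="relu"),  # conv 1
    layers.MaxPooling2D((3, 3), strides=2),                     # overlapping pooling
    layers.Conv2D(256, (5, 5), padding="same", activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1000, activation="softmax"),                   # 1000 ImageNet classes
])
alexnet.summary()
```

Its summary reports roughly 62 million learnable parameters, in line with the figure quoted above.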
VGG-16
The "VGG" in VGG-16 stands for "Visual Geometry Group".
It is associated with the Department of Engineering Science at Oxford University.
In the initial stages, this architecture was used to study the accuracies of large-scale classification.
Now, the most common use for the VGG-16 architecture is mainly for solving tasks such as face
recognition and image classification.
VGG16 is a type of CNN (Convolutional Neural Network) that is considered to be one of the best
computer vision models to date.
The creators of this model evaluated the networks and increased the depth using an architecture
with very small (3 × 3) convolution filters, which showed a significant improvement over the prior-art
configurations. They pushed the depth to 16–19 weight layers, giving approximately 138 million trainable
parameters.
The 16 in VGG16 refers to 16 layers that have weights.
In VGG16 there are thirteen convolutional layers, five Max Pooling layers, and three Dense layers
which sum up to 21 layers but it has only sixteen weight layers i.e., learnable parameters layer.
VGG16 takes an input tensor of size 224 × 224 with 3 RGB channels.
The most unique thing about VGG16 is that, instead of having a large number of hyperparameters, it
focuses on convolution layers with a 3x3 filter and stride 1 (always using the same padding) and
max-pool layers with a 2x2 filter and stride 2.
The convolution and max pool layers are consistently arranged throughout the whole architecture
Conv-1 has 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and Conv-4 and
Conv-5 have 512 filters.
Three Fully-Connected (FC) layers follow a stack of convolutional layers: the first two have 4096
channels each, the third performs 1000-way ILSVRC (ImageNet Large Scale Visual Recognition
Challenge ) classification and thus contains 1000 channels (one for each class). The final layer is the
soft-max layer.
VGG-16
Define the VGG16 model as a sequential model:
→ 2 x convolution layers of 64 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 2 x convolution layers of 128 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 3 x convolution layers of 256 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 3 x convolution layers of 512 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 3 x convolution layers of 512 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
After creating all the convolution layers, we pass the data to the dense (fully connected) layers: we flatten
the vector that comes out of the convolutions and add the fully connected layers.
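Rather than stacking these blocks manually, the same 16-layer network, pre-trained on ImageNet, can be loaded from Keras Applications (a minimal sketch; the weights are downloaded on first use):

```python
import tensorflow as tf

# Load VGG16 with ImageNet weights and its original 1000-way classifier head.
vgg16 = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
vgg16.summary()   # 13 conv layers + 3 dense layers, ~138 million parameters
```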
VGG-16
Limitations Of VGG 16:
1. It is very slow to train (the original VGG model was trained on Nvidia Titan GPU for
2–3 weeks).
2. The size of the VGG-16 ImageNet-trained weights is 528 MB, so it takes quite a lot of
disk space and bandwidth, which makes it inefficient.
3. 138 million parameters can lead to the exploding-gradients problem.
VGG16 Vs VGG19
VGG16 is a CNN architecture that was the runner-up in the 2014 ILSVRC (ImageNet) competition. The VGG16
network configuration takes a 224 × 224-pixel image with three channels (R, G, and B). VGG16 has 16
layers.
It follows an arrangement of 13 convolutional layers and 3 fully connected layers, with max-pooling layers
that reduce volume size, and a softmax activation function following the last fully connected layer.
Instead of having a large number of hyper-parameters, VGG16 focuses on 3x3 filter convolution layers
with stride 1 and always utilizes the same padding and MaxPool layer of a 2x2 filter with stride 2.
VGG19 architecture is a variant of the VGG model, consisting of 16 convolutional layers, 3 FC
layers, 5 MaxPool layers and 1 SoftMax layer.
The fixed-size input image is a 224 by 224 pixel with three channels (R, G, and B) which means that the
matrix is of shape (224,224,3).
The concept of the VGG19 model (also VGGNet-19) is the same as VGG16 except that it has 19 weight
layers. The "16" and "19" stand for the number of weight layers (convolutional and fully connected layers)
in the model. This means that VGG19 has three more convolutional layers than VGG16.
Advantages:
Accuracy: VGG19 has achieved state-of-the-art results on the ImageNet dataset and has been
used as a benchmark model for image classification tasks.
Transfer Learning: VGG19 has a large number of pre-trained models available, making it easy
to use for transfer learning in other computer vision tasks.
Simple architecture: The VGG19 architecture is relatively simple, making it easy to understand
and implement.
Feature extraction: The VGG19 model learns to extract rich features from the images, which can
be useful in other computer vision tasks.
Disadvantages:
Large model size: VGG19 has a large number of parameters, which can make it computationally
expensive to train and use.
Limited to image classification: VGG19 is primarily used for image classification tasks, and
may not perform as well in other computer vision tasks.
Limited interpretability: Due to the complex nature of deep learning models, it can be difficult
to understand how VGG19 arrives at its classifications.
Limited flexibility: VGG19 has a fixed architecture, which may not be suitable for all computer
vision tasks, and may require modifications or customizations.
GoogLeNet Model – CNN Architecture
GoogleNet
GoogLeNet (or Inception V1) was proposed by researchers at Google (with the collaboration of various
universities) in 2014 in the research paper titled "Going Deeper with Convolutions".
This architecture was the winner at the ILSVRC 2014 image classification challenge.
It has provided a significant decrease in error rate as compared to previous winners AlexNet (Winner of
ILSVRC 2012) and ZF-Net (Winner of ILSVRC 2013) and significantly less error rate than VGG
(2014 runner up).
This architecture uses techniques such as 1×1 convolutions in the middle of the architecture and global
average pooling.
The GoogLeNet architecture is very different from previous state-of-the-art architectures such as
AlexNet and ZF-Net
It uses many different kinds of methods such as 1×1 convolution and global average pooling that
enables it to create deeper architecture
Since AlexNet, the state-of-the-art CNN architecture is going deeper and deeper.
While AlexNet had only 5 convolutional layers, the VGG network and GoogleNet (also code named Inception_v1)
had 19 and 22 layers respectively.
After the first CNN-based architecture (AlexNet) won the ImageNet 2012 competition, every subsequent winning
architecture used more layers in a deep neural network to reduce the error rate.
However, increasing network depth does not work by simply stacking layers together.
Deep networks are hard to train because of the notorious vanishing gradient problem
While backpropagating, the chain rule is followed: the derivatives of each layer are multiplied down the network.
When many deep layers are used with activations like sigmoid in the hidden layers, the derivatives are scaled down below 0.25
at each layer.
So when the derivatives of n layers are multiplied, the gradient decreases exponentially as we propagate down to
the initial layers.
As the gradient is back-propagated to earlier layers, this repeated multiplication may make the gradient vanishingly small.
As a result, as the network goes deeper, its performance gets saturated or even starts degrading rapidly.
One of the main challenges with increasing the depth of CNNs is that it can lead to the problem of
vanishing gradients. This is because as the network becomes deeper, the gradients of the loss function
with respect to the weights of the network become smaller and smaller. This makes it difficult for the
network to learn effectively.
Another challenge with increasing the depth of CNNs is that it can lead to an explosion in
computational requirements. This is because the number of computations required to propagate an
input through a deep network grows rapidly with the depth of the network.
GoogLeNet addressed the challenges of previous CNN architectures by introducing the concept of
inception modules. Inception modules are a type of building block that allows for the parallel processing
of data at multiple scales. This allows the network to capture features at different scales more efficiently
than previous architectures.
An inception module typically consists of several convolutional layers with different filter sizes.
These layers are arranged in parallel, so that the network can process the input data at multiple
resolutions simultaneously.
The output of the convolutional layers is then concatenated and passed through a pooling layer.
However, later there were various versions of the inception module, integrated into the architecture
accordingly, consisting of different layers and filter-size patterns.
The "naive" version of an Inception module, as used in the original GoogLeNet (Inception v1), directly
applies multiple convolution filters (1x1, 3x3, and 5x5) in parallel to the input, aiming to capture
features at different scales.
Inception module with naive version
This module simultaneously performs 1 × 1 convolutions, 3 × 3 convolutions, 5 × 5 convolutions, and 3 ×
3 max pooling operations.
Thereafter, it gathers the outputs from all the operations in a single place (by concatenation) and builds the
next feature map. The architecture does not follow the sequential model approach, where every operation such
as pooling or convolution is performed one after the other.
Since the inception module extracts a different kind of information from every convolution or
pooling operation, different features are extracted from each operation.
After the individual operations have been performed simultaneously all the extracted data will be
combined into a single feature map with all the properties. This will in turn increase the accuracy of the
model as it will focus on multiple features simultaneously.
The output dimensions of the extracted feature maps will differ, because the kernel size for every
operation is not the same. These different feature maps, generated through different operations, are
concatenated together, using padding so that the spatial output dimensions of every operation are
the same.
Inception module with dimension reduction
The Inception module with dimension reduction works in a similar manner as the naïve one with only
one difference. Here features are extracted on a pixel level using 1 * 1 convolutions before the 3 * 3
convolutions and 5 * 5 convolutions. When the 1 * 1 convolution operation is performed the dimension
of the image is not changed. However, the output achieved offers better accuracy.
Four Parallel Channel Processing
1 * 1 Convolution Operation:
The input feature map can be reduced in dimension without too much loss of the input
information. This operation has a receptive field of a single pixel, as it gathers data at the pixel level.
3 * 3 Convolution Operation:
The operation increases the receptive field of the feature map. This allows the kernel to gather information
regarding various shapes and sizes.
5 * 5 Convolution Operation:
The operation further increases the receptive field of the feature map.
3 * 3 Max Pooling:
The pooling layer will lose space information. However, it will be effectively applied on various space fields,
increasing the effectiveness of the four-channel parallel processing.
While implementing various operations simultaneously we might lose certain information or dimensions. But
this is completely fine: if one convolution operation does not capture a certain feature, another operation will.
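A hedged Keras functional-API sketch of an Inception module with dimension reduction (1×1 convolutions before the 3×3 and 5×5 branches, and after the pooling branch); the filter counts shown follow the commonly quoted Inception (3a) block:

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    # Branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    # Branch 2: 1x1 reduction then 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(b2)
    # Branch 3: 1x1 reduction then 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(b3)
    # Branch 4: 3x3 max pooling then 1x1 projection
    b4 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, (1, 1), padding="same", activation="relu")(b4)
    # Concatenate all branches along the channel axis
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(28, 28, 192))
out = inception_module(inputs, 64, 96, 128, 16, 32, 32)   # Inception (3a) filter counts
print(out.shape)   # (None, 28, 28, 256)
```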
Disadvantage:
Larger models using InceptionNet are prone to overfitting, especially with a limited number of labeled samples.
The model can be biased towards certain classes whose labels are present in higher volume than the others.
Global Average Pooling :
In the previous architecture such as AlexNet, the fully connected layers are used at the end of the network.
These fully connected layers contain the majority of parameters of many architectures that causes an increase in
computation cost.
In the GoogLeNet architecture, a method called global average pooling is used at the end of the network.
This layer takes a feature map of 7×7 and averages it to 1×1. This also reduces the number of trainable
parameters (the layer itself has none) and improves the top-1 accuracy by 0.6%.
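A short illustrative sketch of how global average pooling collapses each 7×7 feature map to a single value per channel:

```python
import tensorflow as tf
from tensorflow.keras import layers

feature_maps = tf.random.normal((1, 7, 7, 1024))        # output of the last conv block

gap = layers.GlobalAveragePooling2D()(feature_maps)     # averages each 7x7 map to 1x1
print(gap.shape)   # (1, 1024) -- this layer has no trainable parameters
```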
Auxiliary Classifier for Training:
Inception architecture used some intermediate classifier branches in the middle of the
architecture, these branches are used during training only.
These branches consist of a 5×5 average pooling layer with a stride of 3, a 1×1 convolution
with 128 filters, two fully connected layers with 1024 and 1000 outputs, and a softmax
classification layer.
The loss generated by these layers is added to the total loss with a weight of 0.3. These layers help in
combating the vanishing gradient problem and also provide regularization.
The overall architecture is 22 layers deep. The architecture was designed to keep computational
efficiency in mind.
The idea behind that the architecture can be run on individual devices even with low computational
resources.
The architecture also contains two auxiliary classifiers connected to the outputs of the Inception (4a)
and Inception (4d) layers.
Architecture Details
(1×1) 64 filters – layer 1
(1×1) 64 + (3×3) 64 + (1×1) 256, repeated 3 times – 9 layers
(1×1) 128 + (3×3) 128 + (1×1) 512, repeated 4 times – 12 layers
(1×1) 256 + (3×3) 256 + (1×1) 1024, repeated 6 times – 18 layers
(1×1) 512 + (3×3) 512 + (1×1) 2048, repeated 3 times – 9 layers
Fully connected layer with 1000 nodes
ResNET50
• ResNet, short for Residual Network is a specific type of neural
network that was introduced in 2015 by Kaiming He, Xiangyu Zhang,
Shaoqing Ren and Jian Sun in their paper “Deep Residual Learning
for Image Recognition”
ResNet using Keras
An open-source, Python-based neural network framework called Keras may be used with
TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. It is made to make deep neural
network experimentation quick.
The following ResNet implementations are part of Keras Applications and offer ResNet V1 and
ResNet V2 with 50, 101, or 152 layers.
ResNet50
ResNet101
ResNet152
ResNet50V2
ResNet101V2
ResNet152V2
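Any of these variants can be instantiated directly from Keras Applications; a minimal sketch (the weights are downloaded on first use):

```python
import tensorflow as tf

# ResNet V1 with 50 layers and its original ImageNet classifier head.
resnet50 = tf.keras.applications.ResNet50(weights="imagenet", include_top=True)
resnet50.summary()

# A V2 variant with a different depth is loaded the same way, here without the head.
resnet101v2 = tf.keras.applications.ResNet101V2(weights="imagenet", include_top=False,
                                                input_shape=(224, 224, 3))
```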
Transfer learning
The reuse of a pre-trained model on a new problem is known as transfer learning in machine
learning.
From a deep learning perspective, the image classification problem can be solved through transfer
learning.
Transfer learning is a popular method in computer vision because it allows us to build accurate
models in a timesaving way.
With transfer learning, instead of starting the learning process from scratch, you start from patterns
that have been learned when solving a different problem.
This way you leverage previous learnings and avoid starting from scratch.
i.e., a machine exploits the knowledge gained from a previous task to improve generalization
about another.
For example, in training a classifier to predict whether an image contains food, you could use the
knowledge it gained during training to recognize drinks.
In computer vision, transfer learning is usually expressed through the use of pre-trained models.
A pre-trained model is a model that was trained on a large benchmark dataset to solve a problem
similar to the one that we want to solve.
Accordingly, due to the computational cost of training such models, it is common practice to
import and use models from published literature (e.g. VGG, Inception, MobileNet).
CNN Vs Transfer learning
• For a CNN trained from scratch you need to do more preprocessing of the dataset, but with
transfer learning you only need a little processing of the dataset, such as resizing images to
227 x 227 or 224 x 224 according to the selected pre-trained model (AlexNet, GoogLeNet,
ResNet, VGG networks, etc.); this saves much of the data-preprocessing time.
• Convolutional neural networks
• Several pre-trained models used in transfer learning are based on
large convolutional neural networks (CNN).
• In general, CNN was shown to excel in a wide range of computer
vision tasks.
• Its high performance and its ease of training are two of the main
factors driving the popularity of CNNs over the last years.
architecture of a model based
on CNN
• A typical CNN has two parts:
• Convolutional base, which is composed of a stack of
convolutional and pooling layers. The main goal of the
convolutional base is to generate features from the image.
• Classifier, which is usually composed of fully connected
layers. The main goal of the classifier is to classify the image
based on the detected features. A fully connected layer is a
layer whose neurons have full connections to all activations in
the previous layer.
• One important aspect of these deep learning models is that they can
automatically learn hierarchical feature representations.
• This means that features computed by the first layer are general and can
be reused in different problem domains, while features computed by the
last layer are specific and depend on the chosen dataset and task.
• The convolutional base of our CNN — especially its lower layers (those
that are closer to the inputs) — refers to general features, whereas the
classifier part, and some of the higher layers of the convolutional base,
refer to specialised features.
• Repurposing a pre-trained model
• start by removing the original classifier, then you add a new classifier that fits your
purposes, and finally you have to fine-tune your model according to one of three
strategies:
1.Train the entire model. Use the architecture of the pre-trained model and train it
according to your dataset. You’re learning the model from scratch, so you’ll need a
large dataset (and a lot of computational power).
2.Train some layers and leave the others frozen. lower layers refer to general
features (problem independent), while higher layers refer to specific features (problem
dependent). Usually, if you’ve a small dataset and a large number of parameters, you’ll
leave more layers frozen to avoid overfitting. By contrast, if the dataset is large and the
number of parameters is small, you can improve your model by training more layers to
the new task since overfitting is not an issue.
3.Freeze the convolutional base. This case corresponds to an extreme situation of the
train/freeze trade-off. The main idea is to keep the convolutional base in its original
form and then use its outputs to feed the classifier. You're using the pre-trained model
as a fixed feature extraction mechanism, which can be useful if you're short on
computational power, your dataset is small, and/or the pre-trained model solves a problem
very similar to the one you want to solve.
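A hedged Keras sketch of strategy 3: the convolutional base of a pre-trained model is frozen and only a new classifier is trained on top (the number of target classes here is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

num_classes = 5   # assumed number of classes in the new task

# Pre-trained convolutional base without the original classifier head.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False            # freeze the convolutional base

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),            # new classifier
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, epochs=10)   # given prepared datasets
```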
Fine-tuning strategies
Transfer learning process
• The entire transfer learning process can be summarized as follows:
• 1. Select a pre-trained model. From the wide range of pre-trained models that are available, you pick one that
looks suitable for your problem. Ex: If you're using Keras, you immediately have access to a set of models, such as
VGG, InceptionV3, and ResNet50.
• 2. Classify your problem according to the Size-Similarity Matrix.
• This matrix controls your choices.
• It classifies your CV problem considering the size of your dataset and its similarity to the dataset on which the
pre-trained model was trained.
• As a rule of thumb, consider that your dataset is small if it has less than 1000 images per class. Regarding dataset
similarity, let common sense prevail.
• For example, if your task is to identify cats and dogs, ImageNet would be a similar dataset because it has images of
cats and dogs. However, if your task is to identify cancer cells, ImageNet can’t be considered a similar dataset.
3. Fine-tune your model. You can use the Size-Similarity Matrix to guide your choice and then
refer to the three options mentioned before about repurposing a pre-trained model.
Size-Similarity matrix (left) and decision map for fine-tuning pre-trained
models (right)
• Quadrant 1. Large dataset, but different from the pre-trained model’s dataset.
This situation will lead you to Strategy 1. Since you have a large dataset, you’re
able to train a model from scratch and do whatever you want. Despite the
dataset dissimilarity, in practice, it can still be useful to initialize your model from
a pre-trained model, using its architecture and weights.
• Quadrant 2. Large dataset and similar to the pre-trained model’s dataset. Here
you’re in la-la land. Any option works. Probably, the most efficient option
is Strategy 2. Since we have a large dataset, overfitting shouldn’t be an issue, so
we can learn as much as we want. However, since the datasets are similar, we can
save ourselves from a huge training effort by leveraging previous knowledge.
Therefore, it should be enough to train the classifier and the top layers of the
convolutional base.
• Quadrant 3. Small dataset and different from the pre-trained model’s
dataset. This is the 2–7 off-suit hand of computer vision problems.
Everything is against you. If complaining is not an option, the only hope you
have is Strategy 2. It will be hard to find a balance between the number of
layers to train and freeze. If you go too deep your model can overfit; if you
stay in the shallow end of your model you won't learn anything useful.
Probably, you'll need to go deeper than in Quadrant 2 and you'll need to
consider data augmentation techniques.
• Quadrant 4. Small dataset, but similar to the pre-trained model’s dataset.
You just need to remove the last fully-connected layer (output layer), run the
pre-trained model as a fixed feature extractor, and then use the resulting
features to train a new classifier.
Optimization Algorithms
Training a model in machine learning involves finding the best set of values for the parameter vector
of the model.
We can think of machine learning as an optimization problem in which we minimize the loss function
with respect to the parameters of our prediction function (based on our model).
optimization algorithms are divided into two camps:
First-order
Second-order
First-order optimization algorithms calculate the Jacobian matrix.
The Jacobian has one partial derivative per parameter (to calculate partial derivatives, all other
variables are momentarily treated as constants).
The algorithm then takes one step in the direction specified by the Jacobian.
Second-order algorithms calculate the derivative of the Jacobian (i.e., the derivative of a matrix of
derivatives) by approximating the Hessian.
Second order methods take into account interdependencies between parameters when choosing how
much to modify each parameter.
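In practice (a hedged Keras sketch, not from the source), these first-order methods are the ones typically selected when compiling a model; plain SGD, SGD with momentum, and Adam are common choices:

```python
import tensorflow as tf

# Plain SGD, SGD with momentum, and Adam -- all first-order optimizers.
sgd      = tf.keras.optimizers.SGD(learning_rate=0.01)
momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adam     = tf.keras.optimizers.Adam(learning_rate=0.001)

# The chosen optimizer is passed at compile time, e.g.:
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])
```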
• Layer size
• Layer size is defined by the number of neurons in a given layer.
• Input and output layers are relatively easy to figure out because they correspond directly to how our modeling
problem handles input and output.
• For the input layer, this will match up to the number of features in the input vector.
• For the output layer, this will either be a single output neuron or a number of neurons matching the number of
classes we are trying to predict.
• Deciding on neuron counts for each hidden layer is where hyper parameter tuning becomes a challenge.
• We can use an arbitrary number of neurons to define a layer and there are no rules about how big or small this
number can be.
• However, how complex of a problem we can model is directly correlated to how many neurons are in the
hidden layers of our networks.
• Depending on the deep network architecture, the connection schema between layers can vary.
• However, the weights on the connections are the parameters we must train.
• As we include more parameters in our model, we increase the amount of effort needed to train the
network.
• Large parameter counts can lead to long training times and models that struggle to find
convergence.
Magnitude hyper parameters
• Hyper parameters in the magnitude group involve the gradient, step size, and momentum.
Learning rate
• a hyperparameter that controls how much we adjust the weights of our network with respect to the
loss gradient. The lower the value, the slower we travel along the downward slope.
• The learning rate in machine learning is how fast we change the parameter vector as we move through search
space.
• If the learning rate becomes too high, we can move toward our goal faster
(least amount of error for the function being evaluated), but we might also
take a step so large that we shoot right past the best answer to the problem.
• A convolutional neural network that can robustly classify objects even if they are placed
in different orientations is said to have the property called invariance.
• More specifically, a CNN can be invariant to translation, viewpoint, size or
illumination (Or a combination of the above).
data augmentation techniques
used for images
• Position augmentation. Scaling. Cropping. Flipping. Padding. Rotation.
Translation. Affine transformation.
• Color augmentation. Brightness. Contrast. Saturation. Hue.
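A hedged Keras sketch combining some of these position and color augmentations as preprocessing layers:

```python
import tensorflow as tf
from tensorflow.keras import layers

augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),          # flipping
    layers.RandomRotation(0.1),               # rotation (fraction of a full circle)
    layers.RandomTranslation(0.1, 0.1),       # translation in height and width
    layers.RandomZoom(0.2),                   # scaling / cropping effect
    layers.RandomContrast(0.2),               # color augmentation: contrast
])

images = tf.random.uniform((4, 224, 224, 3))  # a dummy batch of images
augmented = augmentation(images, training=True)
print(augmented.shape)   # (4, 224, 224, 3)
```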