
Unit-V

Deep Learning
 Layers,
 Activation Functions,
 Optimizers,
 Build CNN Models From Scratch,
 Pretrained Models: AlexNet, VGG-16, VGG-19, GoogLeNet, ResNet,
 Transfer Learning,
 Practical Implementation Of CNN And Transfer Learning.
SIFT (Scale-Invariant Feature Transform)
Deep Learning vs. Machine Learning

 Data requirement
Deep Learning: relies heavily on a huge amount of data; we must feed a significant amount of data in order to get decent efficiency.
Machine Learning: also requires data, but it may operate with lesser amounts of data.

 Training and execution time
Deep Learning: requires a long implementation time for training the models, yet takes a short time to run and examine the algorithm.
Machine Learning: takes a shorter time for training the model than deep learning, but it takes a considerable time to verify the hypothesis.

 Feature extraction
Deep Learning: an improved type of machine learning that does not require feature extraction for each issue; rather, it attempts to learn high-level features from raw data by itself.
Machine Learning: ML algorithms require a stage of extracting features by an expert before proceeding.

 Outcome interpretation
Deep Learning: we can obtain a more accurate outcome for a specific issue than with a machine learning model, but we are unable to determine why this specific result evolved and the logic behind it.
Machine Learning: we can readily comprehend the results, which means we can understand why an outcome occurred and what the approach was.

 Applications
Deep Learning: ideal for tackling difficult challenges. E.g., self-driving cars.
Machine Learning: used to solve simple to somewhat complicated issues. E.g., sentiment analysis.

 Type of data
Deep Learning: because they depend on the layers of the artificial neural network, deep learning models can operate on both structured and unstructured data.
Machine Learning: the majority of machine learning techniques require structured data.

 Strategy
Deep Learning: accepts the data for a particular issue and produces the final result directly; as a result, it takes an end-to-end strategy.
Machine Learning: the standard ML paradigm divides an issue into sub-parts and solves each portion separately before producing the final solution.

 Hardware requirements
Deep Learning: requires a large quantity of data to perform successfully, which necessitates the use of GPUs and thus a high-end system.
Machine Learning: algorithms can run on low-end devices because they do not require a huge volume of data.

 Human intervention
Deep Learning: more difficult to establish initially, but it takes less human interaction after that.
Machine Learning: needs greater continuing human involvement to produce outcomes.
Deep Learning
 The term 'Deep Learning' is coined because the neural networks have various layers that enable learning,
unlearning, and relearning.
 In technical terms, it uses neural networks inspired by the human brain; these networks are interconnected nodes arranged in layers, which helps in processing information.
 Big Data advancements enable larger neural networks. Computers now quickly understand and respond
to complex events, aiding tasks like language translation and image categorization.
 Deep learning solves pattern recognition issues independently, without any human assistance.
 Deep Learning is a branch of Artificial Intelligence (AI) that enables machines to learn from large
amounts of data.
 It uses neural networks with many layers to automatically find patterns and make predictions.
 It is very useful for tasks like image recognition, language translation, and speech processing.
 Deep learning models learn directly from data, without the need for manual feature extraction.
Why Is Deep Learning Crucial?
Deep learning is crucial because it enables machines to learn complex patterns and make decisions with
minimal human intervention. Its applications drive advancements in fields like healthcare, autonomous
systems, and natural language processing.

1. Managing Huge Data


Deep Learning models are able to quickly analyze enormous amounts of data because of the development of
graphics processing units (GPUs).

2. High Accuracy
In computer vision, audio processing, and natural language processing (NLP), deep learning models yield the
most accurate results.

3. Pattern Recognition
Deep learning models are capable of autonomously detecting a wide range of patterns, whereas most machine learning models require the assistance of a machine learning engineer to identify them.

4. Representation Learning
Deep learning models are highly proficient in acquiring hierarchical data representations, automatically
deriving relevant features from unprocessed input.
Popular applications of Deep Learning include self-driving cars, chatbots, medical image
analysis, and recommendation systems.
Deep learning is neural networks with a large number of parameters and layers in one of four
fundamental network architectures:
Unsupervised pre-trained networks
Convolutional neural networks
Recurrent neural networks
Recursive neural networks
Applications of Deep learning

 Self Driving Cars or Autonomous Vehicles


 News Accumulation and Fake News Detection
 Natural Language Processing (NLP)
 Virtual Assistants
 Visual Recognition
 Deep Learning Applications in Healthcare
 Personalization
 Colorization of Black and White Images
 Adding Sounds to Silent Movies
 Automatic Machine Translation
 Automated image sharpening
 Generating fonts
 Image autofill for missing regions
 Automated image captioning
Common Architectural Principles of Deep Networks

Core components
 Parameters
 Layers
 Activation functions
 Loss functions
 Optimization methods
 Hyper parameters
Parameters
 relate to the x parameter vector in the equation Ax = b in basic machine learning.
 Parameters in neural networks relate directly to the weights on the connections in the network
 use methods of optimization such as gradient descent to find good values for the parameter vector
to minimize loss across our training dataset.
 In deep networks, we still have a parameter vector representing the connections in the network model we’re trying to optimize.
 The biggest change in deep networks with respect to parameters is how the layers are
connected in the different architectures.
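A minimal NumPy sketch of this idea (the matrix A, vector b, and learning rate below are made-up illustrative values): gradient descent repeatedly nudges the parameter vector x against the gradient of the squared error ||Ax - b||^2, just as network weights are nudged to minimize training loss.

import numpy as np

# Gradient descent on the least-squares loss ||Ax - b||^2,
# analogous to adjusting network weights to minimize training loss.
A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([5.0, 10.0])
x = np.zeros(2)            # parameter vector, initialized to zeros
lr = 0.05                  # learning rate (a hyperparameter)

for step in range(500):
    grad = 2 * A.T @ (A @ x - b)   # gradient of the squared error w.r.t. x
    x -= lr * grad                 # move the parameters against the gradient

print(x)   # approaches the solution of Ax = b, here approximately [1, 3]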
Layers
 In deep learning, "layers" refer to the different stages of processing within a neural network, where data is transformed through a series of computations.
 Layers can also be represented by subnetworks in certain architectures.
 Layers are a fundamental architectural unit in deep networks.
 Input, hidden, and output layers define feedforward neural networks.
 Input Layer: Receives the raw input data.
 Hidden Layers: Multiple layers between the input and output layer where most of the computation
happens.
 Output Layer: Produces the final prediction based on the processed information from the hidden layers.
Activation Functions
 Activation functions are mathematical functions applied within each layer to introduce non-linearity
and enable the network to learn complex patterns in data by controlling the flow of information
between layers; essentially, activation functions determine which signals should be passed on to the
next layer based on their input values
 Activation functions are used in specific architectures to drive feature extraction.
 The higher-order features learned from the data in deep networks are a nonlinear transform
applied to the output of the previous layer.
 This allows the network to learn patterns in the data within a constrained space.
 We group these design decisions for network architecture into two main areas across all architectures:
Hidden layers
Output layers
 Hidden layers are concerned with extracting progressively higher-order features from the
raw data.
 Depending on the architecture we’re working with, we tend to use certain subsets of layer
activation functions.
 Hidden layer activation functions
 Commonly used functions include:
Sigmoid
Tanh
Hard tanh
Rectified linear unit (ReLU) (and its variants)
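A minimal NumPy sketch of these hidden-layer activation functions (the definitions follow the common textbook forms; exact variants differ by framework):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                     # squashes values into (-1, 1)

def hard_tanh(x):
    return np.clip(x, -1.0, 1.0)          # linear in [-1, 1], clipped outside

def relu(x):
    return np.maximum(0.0, x)             # passes positives, zeroes out negatives

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), hard_tanh(z), relu(z))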
Output layer for regression

• If we want to output a single real-valued number from our model, use a linear activation function.

Output layer for binary classification


• use a sigmoid output layer with a single neuron to give us a real value in the range of 0.0 to 1.0
(excluding those values) for the single class.
• This real-valued output is typically interpreted as the probability that the input belongs to the positive class.

 Output layer for multiclass classification


use a softmax output layer with an argmax() function to get the highest score of all the classes.
The softmax output layer gives us a probability distribution over all the classes.
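A short, illustrative Keras sketch of the three output-layer choices above (the layer sizes and the 10-class example are assumptions for illustration):

from tensorflow.keras import layers
import numpy as np

# Illustrative output heads for regression, binary, and multiclass classification.
regression_head = layers.Dense(1, activation="linear")    # single real value
binary_head     = layers.Dense(1, activation="sigmoid")   # probability in (0, 1)
multiclass_head = layers.Dense(10, activation="softmax")  # distribution over 10 classes

# Picking the predicted class from a softmax output with argmax:
scores = np.array([[0.1, 0.7, 0.2]])
predicted_class = np.argmax(scores, axis=1)   # -> array([1])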
Loss Functions
 Loss functions quantify the agreement between the predicted output (or label) and the ground truth
output.
 We use loss functions to determine the penalty for an incorrect classification of an input vector.
 loss functions:
 Mean Squared Error (MSE):
MSE is a common regression loss function. It measures the average of the squared differences between predicted and actual values.
 Mean Absolute Error (MAE):
MAE is another regression loss function that calculates the average of the absolute differences between predicted and actual values.
 Root Mean Squared Error (RMSE):
RMSE is the square root of MSE, providing a measure of the average magnitude of the errors in the same unit as the target variable. It is a popular choice for regression evaluation.
Huber Loss:
Huber loss is a combination of MSE and MAE. It is less sensitive to outliers than MSE,
providing a compromise between the robustness of MAE and the sensitivity of MSE and it is
suitable for regression tasks
Likelihood Loss:
Likelihood loss is often task-specific and depends on the likelihood function chosen for the
probabilistic model. It measures how well the model’s predicted probabilities align with the
observed data.

Binary Cross-Entropy Loss:


Binary Cross-Entropy is a loss function for binary classification problems. It measures the
dissimilarity between predicted probabilities and actual binary labels, encouraging the model to
correctly predict the class.
Hinge loss
Hinge loss is commonly used in support vector machines (SVMs) and binary classification.
It penalizes misclassifications and encourages a margin between decision values, making the
model more robust.
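Minimal NumPy sketches of several of the loss functions above (these follow the standard textbook definitions; framework implementations may differ in reductions and conventions):

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    p = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge(y_true, scores):                      # labels in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

y, yhat = np.array([1.0, 0.0, 1.0]), np.array([0.9, 0.2, 0.6])
print(mse(y, yhat), mae(y, yhat), rmse(y, yhat), binary_cross_entropy(y, yhat))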
Convolutional Neural Network (CNN)
 A Convolutional Neural Network (CNN) is a deep learning algorithm that can recognize and classify
features in images for computer vision.
 It is a multi-layer neural network designed to analyze visual inputs and perform tasks such as image
classification, segmentation and object detection, which can be useful for autonomous vehicles.
 CNNs can also be used for deep learning applications in healthcare, such as medical imaging.
 There are two main parts to a CNN:

A convolution tool that separates and identifies the various features of the image for analysis

A fully connected layer that uses the output of the convolution layer to predict the best
description for the image.
Basic Convolutional Neural Network Architecture
 CNN architecture is inspired by the organization and functionality of the visual cortex (the part
of the brain that processes visual information from the eyes. It's located in the occipital lobe at
the back of the brain) and designed to mimic the connectivity pattern of neurons within the
human brain.

 The neurons within a CNN are split into a three-dimensional structure, with each set of neurons
analyzing a small region or feature of the image.

 In other words, each group of neurons specializes in identifying one part of the image.

 CNNs use the predictions from the layers to produce a final output that presents a vector of
probability scores to represent the likelihood that a specific feature belongs to a certain class.
CNN layers

 A CNN is composed of several kinds of layers:


 Convolutional layer—creates a feature map to predict the class probabilities for each feature by applying a filter that scans the whole image, a few pixels at a time.
 Pooling layer (downsampling)—scales down the amount of information the convolutional layer generated for each feature and maintains the most essential information (the process of the convolutional and pooling layers usually repeats several times).
 Fully connected input layer—“flattens” the outputs generated by previous layers to turn them into a
single vector that can be used as an input for the next layer.
 Fully connected layer—applies weights over the input generated by the feature analysis to predict an
accurate label.
 Fully connected output layer—generates the final probabilities to determine a class for the image.
CNN Architecture Overview
 CNNs transform the input data from the input layer through all connected layers into a set of class scores
given by the output layer.
 There are many variations of the CNN architecture, but they are based on the pattern of layers, as
demonstrated in Figure.
Convolutional Neural Network (CNN) is the extended version of artificial neural networks (ANN)
which is predominantly used to extract the feature from the grid-like matrix dataset. For example visual
datasets like images or videos where data patterns play an extensive role.
Layers

 The input layer accepts three-dimensional input, generally in the form of the spatial size (width × height) of the image, with a depth representing the color channels (generally three, for the RGB color channels).
 The hidden layers carry out feature extraction by performing different calculations and manipulations.
 There are multiple hidden layers like the convolution layer, the ReLU layer, and pooling layer, that
perform feature extraction from the image.
 Finally, there’s a fully connected layer that identifies the object in the image.
 3 major groups:
1. Input layer
2. Feature-extraction (learning) layers
3. Classification layers
 The feature-extraction layers have a general repeating pattern of the sequence:
 1. Convolution layer
 We express the Rectified Linear Unit (ReLU) activation function as a layer in the diagram here to
match up to other literature.
 2. Pooling layer
 These layers find a number of features in the images and progressively construct higher-order
features.
 This corresponds directly to the ongoing theme in deep learning by which features are
automatically learned as opposed to traditionally hand engineered.
 classification layers in which we have one or more fully connected layers to take the higher-order features
and produce class probabilities or scores.
 These layers are fully connected to all of the neurons in the previous layer, as their name implies.
 The output of these layers is typically a two-dimensional volume of dimensions [b × N], where b is the number of examples in the mini-batch and N is the number of classes.
Input Layers
 Input layers are where we load and store the raw input data of the image for processing in the
network.
 This input data specifies the width, height, and number of channels. Typically, the number of
channels is three, for the RGB values for each pixel.
 The input layer holds the raw input of the image.
 The input layer represents the pixel matrix of the image.
 The input layer allows the CNN to receive and make sense of information.
 The input layer consists of one or more artificial neurons.
 Each neuron receives input from the outside world.
Convolutional layers
 Convolutional layers are considered the core building blocks of CNN architectures.
 As Figure illustrates, convolutional layers transform the input data by using a patch of locally
connecting neurons from the previous layer.
 The layer will compute a dot product between the region of the neurons in the input layer and the
weights to which they are locally connected in the output layer.
 In convolutional neural networks (CNNs), the primary components are convolutional layers.

 These layers typically involve input vectors, such as an image, filters (or feature detectors), and
output vectors, often referred to as feature maps.

 As the input, such as an image, traverses through a convolutional layer, it undergoes abstraction into a
feature map, also known as an activation map. This process involves the convolution operation,
which enables the detection of more complex features within the image.

 Additionally, Rectified linear units (ReLU) commonly serve as activation functions within these
layers to introduce non-linearity into the network.

 Furthermore, CNNs often employ pooling operations to reduce the spatial dimensions of the feature
maps, leading to a more manageable output volume.

 Overall, convolutional layers play a crucial role in extracting meaningful features from the input data,
making them fundamental in tasks such as image classification and natural language processing,
among others, within the realm of machine learning models.

 Feature Map = Input Image x Feature Detector


 Convolutional layers convolve the input and pass the output to the next layer. This is analogous to a
neuron’s response to a single stimulus in the visual cortex. Each convolutional neuron processes data
only for its assigned receptive field.

 A convolution is a grouping function in mathematics. Convolution occurs in CNNs when two matrices
(rectangular arrays of numbers arranged in columns and rows) combine to generate a third matrix.

 In the convolutional layers of a CNN, these convolutions filter input data to extract information.

 Position the kernel’s center element above the source pixel. Then, replace the source pixel with a
weighted sum of itself and its neighboring pixels.

 Parameter sharing and local connectivity are two principles used in CNNs.

 In a feature map, all neurons share weights, which defines parameter sharing. Local connection means
each neuron connects only to a part of the input image, unlike a fully connected neural network where
all neurons connect to every input. This reduces the number of parameters in the system and speeds up
the calculation
Steps in a Convolution Layer
 Initialize Filters:
Randomly initialize a set of filters with learnable parameters.

 Convolve Filters with Input:


Slide the filters across the width and height of the input data, computing the dot product
between the filter and the input sub-region.

 Apply Activation Function:


Apply a non-linear activation function to the convolved output to introduce non-linearity.

 Pooling (Optional):
Often followed by a pooling layer (like max pooling) to reduce the spatial dimensions of the
feature map and retain the most important information.
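A toy NumPy sketch of these four steps on a single-channel input with one 3x3 filter (stride 1, no padding; the sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
image  = rng.random((6, 6))           # toy single-channel input "image"
kernel = rng.standard_normal((3, 3))  # step 1: randomly initialized filter

# Step 2: slide the filter over the input and compute dot products (stride 1, no padding).
out_h, out_w = image.shape[0] - 2, image.shape[1] - 2
feature_map = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        region = image[i:i + 3, j:j + 3]
        feature_map[i, j] = np.sum(region * kernel)

# Step 3: apply a non-linear activation (ReLU).
feature_map = np.maximum(0.0, feature_map)

# Step 4 (optional): 2x2 max pooling with stride 2.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(feature_map.shape, pooled.shape)   # (4, 4) -> (2, 2)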
Key Components of a CNN
 The convolutional neural network is made of four main parts.

 They help the CNNs mimic how the human brain operates to recognize patterns and features in
images:

 Convolutional layers
 Rectified Linear Unit (ReLU for short)
 Pooling layers
 Fully connected layers

 This section dives into the definition of each one of these components through the example of
the following example of classification of a handwritten digit.
Convolutional layers
 The convolution operation, is known as the feature detector of a CNN.
 The input to a convolution can be raw data or a feature map output from another convolution.
 It is often interpreted as a filter in which the kernel filters input data for certain kinds of
information; for example, an edge kernel lets pass through only information from the edge of an
image.
 Convolutional layers have parameters for the layer and additional hyper parameters.
 Gradient descent is used to train the parameters in this layer such that the class scores are
consistent with the labels in the training set.
 Following are the major components of convolutional layers:
Filters
Activation maps
Parameter sharing
Layer-specific hyper parameters
Filters
 The parameters for a convolutional layer configure the layer’s set of filters.
 Filters are a function that has a width and height smaller than the width and height of the input
volume.
 Filters (e.g., convolutions) are applied across the width and height of the input volume in a sliding
window manner, as demonstrated in Figure
 Filters are also applied for every depth of the input volume. We compute the output of the filter by
producing the dot product of the filter and the input region.
ReLU activation functions as layers
 With CNNs, we often see ReLU layers used.
 The ReLU layer will apply an element-wise activation function over the input data, thresholding at zero — for example, max(0, x) — giving us the same dimension output as the input to the layer.
 Running this function over the input volume will change the pixel values but will not change the spatial dimensions of
the input data in the output.
 ReLU layers do not have parameters nor additional hyperparameters.
• ReLU (Rectified Linear Unit): This is the most commonly used activation function in CNNs. It returns 0 if it
receives any negative input, but for any positive value x, it returns that value back. Hence, it can be written as f(x) =
max(0, x). The function is non-linear, which means the output is not proportional to the input. It helps to alleviate the
vanishing gradient problem.
• Leaky ReLU: Leaky ReLU is a variant of ReLU. Instead of being 0 when x < 0, a leaky ReLU allows a small, non-
zero, constant gradient α (Normally, α=0.01). Hence, the function could be written as f(x)=max(αx,x). It mitigates the
dying ReLU problem which refers to the problem when the ReLU neurons become inactive and only output 0 for any
input.
• ReLUs also prevent the emergence of the so-called “vanishing gradient” problem, which is common when using
sigmoidal functions. This problem refers to the tendency for the gradient of a neuron to approach zero for high values
of the input.
• While sigmoidal functions have derivatives that tend to 0 as the input approaches positive infinity, the derivative of ReLU remains a constant 1 for positive inputs. This allows backpropagation of the error and learning to continue, even for high values of the input to the activation function:
How ReLU Helps:

Non-Saturating: ReLU does not saturate for positive inputs, meaning its gradient remains 1, whereas
functions like sigmoid and tanh have small gradients in their saturated regions.

Avoiding Exponential Decay: The constant gradient of 1 for positive inputs ensures that the gradients do
not diminish exponentially as they propagate backward through the network.

Faster Convergence: ReLU's simple and efficient gradient calculation allows for faster training and
convergence of neural networks compared to saturating activation functions.

Sparsity: ReLU creates sparse activations, which can help in learning more informative representations,
making it suitable for many types of problems.

Limitations of ReLU:
"Dead Neurons": If a neuron receives a constant negative input, its activation and gradient become 0,
causing the neuron to become inactive, or "dead," which can be problematic.

Leaky ReLU and Alternatives: To address the "dead neuron" issue, variations like Leaky ReLU (which
allows a small gradient for negative inputs) or other activation functions can be used.
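A small NumPy sketch contrasting ReLU and Leaky ReLU, and their gradients (alpha = 0.01 is the commonly used default mentioned above):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def relu_grad(x):
    return (x > 0).astype(float)            # 0 for negatives -> "dead" neurons possible

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)      # small but non-zero gradient for negatives

z = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])
print(relu(z), leaky_relu(z))
print(relu_grad(z), leaky_relu_grad(z))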
Convolutional layer hyper parameters
 In Convolutional Neural Networks (CNNs), hyperparameters are settings that control the
training process and network structure, like kernel size, stride, pooling size, and the number of
layers, which need to be set before training begins and directly impact performance.
 Following are the hyper parameters that dictate the spatial arrangement and size of the output
volume from a convolutional layer are:
 No. of layers/ depth of the layers
 Filter (or kernel) size (field size)
 Stride
 padding
Number of Layers

The number of layers in a CNN is a critical hyperparameter that determines the depth of the network.

A deeper network can learn more complex features and patterns from the data, but it is also more prone to overfitting.

Therefore, it is important to strike a balance between the number of layers and the complexity of the problem. A good
starting point is to use a small number of layers and gradually increase their depth until the desired performance is
achieved.
Filter Size
The filter size is another important hyperparameter that determines the receptive field of each
convolutional layer.
A larger filter size can capture more information from the input image, but it also increases the number of
parameters in the network.

A smaller filter size can reduce the number of parameters, but it may not be able to capture all the
relevant features in the image.

Therefore, it is important to experiment with different filter sizes and choose the one that gives the best
performance.
Stride

 The stride is a hyperparameter that determines the number of pixels by which the filter moves across
the input image.

 A larger stride can reduce the size of the output feature maps, but it can also lead to information loss.

 A smaller stride can preserve more information, but it also increases the computation time and
memory requirements.

 Therefore, it is important to choose an appropriate stride that balances the trade-off between information loss and computational efficiency. The default stride in a CNN is 1.
Pooling Layers
• Pooling layers are commonly inserted between successive convolutional layers.
• We want to follow convolutional layers with pooling layers to progressively reduce the spatial size
(width and height) of the data representation.
• Pooling layers reduce the data representation progressively over the network and help control overfitting.
• The pooling layer operates independently on every depth slice of the input .

• Pooling layer is used in CNNs to reduce the spatial dimensions (width and
height) of the input feature maps while retaining the most important
information. It involves sliding a two-dimensional filter over each channel
of a feature map and summarizing the features within the region covered
by the filter.
Max pooling: As the filter moves across the input, it selects the pixel with the maximum value to
send to the output array. As an aside, this approach tends to be used more often compared to
average pooling.

Average pooling: As the filter moves across the input, it calculates the average value within the
receptive field to send to the output array.
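A small NumPy sketch of max pooling versus average pooling with a 2x2 filter and stride 2 on one depth slice (values are arbitrary):

import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 0],
              [3, 4, 8, 6]], dtype=float)

# Group the 4x4 slice into non-overlapping 2x2 regions and summarize each region.
max_pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))    # [[6, 4], [7, 9]]
avg_pooled = x.reshape(2, 2, 2, 2).mean(axis=(1, 3))   # [[3.75, 2.25], [4.0, 5.75]]
print(max_pooled)
print(avg_pooled)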
Padding

 Padding is a technique used to preserve the spatial dimensions of the input image while applying
convolutional layers.

 It involves adding zeros around the border of the input image to create a padded image that can be
convolved with the filter.

 Padding can help preserve the information at the edges of the image and prevent the loss of spatial
resolution.

 However, it also increases the memory requirements and computation time of the network.

 Therefore, it is important to experiment with different padding techniques and choose the one that
gives the best performance.
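The spatial output size of a convolutional (or pooling) layer follows the standard formula O = (W - F + 2P) / S + 1, where W is the input size, F the filter size, P the padding and S the stride. A small sketch:

def conv_output_size(input_size, filter_size, padding, stride):
    # Standard formula: O = (W - F + 2P) / S + 1
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(224, 3, padding=0, stride=1))  # 222 (no padding shrinks the map)
print(conv_output_size(224, 3, padding=1, stride=1))  # 224 ("same" padding preserves size)
print(conv_output_size(224, 2, padding=0, stride=2))  # 112 (a 2x2, stride-2 pooling halves it)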
Learning Rate

The learning rate is a hyperparameter that determines the step size at which the network updates its
parameters during training.

A large learning rate can lead to rapid convergence but may result in unstable and oscillating training.

A small learning rate can ensure stable and smooth training but may result in slower convergence.

Therefore, it is important to experiment with different learning rates and choose the one that gives the
best trade-off between training speed and stability.
Fully Connected Layers
• We use this layer to compute class scores that we’ll use as output of the network (e.g., the output layer at
the end of the network).
• The dimensions of the output volume are [1 × 1 × N], where N is the number of output classes we’re evaluating.
• Fully connected layers have the normal parameters for the layer and hyperparameters.
• Fully connected layers perform transformations on the input data volume that are a function of the
activations in the input volume and the parameters (weights and biases of the neurons).
Batch Size
The batch size is a hyperparameter that determines the number of samples that are processed by the
network in each training iteration. A larger batch size can reduce the variance of the gradient
estimates and improve the stability of the training. However, it also increases the memory
requirements and may lead to slower convergence. A smaller batch size can reduce the memory
requirements and improve the convergence speed but may lead to noisy gradient estimates.
Therefore, it is important to experiment with different batch sizes and choose the one that gives the
best trade-off between stability and speed.
Other Applications of CNNs
 Beyond normal two-dimensional image data, we also see CNNs applied to three dimensional datasets.
Here are some examples of these alternative uses:
 MRI data
 3D shape data
 Graph data
 NLP applications
 The position-invariant nature of CNNs has proven useful in these domains because we’re not limited
to hand-coding our features to appear in certain “spots” in the feature vector.
Convolution Layer

This is the first layer in the CNN architecture and is used to extract features from the image. Convolving an image with various filters performs operations such as sharpening and edge detection.
Pooling Layer
This layer is used to reduce the number of parameters when the image is too large. There are three types: maximum pooling, which takes the largest element from the feature map; average pooling, which takes the average of all the elements of the feature map; and sum pooling, which takes the sum of all the elements of the feature map.
ReLU Layer
ReLU stands for Rectified Linear Unit. It is an activation function applied to the outputs of CNN neurons. The mathematical formula for the Rectified Linear Unit is y = max(0, x).
Fully Connected Layer

The fully connected (FC) layer takes a flattened feature vector as input. The operation of the FC layer is to combine and merge the high-level features learned by the convolutional layers. It passes the flattened output that is generated to the output layer.
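Putting the layer types above together, here is a minimal, illustrative Keras sketch of a CNN built from scratch for the handwritten-digit example (the 28x28 grayscale input and 10 classes are assumptions; x_train and y_train are placeholders):

import tensorflow as tf
from tensorflow.keras import layers, models

# A small CNN combining convolution + ReLU, pooling, flattening,
# a fully connected layer, and a softmax output layer.
model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5, batch_size=64)  # once the data is loaded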
Pretrained models:

1. Alexnet
2. VGG-16
3. VGG-19
4. GoogLeNet
5. ResNet (Residual Network)
AlexNet
 The convolutional neural network (CNN) architecture known as AlexNet was created by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, who served as Krizhevsky’s PhD advisor.
 AlexNet was created to be more computationally efficient than earlier CNN topologies.

 It introduced parallel computing by utilizing two GPUs during training.

 It has eight layers, which makes it simpler to train and less prone to overfitting on smaller datasets.

 Several architectural improvements were introduced by AlexNet, including the use of rectified linear
units (ReLU) as activation functions, overlapping pooling, and dropout regularisation.

 AlexNet consists of 8 layers, including 5 convolutional layers and 3 fully connected layers.

 It uses traditional stacked convolutional layers with max-pooling in between. Its deep network structure
allows for the extraction of complex features from images.

 The architecture employs overlapping pooling layers to reduce spatial dimensions while retaining the
spatial relationships among neighbouring features.

 Activation function: AlexNet uses the ReLU activation function and dropout regularization, which
enhance the model’s ability to capture non-linear relationships within the data.
Visualization of Deep Neural Networks
Observations

Increasing Filters: The number of filters increases as we go deeper, allowing for more complex feature extraction.
Decreasing Filter Size: The filter size reduces in each layer, from larger filters at the beginning to smaller ones deeper in the architecture, resulting in a smaller feature map shape.
Dropout
 A neuron is removed from the neural network during dropout with a probability of 0.5.
 A neuron that is dropped does not make any contribution to either forward or backward propagation.
 As seen in the graphic below, each input is processed by a separate Neural Network design.
 The acquired weight parameters are therefore more reliable and less prone to overfitting.
 After this, we have our first dropout layer. The drop-out rate is set to be 0.5.

 Then we have the first fully connected layer with a relu activation function.

 The size of the output is 4096. Next comes another dropout layer with the dropout rate
fixed at 0.5.

 This followed by a second fully connected layer with 4096 neurons and relu activation.

 Finally, we have the last fully connected layer, or output layer, with 1000 neurons, as we have 1000 classes in the data set.

 The activation function used at this layer is Softmax.

 This is the architecture of the Alexnet model. It has a total of 62.3 million learnable
parameters
 In the AlexNet architecture, an input image is passed through a convolutional layer and max-
pooling layer twice.
 Then, pass it through a series of three convolutional layers followed by a single max-pooling
layer.
 After this step, there are three hidden layers followed by the output.
 In AlexNet, the overall computation in the final stage would result in a 4096-D vector for every
image that contains the activations of the hidden layer immediately before the classifier.
 While most of the layers would utilize the ReLU activation function, the final layer makes use of
the SoftMax activation
summary:
• It has 8 layers with learnable parameters.
• The input to the Model is RGB images.
• It has 5 convolution layers with a combination of max-pooling layers.
• Then it has 3 fully connected layers.
• The activation function used in all layers is Relu.
• It used two Dropout layers.
• The activation function used in the output layer is Softmax.
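A condensed Keras sketch of an AlexNet-style stack matching the summary above (the filter sizes and the 227x227 input follow common single-network reimplementations; exact details vary across sources):

from tensorflow.keras import layers, models

alexnet = models.Sequential([
    layers.Input(shape=(227, 227, 3)),
    layers.Conv2D(96, (11, 11), strides=4, activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),                  # overlapping pooling
    layers.Conv2D(256, (5, 5), padding="same", activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(384, (3, 3), padding="same", activation="relu"),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((3, 3), strides=2),
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                                     # first dropout layer
    layers.Dense(4096, activation="relu"),
    layers.Dropout(0.5),                                     # second dropout layer
    layers.Dense(1000, activation="softmax"),                # 1000 ImageNet classes
])
alexnet.summary()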
VGG-16
 The "VGG" in VGG-16 stands for "Visual Geometric Group".
 It is associated with the Department of Science and Engineering at Oxford University.
 In the initial stages, this architecture was used to study the accuracies of large-scale classification.
 Now, the most common use for the VGG-16 architecture is mainly for solving tasks such as face
recognition and image classification.
 VGG16 is a type of CNN (Convolutional Neural Network) that is considered to be one of the best
computer vision models to date.
 The creators of this model evaluated the networks and increased the depth using an architecture with very small (3 × 3) convolution filters, which showed a significant improvement over the prior-art configurations. They pushed the depth to 16–19 weight layers, giving approximately 138 million trainable parameters.
 The 16 in VGG16 refers to 16 layers that have weights.
 In VGG16 there are thirteen convolutional layers, five Max Pooling layers, and three Dense layers
which sum up to 21 layers but it has only sixteen weight layers i.e., learnable parameters layer.
 VGG16 takes an input tensor of size 224 x 224 with 3 RGB channels.
 The most unique thing about VGG16 is that instead of having a large number of hyper-parameters, they focused on having convolution layers of 3x3 filters with stride 1, always using the same padding, and maxpool layers of 2x2 filters with stride 2.
 The convolution and max pool layers are consistently arranged throughout the whole architecture
 Conv-1 has 64 filters, Conv-2 has 128 filters, Conv-3 has 256 filters, and Conv-4 and Conv-5 have 512 filters each.
 Three Fully-Connected (FC) layers follow a stack of convolutional layers: the first two have 4096
channels each, the third performs 1000-way ILSVRC (ImageNet Large Scale Visual Recognition
Challenge ) classification and thus contains 1000 channels (one for each class). The final layer is the
soft-max layer.
VGG-16
Define the VGG16 model as sequential model
→ 2 x convolution layers of 64 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 2 x convolution layers of 128 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 3 x convolution layers of 256 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 3 x convolution layers of 512 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
→ 3 x convolution layers of 512 channels with 3x3 kernel and same padding
→ 1 x maxpool layer of 2x2 pool size and stride 2x2
After creating all the convolution, pass the data to the dense layer so for that we flatten the
vector which comes out of the convolutions and add:

→ 1 x Dense layer of 256 units


→ 1 x Dense layer of 128 units
→ 1 x Dense Softmax layer of 2 units
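A Keras sketch of the sequential VGG16-style model described above, including the custom head with Dense layers of 256 and 128 units and a 2-unit softmax (this is the variant listed here, not the original 1000-class ImageNet head):

from tensorflow.keras import layers, models

def conv_block(model, filters, n_convs):
    # n_convs 3x3 convolutions with same padding, followed by 2x2 max pooling, stride 2.
    for _ in range(n_convs):
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
    model.add(layers.MaxPooling2D((2, 2), strides=(2, 2)))

model = models.Sequential([layers.Input(shape=(224, 224, 3))])
conv_block(model, 64, 2)
conv_block(model, 128, 2)
conv_block(model, 256, 3)
conv_block(model, 512, 3)
conv_block(model, 512, 3)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation="relu"))
model.add(layers.Dense(128, activation="relu"))
model.add(layers.Dense(2, activation="softmax"))
model.summary()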
 Input Layer: Input dimensions: (224, 224, 3)
 Convolutional Layers (64 filters, 3×3 filters, same padding):
Two consecutive convolutional layers with 64 filters each and a filter size of 3×3.
Same padding is applied to maintain spatial dimensions.
 Max Pooling Layer (2×2, stride 2):
Max-pooling layer with a pool size of 2×2 and a stride of 2.
 Convolutional Layers (128 filters, 3×3 filters, same padding):
Two consecutive convolutional layers with 128 filters each and a filter size of 3×3.
 Max Pooling Layer (2×2, stride 2):
Max-pooling layer with a pool size of 2×2 and a stride of 2.
 Convolutional Layers (256 filters, 3×3 filters, same padding):
Three consecutive convolutional layers with 256 filters each and a filter size of 3×3.
 Max Pooling Layer (2×2, stride 2):
Max-pooling layer with a pool size of 2×2 and a stride of 2.
 Convolutional Layers (512 filters, 3×3 filters, same padding):
Two stacks of three consecutive convolutional layers with 512 filters each and a filter size of 3×3, each stack followed by a max-pooling layer (2×2, stride 2).
 Flattening:
Flatten the output feature map (7x7x512) into a vector of size 25088.
 Fully Connected Layers:
 Three fully connected layers with ReLU activation.
 First layer with input size 25088 and output size 4096.
 Second layer with input size 4096 and output size 4096.
 Third layer with input size 4096 and output size 1000, corresponding to the 1000 classes
in the ILSVRC challenge.
 Softmax activation is applied to the output of the third fully connected layer for
classification.
This architecture follows the specifications provided, including the use of ReLU activation
function and the final fully connected layer outputting probabilities for 1000 classes using softmax
activation.
VGG-16

 Convolution layers 1–2 have 64 filters each
 Convolution layers 3–4 have 128 filters each
 Convolution layers 5–7 have 256 filters each
 Convolution layers 8–13 have 512 filters each
 5 max pooling layers with pool size (2, 2)
 3 fully connected layers
VGG-16
Limitations Of VGG 16:

1. It is very slow to train (the original VGG model was trained on Nvidia Titan GPU for
2–3 weeks).
2. The size of VGG-16 trained imageNet weights is 528 MB. So, it takes quite a lot of
disk space and bandwidth which makes it inefficient.
3. 138 million parameters lead to exploding gradients problem.
VGG16 Vs VGG19
 VGG16 is a CNN architecture that was the runner-up in the 2014 ILSVRC (ImageNet) competition. The VGG16 network configuration takes a 224 × 224-pixel image with three channels (R, G, and B). VGG16 has 16 layers.
 It follows an arrangement of 13 convolutional layers and 3 fully connected layers, with max-pooling layers that reduce the volume size, and a softmax activation function after the last fully connected layer.
 Instead of having a large number of hyper-parameters, VGG16 focuses on 3x3 filter convolution layers
with stride 1 and always utilizes the same padding and MaxPool layer of a 2x2 filter with stride 2.
 The VGG19 architecture is a variant of the VGG model, consisting of 16 convolutional layers, 3 FC layers, 5 MaxPool layers and 1 SoftMax layer.
 The fixed-size input image is a 224 by 224 pixel with three channels (R, G, and B) which means that the
matrix is of shape (224,224,3).
 The concept of the VGG19 model (also VGGNet-19) is the same as the VGG16 except that it supports 19
layers. The “16” and “19” stand for the number of weight layers in the model (convolutional layers). This
means that VGG19 has three more convolutional layers than VGG16.
Advantages:
 Accuracy: VGG19 has achieved state-of-the-art results on the ImageNet dataset and has been
used as a benchmark model for image classification tasks.
 Transfer Learning: VGG19 has a large number of pre-trained models available, making it easy
to use for transfer learning in other computer vision tasks.
 Simple architecture: The VGG19 architecture is relatively simple, making it easy to understand
and implement.
 Feature extraction: The VGG19 model learns to extract rich features from the images, which can
be useful in other computer vision tasks.
Disadvantages:
 Large model size: VGG19 has a large number of parameters, which can make it computationally
expensive to train and use.
 Limited to image classification: VGG19 is primarily used for image classification tasks, and
may not perform as well in other computer vision tasks.
 Limited interpretability: Due to the complex nature of deep learning models, it can be difficult
to understand how VGG19 arrives at its classifications.
 Limited flexibility: VGG19 has a fixed architecture, which may not be suitable for all computer
vision tasks, and may require modifications or customizations.
VGG
GoogLeNet Model – CNN Architecture

GoogleNet
 GoogLeNet (or Inception V1) was proposed by researchers at Google (with the collaboration of various universities) in 2014 in the research paper titled “Going Deeper with Convolutions”.

 This architecture was the winner at the ILSVRC 2014 image classification challenge.

 It has provided a significant decrease in error rate as compared to previous winners AlexNet (Winner of
ILSVRC 2012) and ZF-Net (Winner of ILSVRC 2013) and significantly less error rate than VGG
(2014 runner up).

 This architecture uses techniques such as 1×1 convolutions in the middle of the architecture and global
average pooling.

 The GoogLeNet architecture is very different from previous state-of-the-art architectures such as
AlexNet and ZF-Net

 It uses many different kinds of methods such as 1×1 convolution and global average pooling that
enables it to create deeper architecture
 Since AlexNet, the state-of-the-art CNN architecture is going deeper and deeper.
 While AlexNet had only 5 convolutional layers, the VGG network and GoogleNet (also code named Inception_v1)
had 19 and 22 layers respectively.
 After the first CNN-based architecture (AlexNet) that won the ImageNet 2012 competition, every subsequent winning architecture uses more layers in a deep neural network to reduce the error rate.
 However, increasing network depth does not work by simply stacking layers together.
 Deep networks are hard to train because of the notorious vanishing gradient problem
 While backpropagating, the chain rule is followed: the derivatives of each layer are multiplied down the network.
 When a lot of deep layers are used with activation functions like sigmoid, the derivatives are scaled down below 0.25 for each layer.
 So when the derivatives of n layers are multiplied, the gradient decreases exponentially as we propagate down to the initial layers.
 So, as the gradient is back-propagated to earlier layers, repeated multiplication may make the gradient infinitesimally small.
 As a result, as the network goes deeper, its performance gets saturated or even starts degrading rapidly.
One of the main challenges with increasing the depth of CNNs is that it can lead to the problem of
vanishing gradients. This is because as the network becomes deeper, the gradients of the loss function
with respect to the weights of the network become smaller and smaller. This makes it difficult for the
network to learn effectively.

Another challenge with increasing the depth of CNNs is that it can lead to an explosion in computational requirements. This is because the number of computations required to propagate an input through a deep network grows rapidly with the depth of the network.

Bringing Inception Modules into the picture

GoogLeNet addressed the challenges of previous CNN architectures by introducing the concept of
inception modules. Inception modules are a type of building block that allows for the parallel processing
of data at multiple scales. This allows the network to capture features at different scales more efficiently
than previous architectures.
 An inception module typically consists of several convolutional layers with different filter sizes.

 These layers are arranged in parallel, so that the network can process the input data at multiple
resolutions simultaneously.

 The output of the convolutional layers is then concatenated and passed through a pooling layer.

 However, later there were various versions of the inception module which was integrated
accordingly in the architecture which consisted of different layers and filter size patterns.
The "naive" version of an Inception module, as used in the original GoogLeNet (Inception v1), directly
applies multiple convolution filters (1x1, 3x3, and 5x5) in parallel to the input, aiming to capture
features at different scales.
Inception module with naive version
 This module simultaneously performs 1 * 1 convolutions, 3 * 3 convolutions, 5 * 5 convolutions, and 3 *
3 max pooling operations.

 Thereafter, it concatenates the outputs from all the operations in a single place and builds the next feature map. The architecture does not follow the sequential model approach where every operation such as pooling or convolution is performed one after the other.

 As the inception module extracts a different kind of data or information from every convolution or
pooling operation different features are extracted from each operation.

 For instance, 1 * 1 convolutions and 3 * 3 convolutions will generate different information.

 After the individual operations have been performed simultaneously all the extracted data will be
combined into a single feature map with all the properties. This will in turn increase the accuracy of the
model as it will focus on multiple features simultaneously.

 The output dimensions of the extracted feature maps will be different, as the kernel size for every operation is not the same. These different feature maps generated through the different operations are concatenated together using a padding operation, which makes the output dimension of every operation the same.
Inception module with dimension reduction
The Inception module with dimension reduction works in a similar manner as the naïve one, with only one difference: here, features are extracted at the pixel level using 1 × 1 convolutions before the 3 × 3 and 5 × 5 convolutions. When the 1 × 1 convolution operation is performed, the spatial dimensions of the image are not changed (only the depth is reduced). However, the output achieved offers better accuracy.
Four Parallel Channel Processing

1 * 1 Convolution Operation:
The input feature map can be reduced in dimension (and later expanded) without too much loss of input information. This operation has a receptive field of a single pixel, as it gathers data at the pixel level.
3 * 3 Convolution Operation:
The operation increases the receptive field of the feature map. This allows the kernel to gather information
regarding various shapes and sizes.
5 * 5 Convolution Operation:
The operation further increases the receptive field of the feature map.
3 * 3 Max Pooling:
The pooling layer will lose space information. However, it will be effectively applied on various space fields,
increasing the effectiveness of the four-channel parallel processing.
While implementing various operations simultaneously we might lose certain information or dimensions. But this is completely fine: if one convolution operation does not capture a certain feature, another operation will.
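An illustrative Keras functional-API sketch of an Inception module with dimension reduction; the four parallel branches and 1x1 reductions follow the description above, while the specific filter counts are borrowed from the commonly cited Inception (3a) block and are assumptions here:

from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    # 1x1 convolution branch
    b1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    # 1x1 reduction followed by 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(b2)
    # 1x1 reduction followed by 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, (1, 1), padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(b3)
    # 3x3 max pooling followed by 1x1 projection
    b4 = layers.MaxPooling2D((3, 3), strides=(1, 1), padding="same")(x)
    b4 = layers.Conv2D(pool_proj, (1, 1), padding="same", activation="relu")(b4)
    # Same padding keeps spatial sizes equal, so branches concatenate along depth.
    return layers.concatenate([b1, b2, b3, b4])

inputs = layers.Input(shape=(28, 28, 192))
out = inception_module(inputs, 64, 96, 128, 16, 32, 32)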
Disadvantage:
Larger models using InceptionNet are prone to overfitting, especially with a limited number of labeled samples. The model can be biased towards classes whose labels are present in higher volume than the others.
Global Average Pooling :

In the previous architecture such as AlexNet, the fully connected layers are used at the end of the network.

These fully connected layers contain the majority of parameters of many architectures that causes an increase in
computation cost.

In GoogLeNet architecture, there is a method called global average pooling is used at the end of the network.

This layer takes a feature map of 7×7 and averages it to 1×1. This layer adds no trainable parameters and improves the top-1 accuracy by 0.6%.
Auxiliary Classifier for Training:

Inception architecture used some intermediate classifier branches in the middle of the
architecture, these branches are used during training only.

These branches consist of a 5×5 average pooling layer with a stride of 3, a 1×1 convolution with 128 filters, two fully connected layers of 1024 and 1000 outputs, and a softmax classification layer.

The generated loss of these layers added to total loss with a weight of 0.3. These layers help in
combating gradient vanishing problem and also provide regularization.
 The overall architecture is 22 layers deep. The architecture was designed to keep computational
efficiency in mind.

 The idea behind that the architecture can be run on individual devices even with low computational
resources.

 The architecture also contains two auxiliary classifier layers connected to the outputs of the Inception (4a) and Inception (4d) layers.

 The architectural details of auxiliary classifiers as follows:

 An average pooling layer of filter size 5×5 and stride 3.


 A 1×1 convolution with 128 filters for dimension reduction and ReLU activation.
 A fully connected layer with 1024 outputs and ReLU activation
 Dropout Regularization with dropout ratio = 0.7
 A softmax classifier with 1000 classes output similar to the main softmax classifier.
RESNET (Residual Network: )
 Every consecutive winning architecture uses more layers in a deep neural network to lower the error
rate after the first CNN-based architecture (AlexNet) that won the ImageNet 2012 competition. This
is effective for smaller numbers of layers, but when we add more layers, a typical deep learning issue
known as the Vanishing/Exploding gradient arises. This results in the gradient becoming zero or
being overly large. Therefore, the training and test error rate similarly increases as the number of
layers is increased.
 When increasing the number of layers, there is a common problem in deep learning associated with that, called the vanishing/exploding gradient.
 This causes the gradient to become 0 or too large. Thus, when we increase the number of layers, the training and test error rates also increase.
 Residual Network: In order to solve the problem of vanishing/exploding gradients, a novel architecture called Residual Network (ResNet) was introduced by Microsoft Research experts in 2015; ResNet introduced the concept called Residual Blocks.
 In this network, a technique called skip connections is used which are the major part of ResNet.
 The skip connection connects activations of a layer to further layers by skipping some layers in between.
This forms a residual block.
 Resnets are made by stacking these residual blocks together.
 ResNet makes it possible to train up to hundreds or even thousands of layers and still achieves compelling
performance.
 Taking advantage of its powerful representational ability, the performance of many computer vision
applications other than image classification have been boosted, such as object detection and face
recognition.
 The idea is to connect the input of a layer directly
to the output of a layer after skipping a few
connections.
 x is the input to the layer, which we connect directly to a later layer by skipping some layers in between (the identity connection). If we call the output of the skipped weight layers F(x), then the output of the block will be F(x) + x.
 The advantage of adding this type of skip connection is that if any layer hurts the performance of the architecture, it can be skipped by regularization.

 So, this results in training a very deep neural network without the problems caused by vanishing/exploding gradients.
Residual Learning – Building block
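A minimal Keras sketch of a residual building block implementing output = F(x) + x (the two 3x3 convolutions with batch normalization follow the common basic-block form; sizes are illustrative):

from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                          # identity skip connection carries x
    y = layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])       # F(x) + x
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(56, 56, 64))
out = residual_block(inputs, 64)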
 Network Architecture: This network uses a 34-layer plain network architecture inspired by VGG-19, to which the shortcut connections are then added.
 These shortcut connections then convert the architecture into a residual network.
 The benefit of including this kind of skip link is that regularisation will skip any layer that
degrades architecture performance. As a result, training an extremely deep neural network is
possible without encountering issues with vanishing or expanding gradients
ResNET50

Architecture Details
 (7,7) convolution, 64 filters, stride 2 – 1 layer
 (1,1) 64 + (3,3) 64 + (1,1) 256, repeated 3 times – 9 layers
 (1,1) 128 + (3,3) 128 + (1,1) 512, repeated 4 times – 12 layers
 (1,1) 256 + (3,3) 256 + (1,1) 1024, repeated 6 times – 18 layers
 (1,1) 512 + (3,3) 512 + (1,1) 2048, repeated 3 times – 9 layers
 Average pooling and a fully connected layer with 1000 nodes – 1 layer

ResNET50
• ResNet, short for Residual Network is a specific type of neural
network that was introduced in 2015 by Kaiming He, Xiangyu Zhang,
Shaoqing Ren and Jian Sun in their paper “Deep Residual Learning
for Image Recognition”
Resnet
ResNet using Keras

An open-source, Python-based neural network framework called Keras may be used with
TensorFlow, Microsoft Cognitive Toolkit, R, Theano, or PlaidML. It is made to make deep neural
network experimentation quick.

The following ResNet implementations are part of Keras Applications and offer ResNet V1 and
ResNet V2 with 50, 101, or 152 layers.

ResNet50
ResNet101
ResNet152
ResNet50V2
ResNet101V2
ResNet152V2
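A short usage sketch of loading pretrained ResNet50 from Keras Applications and classifying a single image ("elephant.jpg" is a placeholder path):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image
import numpy as np

# Load ResNet50 with ImageNet weights, including the 1000-class classifier head.
model = ResNet50(weights="imagenet")

img = image.load_img("elephant.jpg", target_size=(224, 224))  # placeholder image path
x = image.img_to_array(img)
x = preprocess_input(np.expand_dims(x, axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])   # top-3 ImageNet class predictions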
Transfer learning
 The reuse of a pre-trained model on a new problem is known as transfer learning in machine
learning.
 From a deep learning perspective, the image classification problem can be solved through transfer
learning.
 Transfer learning is a popular method in computer vision because it allows us to build accurate
models in a timesaving way.
 With transfer learning, instead of starting the learning process from scratch, you start from patterns
that have been learned when solving a different problem.
 This way you leverage previous learnings and avoid starting from scratch.
 i.e., a machine exploits the knowledge gained from a previous task to improve generalization
about another.
 For example, in training a classifier to predict whether an image contains food, you could use the
knowledge it gained during training to recognize drinks.
 In computer vision, transfer learning is usually expressed through the use of pre-trained models.
 A pre-trained model is a model that was trained on a large benchmark dataset to solve a problem
similar to the one that we want to solve.
 Accordingly, due to the computational cost of training such models, it is common practice to
import and use models from published literature (e.g. VGG, Inception, MobileNet).
CNN vs Transfer Learning
• When training a CNN from scratch you need to do more preprocessing of the dataset, but with transfer learning you only need light processing, such as resizing images to 227 x 227 or 224 x 224 according to the selected pre-trained model (AlexNet, GoogLeNet, ResNet, the VGG networks, etc.). This saves much of the time spent preprocessing data.
• Convolutional neural networks
• Several pre-trained models used in transfer learning are based on
large convolutional neural networks (CNN).
• In general, CNNs have been shown to excel in a wide range of computer vision tasks.
• Their high performance and relative ease of training are two of the main factors driving the popularity of CNNs over recent years.
architecture of a model based
on CNN
• A typical CNN has two parts:
• Convolutional base, which is composed of a stack of convolutional and pooling layers. The main goal of the convolutional base is to generate features from the image.
• Classifier, which is usually composed of fully connected layers. The main goal of the classifier is to classify the image based on the detected features. A fully connected layer is a layer whose neurons have full connections to all activations in the previous layer.
• One important aspect of these deep learning models is that they can
automatically learn hierarchical feature representations.
• This means that features computed by the first layer are general and can
be reused in different problem domains, while features computed by the
last layer are specific and depend on the chosen dataset and task.
• The convolutional base of our CNN — especially its lower layers (those that are closer to the inputs) — captures general features, whereas the classifier part, and some of the higher layers of the convolutional base, capture specialised features.
• Repurposing a pre-trained model
• start by removing the original classifier, then you add a new classifier that fits your
purposes, and finally you have to fine-tune your model according to one of three
strategies:

1.Train the entire model. Use the architecture of the pre-trained model and train it
according to your dataset. You’re learning the model from scratch, so you’ll need a
large dataset (and a lot of computational power).

2.Train some layers and leave the others frozen. lower layers refer to general
features (problem independent), while higher layers refer to specific features (problem
dependent). Usually, if you’ve a small dataset and a large number of parameters, you’ll
leave more layers frozen to avoid overfitting. By contrast, if the dataset is large and the
number of parameters is small, you can improve your model by training more layers to
the new task since overfitting is not an issue.

3.Freeze the convolutional base. This case corresponds to an extreme situation of the train/freeze trade-off. The main idea is to keep the convolutional base in its original form and then use its outputs to feed the classifier. You’re using the pre-trained model as a fixed feature extraction mechanism, which can be useful if you’re short on computational power, your dataset is small, and/or the pre-trained model solves a problem very similar to the one you want to solve (a Keras sketch of this strategy follows the list).
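The following is a minimal, hedged sketch of strategy 3 in Keras: the convolutional base of a pre-trained VGG16 is frozen and a new classifier is stacked on top. num_classes and the input size are placeholder assumptions for the new task:

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 5  # hypothetical number of classes in the new dataset

# Convolutional base without the original classifier (include_top=False)
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze every layer of the convolutional base

# New classifier that fits our purpose
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])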
Fine-tuning strategies
Transfer learning process
• The entire transfer learning process can be summarized as follows:
• 1. Select a pre-trained model. From the wide range of pre-trained models that are available, you pick one that
looks suitable for your problem. Ex: If you’re using Keras, you immediately have access to a set of models, such as
VGG, InceptionV3, ResNet50.
• 2. Classify your problem according to the Size-Similarity Matrix. This matrix controls your choices: it classifies your computer vision problem considering the size of your dataset and its similarity to the dataset on which the pre-trained model was trained.
• As a rule of thumb, consider that your dataset is small if it has less than 1000 images per class. Regarding dataset
similarity, let common sense prevail.
• For example, if your task is to identify cats and dogs, ImageNet would be a similar dataset because it has images of
cats and dogs. However, if your task is to identify cancer cells, ImageNet can’t be considered a similar dataset.
3. Fine-tune your model. you can use the Size-Similarity Matrix to guide your choice and then
refer to the three options we mentioned before about repurposing a pre-trained model.
Size-Similarity matrix (left) and decision map for fine-tuning pre-trained
models (right)
• Quadrant 1. Large dataset, but different from the pre-trained model’s dataset.
This situation will lead you to Strategy 1. Since you have a large dataset, you’re
able to train a model from scratch and do whatever you want. Despite the
dataset dissimilarity, in practice, it can still be useful to initialize your model from
a pre-trained model, using its architecture and weights.
• Quadrant 2. Large dataset and similar to the pre-trained model’s dataset. Here
you’re in la-la land. Any option works. Probably, the most efficient option
is Strategy 2. Since we have a large dataset, overfitting shouldn’t be an issue, so
we can learn as much as we want. However, since the datasets are similar, we can
save ourselves from a huge training effort by leveraging previous knowledge.
Therefore, it should be enough to train the classifier and the top layers of the
convolutional base.
• Quadrant 3. Small dataset and different from the pre-trained model’s dataset. This is the 2–7 off-suit hand of computer vision problems. Everything is against you. If complaining is not an option, the only hope you have is Strategy 2. It will be hard to find a balance between the number of layers to train and freeze. If you go too deep, your model can overfit; if you stay in the shallow end of your model, you won’t learn anything useful. Probably, you’ll need to go deeper than in Quadrant 2 and you’ll need to consider data augmentation techniques.
• Quadrant 4. Small dataset, but similar to the pre-trained model’s dataset.
You just need to remove the last fully-connected layer (output layer), run the
pre-trained model as a fixed feature extractor, and then use the resulting
features to train a new classifier.
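A hedged sketch of this fixed-feature-extractor approach; the image and label arrays below are placeholders, and scikit-learn is used only as one convenient choice for the new classifier:

import numpy as np
from tensorflow.keras.applications import ResNet50
from sklearn.linear_model import LogisticRegression

# Pre-trained base without its output layer; global average pooling gives one feature vector per image
base = ResNet50(weights="imagenet", include_top=False, pooling="avg",
                input_shape=(224, 224, 3))

X_train = np.random.rand(8, 224, 224, 3)        # placeholder images
y_train = np.array([0, 1, 0, 1, 0, 1, 0, 1])    # placeholder labels

features = base.predict(X_train)                # fixed features; the base is never trained
clf = LogisticRegression(max_iter=1000).fit(features, y_train)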
Optimization Algorithms
 Training a model in machine learning involves finding the best set of values for the parameter vector
of the model.
 We can think of machine learning as an optimization problem in which we minimize the loss function
with respect to the parameters of our prediction function (based on our model).
 optimization algorithms are divided into two camps:
 First-order
 Second-order
 First-order optimization algorithms calculate the Jacobian matrix.
 The Jacobian has one partial derivative per parameter (to calculate partial derivatives, all other
variables are momentarily treated as constants).
 The algorithm then takes one step in the direction specified by the Jacobian.
 Second-order algorithms calculate the derivative of the Jacobian (i.e., the derivative of a matrix of
derivatives) by approximating the Hessian.
 Second order methods take into account interdependencies between parameters when choosing how
much to modify each parameter.
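As a tiny, self-contained illustration of a first-order method, the loop below performs plain gradient descent on the one-parameter loss L(w) = (w - 3)^2, whose derivative is 2(w - 3); the loss and the numbers are invented purely for illustration:

# One-parameter gradient descent on L(w) = (w - 3)^2
w = 0.0     # initial parameter value
lr = 0.1    # learning rate (step size)
for _ in range(25):
    grad = 2 * (w - 3)   # first-order information: the derivative at the current w
    w -= lr * grad       # step in the direction opposite the gradient
print(w)  # approaches the minimizer w = 3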
• Layer size
• Layer size is defined by the number of neurons in a given layer.
• Input and output layers are relatively easy to figure out because they correspond directly to how our modeling
problem handles input and output.
• For the input layer, this will match up to the number of features in the input vector.
• For the output layer, this will either be a single output neuron or a number of neurons matching the number of
classes we are trying to predict.
• Deciding on neuron counts for each hidden layer is where hyper parameter tuning becomes a challenge.
• We can use an arbitrary number of neurons to define a layer and there are no rules about how big or small this
number can be.
• However, how complex of a problem we can model is directly correlated to how many neurons are in the
hidden layers of our networks.
• Depending on the deep network architecture, the connection schema between layers can vary.
• However, the weights on the connections are the parameters we must train.
• As we include more parameters in our model, we increase the amount of effort needed to train the
network.
• Large parameter counts can lead to long training times and models that struggle to find
convergence.
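A small illustrative sketch in Keras, assuming a problem with 20 input features and 3 output classes; the hidden-layer size of 64 is an arbitrary hyper parameter choice, and model.summary() shows how the parameter count grows with layer sizes:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(20,)),  # hidden layer; 20 = number of input features
    layers.Dense(3, activation="softmax"),                   # output layer: one neuron per class
])
model.summary()  # parameter count grows with the chosen layer sizes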
Magnitude hyper parameters
• Hyper parameters in the magnitude group involve the gradient, step size, and momentum.
Learning rate
• A hyper-parameter that controls how much we adjust the weights of our network with respect to the loss gradient. The lower the value, the slower we travel along the downward slope.
• The learning rate in machine learning is how fast we change the parameter vector as we move through search
space.
• If we make the learning rate high, we can move toward our goal faster (the least amount of error for the function being evaluated), but we might also take a step so large that we shoot right past the best answer to the problem.
• Try a few different values and see which one gives you the best loss without sacrificing speed of training. We might start with a large value like 0.1, then try exponentially lower values: 0.01, 0.001, etc.
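A hedged sketch of such a learning-rate sweep with the Keras SGD optimizer; build_model(), x and y are hypothetical placeholders for your own model factory and training data:

from tensorflow.keras.optimizers import SGD

for lr in [0.1, 0.01, 0.001, 0.0001]:
    model = build_model()  # hypothetical model factory defined elsewhere
    model.compile(optimizer=SGD(learning_rate=lr), loss="categorical_crossentropy")
    history = model.fit(x, y, epochs=5, verbose=0)          # x, y: placeholder training data
    print(lr, history.history["loss"][-1])                  # compare final loss across learning rates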
• If we make our learning rate too small, it might take a lot
longer than we’d like for our training process to complete.
• A low learning rate can make our learning algorithm
inefficient.
• Learning rates are tricky because they end up being specific to
the dataset and even to other hyper parameters.
• This creates a lot of overhead for finding the right setting for
hyper parameters.
• Nesterov’s momentum
• The “vanilla” version of SGD uses the gradient directly, and this can be problematic because the gradient can be nearly zero for any parameter.
• This causes SGD to take tiny steps in some cases, and steps that are too big in situations where the gradient is large.
• To alleviate these issues, we can use techniques such as the following:
• Nesterov’s momentum
• RMSProp
• Adam
• AdaDelta
• We can speed up our training by increasing momentum, but we might lower the chance
that the model will reach minimal error by overshooting the optimal parameter values.
• Momentum is a factor between 0.0 and 1.0 that is applied to the change rate of the
weights over time.
• Typically, we see the value for momentum between 0.9 and 0.99.
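For illustration, classical momentum and Nesterov’s momentum can both be configured on the Keras SGD optimizer; 0.9 is simply the typical value mentioned above:

from tensorflow.keras.optimizers import SGD

sgd_momentum = SGD(learning_rate=0.01, momentum=0.9)                 # classical momentum
sgd_nesterov = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)  # Nesterov's momentum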
AdaGrad
• AdaGrad is one technique that has been developed to help augment finding the “right”
learning rate.
• AdaGrad is named in reference to how it “adaptively” uses subgradient methods to
dynamically control the learning rate of an optimization algorithm.
• AdaGrad’s learning rate is monotonically decreasing and never rises above whatever the base learning rate was set at initially.
• AdaGrad divides the base learning rate by the square root of the sum of squares of the history of gradient computations for each parameter.
• AdaGrad speeds our training in the beginning and slows it appropriately toward convergence.
RMSProp
• RMSprop is a very effective adaptive learning rate method that was never formally published; it keeps an exponentially decaying average of squared gradients and divides the learning rate by the square root of that average.
• AdaDelta
• AdaDelta is a variant of AdaGrad that keeps only the most recent history rather than
accumulating it like AdaGrad does.
ADAM
• ADAM (a more recently developed updating technique from the University of
Toronto) derives learning rates from estimates of first and second moments of the
gradients.
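For reference, all of these optimizers are available in Keras; a short sketch of how they might be instantiated (the learning rates shown are common starting points, not tuned values):

from tensorflow.keras.optimizers import Adagrad, RMSprop, Adadelta, Adam

optimizers = {
    "adagrad":  Adagrad(learning_rate=0.01),   # accumulates all squared gradients
    "rmsprop":  RMSprop(learning_rate=0.001),  # exponentially decayed gradient history
    "adadelta": Adadelta(),                    # keeps only recent history
    "adam":     Adam(learning_rate=0.001),     # first and second moment estimates
}
# e.g. model.compile(optimizer=optimizers["adam"], loss="categorical_crossentropy")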
Regularization
• Regularization is a measure taken against overfitting.
• Overfitting occurs when a model describes the training set but cannot generalize
well over new inputs.
• Overfitted models have no predictive capacity for data that they haven’t seen.
• Regularization for hyperparameters helps modify the gradient so that it doesn’t
step in directions that lead it to overfit.
• Regularization includes the following:
• Dropout
• DropConnect
• L1 penalty
• L2 penalty
• Dropout
• Dropout is a mechanism used to improve the training of neural networks by randomly omitting hidden units during training.
• It also speeds training.
• Dropout randomly drops a neuron so that it does not contribute to the forward pass or to backpropagation.
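A minimal sketch of Dropout in a Keras model; the 0.5 drop rate and layer sizes are illustrative:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),                     # each unit is dropped with probability 0.5 during training
    layers.Dense(10, activation="softmax"),
])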
DropConnect
• DropConnect does the same thing as Dropout, but instead of choosing a
hidden unit, it mutes the connection between two neurons.
• L1
• The penalty methods L1 and L2, in contrast, are a way of preventing the neural network
parameter space from getting too big in one direction.
• They make large weights smaller. L1 regularization is considered computationally inefficient in the nonsparse case, has sparse outputs, and includes built-in feature selection. L1 regularization penalizes the absolute value of the weights rather than their squares.
• This function drives many weights to zero while allowing a few to grow large, making it
easier to interpret the weights.
• L2
• In contrast, L2 regularization is computationally efficient due to it having analytical
solutions and nonsparse outputs, but it does not do feature selection automatically for us.
• The “L2” regularization function, a common and simple hyperparameter, adds a term to
the objective function that decreases the squared weights. You multiply half the sum of
the squared weights by a coefficient called the weight-cost.
• L2 improves generalization, smooths the output of the model as input changes, and helps
the network ignore weights it does not use.
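In Keras, these penalties can be attached to individual layers through kernel regularizers; a hedged sketch in which the 0.001 coefficients are illustrative:

from tensorflow.keras import layers, regularizers

l1_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l1(0.001))  # penalizes sum of |w|
l2_layer = layers.Dense(64, activation="relu",
                        kernel_regularizer=regularizers.l2(0.001))  # penalizes sum of w^2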
• Mini-batching
• With mini-batching, we send more than one input vector (a group or batch
of vectors) to be trained in the learning system.
• This allows us to use hardware and resources more efficiently at the
computer-architecture level. This method also allows us to compute certain
linear algebra operations (specifically matrix-to-matrix multiplications) in a
vectorized fashion.
• In this scenario we also have the option of sending the vectorized
computations to GPUs if they are present.
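In Keras, the mini-batch size is simply the batch_size argument of fit(); model, x_train and y_train below are assumed to already exist:

# Each gradient update uses a mini-batch of 32 examples
model.fit(x_train, y_train, epochs=10, batch_size=32)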
What is augmentation?
• Data augmentation is a process of artificially increasing the amount of data by
generating new data points from existing data.
• This includes adding minor alterations to data or using machine learning models
to generate new data points in the latent space of original data to amplify the
dataset.
• What is augmentation in CNN?

• A convolutional neural network that can robustly classify objects even if they are placed in different orientations is said to have the property called invariance.
• More specifically, a CNN can be invariant to translation, viewpoint, size or illumination (or a combination of the above).
data augmentation techniques
used for images
• Position augmentation: scaling, cropping, flipping, padding, rotation, translation, affine transformation.
• Color augmentation: brightness, contrast, saturation, hue.
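A hedged sketch of several of these augmentations using the Keras preprocessing layers available in recent TensorFlow versions; the transformation strengths are arbitrary examples:

from tensorflow.keras import layers, models

augmentation = models.Sequential([
    layers.RandomFlip("horizontal"),      # flipping
    layers.RandomRotation(0.1),           # rotation
    layers.RandomZoom(0.1),               # scaling / cropping
    layers.RandomTranslation(0.1, 0.1),   # translation
    layers.RandomContrast(0.1),           # color augmentation (contrast)
])
# Typically placed as the first block of a CNN so new image variants are generated every epoch.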