Unit-3
Topics:
Introduction to CNNs, Architecture, Convolution/pooling layers, LeNet,
AlexNet, ZF-Net, VGGNet, GoogLeNet, ResNet.
Vision Application: Object Detection – As classification, region proposals,
RCNN, YOLO architectures
Introduction to CNNs:
Convolutional Neural Networks (CNNs) are a class of deep learning models
primarily used for processing structured grid data, such as images. Here’s a
brief introduction:
What is a CNN?
1. Architecture: CNNs consist of several key layers:
o Convolutional Layers: These layers apply convolutional filters to
the input data, capturing spatial hierarchies and features like edges,
textures, and shapes.
o Activation Functions: Typically, a non-linear activation function
like ReLU (Rectified Linear Unit) is applied after convolutions to
introduce non-linearity.
o Pooling Layers: These reduce the dimensionality of the data,
retaining important features while decreasing computational
complexity. Max pooling is a common technique.
o Fully Connected Layers: At the end of the network, fully
connected layers integrate the features learned by previous layers
for classification or regression tasks.
2. Feature Learning: One of the strengths of CNNs is their ability to
automatically learn features from raw data without the need for manual
feature extraction.
3. Applications: CNNs excel in various applications, including:
o Image recognition and classification (e.g., identifying objects in
photos).
o Object detection (e.g., finding and classifying multiple objects in
an image).
o Image segmentation (e.g., partitioning an image into segments for
analysis).
o Medical image analysis (e.g., detecting anomalies in X-rays or
MRIs).
4. Training: CNNs are trained using large datasets and utilize techniques
like backpropagation and gradient descent to minimize the loss function.
Advantages of CNNs
Translation Invariance: CNNs are robust to the position of features in
an image, making them effective for image-related tasks.
Parameter Sharing: The use of shared weights in convolutional layers
significantly reduces the number of parameters, making the model easier
to train and less prone to overfitting.
Conclusion
CNNs have revolutionized the field of computer vision and continue to be a
critical component of many AI systems. Their ability to automatically extract
hierarchical features makes them particularly powerful for a wide range of
tasks.
Architecture of CNN:
CNNs typically consist of several types of layers that work together to
extract features and make predictions. Here’s a breakdown of each component:
1. Input Layer
Function: The input layer accepts the raw data, usually an image in the
case of CNNs. Images are typically represented as 3D tensors (width,
height, channels), where channels correspond to color channels (e.g.,
RGB).
2. Convolutional Layers
Purpose: These layers perform convolution operations to extract features
from the input.
Mechanism:
o A set of learnable filters (or kernels) slides over the input image.
o Each filter convolves across the input to produce feature maps,
capturing local patterns.
o The size of the filter (e.g., 3x3, 5x5) and the number of filters
determine how many features can be learned.
Output: Produces a 3D tensor representing the features detected at
different spatial locations.
3. Activation Functions
Purpose: Introduce non-linearity into the model, allowing it to learn
complex patterns.
Common Choices:
o ReLU (Rectified Linear Unit): Defined as f(x) = max(0, x). It's widely used due to its simplicity and effectiveness.
o Leaky ReLU: A variant of ReLU that allows a small, non-zero
gradient when the input is negative.
o Sigmoid/Tanh: Less commonly used in hidden layers but can be
found in the output layer for binary classification.
4. Pooling Layers
Purpose: Downsample the feature maps to reduce dimensionality and computation while preserving important features.
Types:
o Max Pooling: Selects the maximum value from a patch (e.g., 2x2)
of the feature map.
o Average Pooling: Computes the average value of a patch.
Effect: Helps achieve translation invariance and reduces the risk of
overfitting.
5. Fully Connected Layers (Dense Layers)
Purpose: Connect every neuron from the previous layer to each neuron in
the fully connected layer.
Function: These layers combine the features learned in previous layers to
make final predictions.
Output: Often uses a Softmax activation for multi-class classification or
sigmoid for binary classification.
6. Output Layer
Purpose: Generates the final predictions of the network.
Common Activations:
o Softmax: Converts raw scores into probabilities for multi-class
classification.
o Sigmoid: Outputs a probability for binary classification.
7. Regularization Techniques
Dropout: Randomly sets a fraction of input units to 0 during training to
prevent overfitting.
Batch Normalization: Normalizes the output of a previous layer to
stabilize learning and accelerate training.
8. Training Process
Backpropagation: The method used to calculate gradients and update
weights.
Loss Function: Measures how well the CNN is performing (e.g.,
categorical cross-entropy for multi-class classification).
Optimizer: Algorithms like Adam, SGD, or RMSprop are used to update
the weights based on gradients.
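The pieces above (convolution, ReLU, pooling, batch normalization, dropout, fully connected layers, loss, and optimizer) fit together as in the following minimal sketch, written in PyTorch purely for illustration; the 32x32 RGB input size and 10 output classes are arbitrary choices, not anything fixed by these notes.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolutional layer
            nn.BatchNorm2d(16),                           # batch normalization
            nn.ReLU(),                                    # non-linear activation
            nn.MaxPool2d(2),                              # 2x2 max pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                              # dropout regularization
            nn.Linear(32 * 8 * 8, num_classes),           # fully connected output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SimpleCNN()
criterion = nn.CrossEntropyLoss()                 # categorical cross-entropy loss
optimizer = torch.optim.Adam(model.parameters())  # Adam optimizer
logits = model(torch.randn(1, 3, 32, 32))         # one 32x32 RGB image
print(logits.shape)                               # torch.Size([1, 10])

A training loop would compute criterion(logits, labels), call backward() for backpropagation, and then optimizer.step() to update the weights, exactly as described in step 8.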
Convolution/Pooling Layers:
Convolution and pooling layers are two core building blocks of
Convolutional Neural Networks (CNNs), which are widely used for tasks such
as image classification, object detection, and other visual recognition tasks.
Let's break down each layer in detail.
1. Convolution Layer:
The convolution layer is the most essential part of a Convolutional Neural
Network. Its primary function is to detect local patterns such as edges, textures,
or simple shapes in an image, and pass on this information to the deeper layers
in the network for more complex feature extraction.
How it Works:
Filters/Kernels: The convolution layer consists of a set of small filters
(or kernels). These filters are small in spatial size (e.g., 3x3, 5x5) but
extend through the depth of the input image (e.g., if the input is an RGB
image, the filter would have a depth of 3).
Convolution Operation: The filter moves (or "slides") across the input
image. At each position, the filter performs an element-wise
multiplication between its values and the corresponding values of the
input image, followed by a summation. This process is known as the
convolution operation.
o For example, if the input image is of size 5x5 and the filter is 3x3,
the filter moves across the image and computes a single value at
each location. As it slides across the image, it produces a feature
map (or activation map) showing where specific features are
detected.
Stride: The stride controls how much the filter moves after each
operation. A stride of 1 means the filter moves one pixel at a time,
whereas a stride of 2 means it jumps two pixels.
Padding: Padding is used to ensure that the spatial dimensions of the
input image don't shrink too much after convolution. Zero padding
(adding zeros around the image) is commonly used. Padding helps retain
the image's dimensions after convolution, allowing the network to
preserve important features near the edges of the image.
Output Feature Map: The result of the convolution operation is a
feature map. This map highlights the presence of particular features
detected by the filter at different locations in the input image.
Advantages:
Parameter Sharing: Each filter is applied to the entire image, meaning
the same set of weights is reused across all spatial locations, drastically
reducing the number of parameters.
Local Connectivity: The filters focus on local regions of the image,
which helps in detecting local patterns (such as edges, corners, and
textures).
Example:
If we apply a 3x3 filter to a 5x5 input image with a stride of 1 and no padding, the resulting feature map shrinks to 3x3. With zero padding of 1, the original 5x5 spatial size is preserved:
Input Image: 5x5x3 (RGB image)
Filter: 3x3x3
Stride: 1
Padding: 1
Output: 5x5 feature map (spatial size preserved because of the padding)
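The spatial size of a convolution output follows the standard formula output = (W − F + 2P) / S + 1, where W is the input width/height, F the filter size, P the padding, and S the stride. A plain-Python sketch (no framework assumed) that reproduces the numbers in this example:

def conv_output_size(w, f, s=1, p=0):
    # input size W, filter size F, stride S, padding P
    return (w - f + 2 * p) // s + 1

print(conv_output_size(5, 3, s=1, p=0))  # 3 -> without padding, the 5x5 input shrinks to 3x3
print(conv_output_size(5, 3, s=1, p=1))  # 5 -> zero padding of 1 preserves the 5x5 size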
2. Pooling Layer:
Pooling layers are used to reduce the spatial dimensions (height and width) of the input feature maps while retaining important information. This reduces the number of parameters and computation in the network and also makes the output more invariant to small translations (slight shifts in the position of features).
Types of Pooling:
Max Pooling: This is the most commonly used form of pooling. It works
by selecting the maximum value from a region (typically a 2x2 or 3x3
window) of the feature map.
o For example, if a 2x2 window is applied to a feature map:
[1, 3]
[4, 2]
The max pooling operation would return 4, which is the maximum value in that
window.
Average Pooling: This method takes the average of all the values within
the window, rather than the maximum.
o For the same 2x2 window:
[1, 3]
[4, 2]
The average pooling would return (1 + 3 + 4 + 2) / 4 = 2.5.
Global Pooling: This type of pooling reduces the entire feature map to a
single value by taking the average or maximum value over the whole
map. Global average pooling is commonly used before the fully
connected layers in modern CNN architectures.
How it Works:
Window Size: Pooling is typically applied using a fixed-size window,
such as 2x2 or 3x3.
Stride: Similar to convolution, pooling also has a stride that controls how
much the pooling window moves.
Output Size: Pooling reduces the spatial dimensions of the input feature
map. For example, if you apply a 2x2 pooling with a stride of 2 on a 4x4
feature map, the output will be a 2x2 map.
Advantages:
Dimensionality Reduction: Pooling reduces the size of the feature maps,
which lowers the number of parameters and computation in subsequent
layers.
Translation Invariance: Pooling helps the network become invariant to
small translations of the object in the image. This means that small
changes in the position of the feature don’t affect the final output
drastically.
Prevents Overfitting: By reducing the spatial resolution, pooling helps
in making the model less prone to overfitting.
Example:
For a 4x4 feature map with 2x2 max pooling and a stride of 2, the operation would look like this:
Input:
[1, 3, 2, 4]
[5, 6, 7, 8]
[9, 10, 11, 12]
[13, 14, 15, 16]
Output (2x2):
[6, 8]
[14, 16]
Each output value is the maximum of one non-overlapping 2x2 window of the input.
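A short NumPy sketch that reproduces this 2x2, stride-2 pooling on the 4x4 map above (NumPy is assumed here purely for illustration):

import numpy as np

x = np.array([[ 1,  3,  2,  4],
              [ 5,  6,  7,  8],
              [ 9, 10, 11, 12],
              [13, 14, 15, 16]])

# group the 4x4 map into non-overlapping 2x2 windows:
# axes are (row block, row within block, column block, column within block)
windows = x.reshape(2, 2, 2, 2)
print(windows.max(axis=(1, 3)))   # max pooling     -> [[ 6  8] [14 16]]
print(windows.mean(axis=(1, 3)))  # average pooling -> [[ 3.75  5.25] [11.5  13.5]]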
Summary:
1. Convolution Layer:
o Detects local patterns (edges, corners, textures).
o Applies filters/kernels to the input image.
o Reduces the spatial dimensions based on stride and padding.
o Produces feature maps highlighting the detected features.
2. Pooling Layer:
o Reduces spatial dimensions of feature maps.
o Helps with computational efficiency and reduces overfitting.
o Common types: Max Pooling (picks maximum value), Average
Pooling (averages values).
o Introduces translation invariance.
Together, convolution and pooling layers allow CNNs to effectively detect
features at different scales, reduce dimensionality, and enable efficient learning
in deep networks.
LeNet-5:
LeNet-5 is one of the pioneering Convolutional Neural Network (CNN)
architectures, developed by Yann LeCun and his collaborators in 1998. LeNet-5
was originally designed for handwritten digit recognition, particularly for the
MNIST dataset, which contains images of handwritten digits (0-9). This
architecture demonstrated the power of CNNs and helped set the stage for more
advanced deep learning models.
Here’s a detailed explanation of LeNet-5:
Architecture Overview:
LeNet-5 consists of 7 layers (not counting the input layer) and includes
convolutional layers, subsampling (or pooling) layers, and fully connected
layers. The architecture is relatively simple compared to modern deep networks,
but it was groundbreaking in its time and laid the foundation for many future
CNN architectures.
Below is the architecture layout of LeNet-5:
1. Input Layer:
o Size: 32x32x1
o The input to LeNet-5 is an image of size 32x32 pixels. The images
used in the original LeNet-5 model were pre-processed to this size
(MNIST images are 28x28, so they are zero-padded to 32x32).
o Channels: 1 (grayscale image)
2. Convolutional Layer 1 (C1):
o Size: 6 feature maps, each 28x28
o Filter Size: 5x5
o Stride: 1
o Activation Function: Sigmoid (in the original paper)
o In this layer, 6 convolutional filters (or kernels) of size 5x5 are applied to the 32x32 input image. This results in 6 feature maps, each of size 28x28, because a 5x5 filter with stride 1 and no padding reduces each spatial dimension by 4 pixels.
o This layer detects basic features such as edges or textures in the
image.
3. Subsampling Layer 1 (S2):
o Size: 6 feature maps, each 14x14
o Filter Size: 2x2
o Stride: 2
o Type: Average pooling (in the original paper, although max pooling
is more commonly used today)
o This layer performs subsampling (or pooling) to reduce the spatial
size of the feature maps. It uses a 2x2 filter with a stride of 2,
effectively halving the width and height of the input feature maps.
After this layer, the size of each feature map is reduced to 14x14.
4. Convolutional Layer 2 (C3):
o Size: 16 feature maps, each 10x10
o Filter Size: 5x5
o Stride: 1
o Activation Function: Sigmoid (in the original paper)
o The second convolutional layer consists of 16 filters of size 5x5.
However, not all the filters in this layer connect to all the feature
maps from the previous layer. Instead, some feature maps from the
previous layer are used as input to specific filters, making this layer
more complex than the first convolutional layer. The output is a set
of 16 feature maps of size 10x10.
o The filters in this layer learn to detect more complex features (e.g.,
combinations of edges or textures).
5. Subsampling Layer 2 (S4):
o Size: 16 feature maps, each 5x5
o Filter Size: 2x2
o Stride: 2
o Type: Average pooling
o This second subsampling layer also uses a 2x2 filter with a stride
of 2, halving the size of each feature map from 10x10 to 5x5.
6. Fully Connected Layer 1 (C5):
o Size: 120 units
o After the convolutional and subsampling layers, the 16 feature maps of size 5x5 (400 values) are flattened and passed into a fully connected layer of 120 units, where each neuron is connected to all 400 values from the previous layer. This layer extracts high-level features based on the previous layer's outputs.
o Activation Function: Sigmoid (in the original paper)
7. Fully Connected Layer 2 (F6):
o Size: 84 units
o The output from the previous layer (120 units) is fed into another
fully connected layer with 84 neurons. This layer further processes
the extracted features.
8. Output Layer (Softmax Layer):
o Size: 10 units (one for each digit 0-9)
o The final fully connected layer outputs 10 values, representing the
predicted probabilities for each of the 10 digits in the MNIST
dataset. The Softmax function is applied here, which converts these
raw outputs into probabilities, where the sum of all 10 values is 1.
o The predicted class is the digit with the highest probability.
Detailed Working:
Convolution Layers (C1 and C3): The convolution layers are
responsible for detecting patterns in the input image. Each filter detects
specific features (e.g., edges, corners, or textures). As the network goes
deeper, the filters learn more abstract and complex features.
Subsampling Layers (S2 and S4): These layers reduce the size of the
feature maps, which helps in reducing the computational complexity and
makes the network less prone to overfitting. Subsampling also makes the
model more invariant to small translations and distortions in the input
images.
Fully Connected Layers (C5 and F6): These layers are typical of
traditional neural networks and help combine the features extracted by the
convolutional layers to make the final prediction. The fully connected
layers are responsible for learning complex patterns from the feature
maps and generating the final output.
Training LeNet-5:
LeNet-5 was trained using the backpropagation algorithm and gradient
descent optimization. The network uses the cross-entropy loss for
classification tasks, such as digit recognition.
Key Concepts and Innovations in LeNet-5:
1. Convolutional Layers: LeNet-5 uses convolutional layers to
automatically learn features from the input images, unlike traditional
machine learning methods that required manual feature extraction.
2. Subsampling (Pooling) Layers: Pooling layers help reduce the spatial
size of the feature maps and provide translation invariance. This was a
significant advancement over previous architectures.
3. ReLU Activation (Not in Original): The original LeNet-5 used sigmoid
as the activation function. However, modern implementations of LeNet-5
often use ReLU (Rectified Linear Unit), which helps prevent vanishing
gradients and speeds up training.
4. End-to-End Training: LeNet-5 was one of the first networks to be
trained end-to-end with backpropagation, meaning that all the layers
(convolutional, subsampling, and fully connected) were optimized
together.
Applications:
Handwritten Digit Recognition: LeNet-5 was designed for the MNIST
dataset, where it achieved an accuracy of around 99.2% in digit
recognition.
Character Recognition: The architecture laid the foundation for various
character and digit recognition applications in early machine learning and
computer vision.
LeNet-5 Architecture Summary:
Layer  | Type                  | Output Size | Number of Filters | Filter Size | Stride
Input  | -                     | 32x32x1     | -                 | -           | -
C1     | Convolution           | 28x28x6     | 6                 | 5x5         | 1
S2     | Subsampling (Pooling) | 14x14x6     | -                 | 2x2         | 2
C3     | Convolution           | 10x10x16    | 16                | 5x5         | 1
S4     | Subsampling (Pooling) | 5x5x16      | -                 | 2x2         | 2
C5     | Fully Connected       | 120         | -                 | -           | -
F6     | Fully Connected       | 84          | -                 | -           | -
Output | Softmax               | 10          | -                 | -           | -
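A compact PyTorch sketch that mirrors the table above. ReLU is used in place of the original sigmoid-style activation and plain average pooling stands in for the original subsampling; both substitutions are common in modern re-implementations and are assumptions of this sketch, not part of the 1998 paper.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),    # C1: 32x32x1 -> 28x28x6
            nn.AvgPool2d(kernel_size=2, stride=2),        # S2: 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),   # C3: 14x14x6 -> 10x10x16
            nn.AvgPool2d(kernel_size=2, stride=2),        # S4: 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),        # C5: 400 -> 120
            nn.Linear(120, 84), nn.ReLU(),                # F6: 120 -> 84
            nn.Linear(84, num_classes),                   # Output: 10 class scores
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = LeNet5()(torch.randn(1, 1, 32, 32))  # one 32x32 grayscale image
print(out.shape)                           # torch.Size([1, 10])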
Conclusion:
LeNet-5 is a landmark in the history of deep learning, being one of the first
CNN architectures to achieve practical success in image recognition tasks.
Although it is relatively simple compared to modern architectures like ResNet,
VGG, or Inception, its influence is immense, and it laid the groundwork for the
rapid evolution of deep learning in the years that followed.
AlexNet:
AlexNet is a deep convolutional neural network (CNN) architecture that
was introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in
2012. It won the ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) in 2012 by a significant margin, reducing the top-5 error rate from
25.7% to 16.4%. This breakthrough brought CNNs to the forefront of deep
learning and revolutionized the field of computer vision. AlexNet's success
demonstrated the power of deep learning models when trained on large datasets
with sufficient computational resources.
Here is a detailed breakdown of AlexNet:
Architecture Overview:
AlexNet consists of 8 layers in total, with 5 convolutional layers and 3 fully
connected layers. The architecture also incorporates techniques that were novel
at the time to help train deep networks effectively, such as ReLU activation,
dropout, and data augmentation.
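For reference, the 5-convolutional + 3-fully-connected structure can be inspected through torchvision's implementation of AlexNet (assuming torchvision is installed; loading pretrained ImageNet weights is optional and omitted here):

import torch
from torchvision import models

alexnet = models.alexnet()          # 5 convolutional layers + 3 fully connected layers
print(alexnet.features)             # the conv / ReLU / max-pooling stack
print(alexnet.classifier)           # dropout + fully connected layers (4096, 4096, 1000)

logits = alexnet(torch.randn(1, 3, 224, 224))  # a single 224x224 RGB image
print(logits.shape)                            # torch.Size([1, 1000]) -> ImageNet classes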
Conclusion:
AlexNet was a pivotal moment in the history of deep learning, marking the
beginning of a new era in computer vision. Its success in 2012 demonstrated
that deep neural networks, when trained on large datasets with enough
computational power, could drastically outperform traditional machine learning
methods. The principles introduced in AlexNet—such as ReLU activation,
dropout, and the use of GPUs—have become essential techniques in modern
deep learning architectures.
ZF-Net
ZF-Net (Zeiler and Fergus Net) is a deep learning model introduced in the
paper Visualizing and Understanding Convolutional Networks by Matthew D.
Zeiler and Rob Fergus in 2014. ZF-Net is an improved version of AlexNet and
focuses on both the architecture and the insights gained from visualizing the
intermediate layers of the network. It demonstrated better performance in visual
recognition tasks and helped improve our understanding of how convolutional
neural networks (CNNs) work.
Key Features and Innovations in ZF-Net:
1. Improved Architecture Over AlexNet: ZF-Net made several
modifications to the original AlexNet architecture to improve its
performance:
o Smaller Kernel Sizes: ZF-Net used smaller convolutional filter
sizes compared to AlexNet, improving the accuracy of feature
extraction.
o Expanded Middle Layers: ZF-Net kept AlexNet's overall depth but widened the later convolutional layers, enhancing the network's ability to capture more abstract features from the data.
o Strides and Padding: ZF-Net adjusted strides and padding to
optimize the flow of data through the network and increase the
representational power of the convolutional layers.
2. Visualization of Filters and Activations: One of the most important
contributions of ZF-Net was its work on visualizing the filters and
activations of the network. This helped researchers better understand
how CNNs process and represent visual information. The authors used
deconvolutional networks (also called "deconvnets") to visualize what
features each layer in the network was learning.
3. Learning from Visualization: By examining the filters and activations at
various layers, Zeiler and Fergus were able to identify how the network
learned low-level to high-level features and refined its architecture to
make learning more efficient. This process helped fine-tune the
hyperparameters of the network, such as filter sizes and the number of
filters.
4. Improved Accuracy: ZF-Net was able to achieve a significant
improvement in classification accuracy over AlexNet, both on the
ImageNet challenge and other benchmark datasets. The architecture
improvements allowed ZF-Net to generalize better to unseen data,
reducing overfitting and boosting its performance in visual recognition
tasks.
ZF-Net Architecture:
The architecture of ZF-Net is based on AlexNet but with several key changes
that aim to improve the performance of the network. Here’s a detailed
breakdown of the ZF-Net architecture:
1. Input Layer:
Input size: 224x224x3 (RGB image)
The input consists of images resized to 224x224 pixels.
2. Convolutional Layer 1 (Conv1):
Output size: 55x55x96
Filter size: 7x7
Number of filters: 96
Stride: 2
Activation function: ReLU
The first convolutional layer uses 96 filters of size 7x7 with a stride of 2,
which reduces the spatial resolution of the input. The activation function
used here is ReLU.
3. Max Pooling Layer 1 (MaxPool1):
Output size: 27x27x96
Filter size: 3x3
Stride: 2
Max pooling is applied with a 3x3 filter and a stride of 2, which reduces
the spatial size of the feature map.
4. Convolutional Layer 2 (Conv2):
Output size: 27x27x256
Filter size: 5x5
Number of filters: 256
Stride: 1
Activation function: ReLU
The second convolutional layer consists of 256 filters of size 5x5 with a
stride of 1. This layer extracts more complex features from the previous
layer.
5. Max Pooling Layer 2 (MaxPool2):
Output size: 13x13x256
Filter size: 3x3
Stride: 2
Another max pooling operation reduces the spatial size to 13x13x256.
6. Convolutional Layer 3 (Conv3):
Output size: 13x13x512
Filter size: 3x3
Number of filters: 512
Stride: 1
Activation function: ReLU
The third convolutional layer consists of 512 filters of size 3x3 with a
stride of 1. This layer extracts more abstract and high-level features.
7. Convolutional Layer 4 (Conv4):
Output size: 13x13x512
Filter size: 3x3
Number of filters: 512
Stride: 1
Activation function: ReLU
Similar to Conv3, the fourth convolutional layer consists of 512 filters of
size 3x3. This layer further refines the feature representations.
8. Convolutional Layer 5 (Conv5):
Output size: 13x13x256
Filter size: 3x3
Number of filters: 256
Stride: 1
Activation function: ReLU
The fifth convolutional layer uses 256 filters of size 3x3, leading to a
13x13x256 output size.
Performance of ZF-Net:
Accuracy: ZF-Net achieved significantly better performance than
AlexNet on the ImageNet dataset and other benchmark datasets. It
provided valuable insights into the inner workings of CNNs and set the
stage for further architectural improvements in later models like VGG,
GoogLeNet, and ResNet.
Visualization: The deconvolutional network used in ZF-Net helped
visualize the features learned by each layer in the network. This technique
became an important tool in understanding how CNNs learn hierarchical
representations of visual data.
Summary:
Layer    | Type            | Output Size | Filter Size | Number of Filters | Stride
Input    | -               | 224x224x3   | -           | -                 | -
Conv1    | Convolution     | 55x55x96    | 7x7         | 96                | 2
MaxPool1 | Max Pooling     | 27x27x96    | 3x3         | -                 | 2
Conv2    | Convolution     | 27x27x256   | 5x5         | 256               | 1
MaxPool2 | Max Pooling     | 13x13x256   | 3x3         | -                 | 2
Conv3    | Convolution     | 13x13x512   | 3x3         | 512               | 1
Conv4    | Convolution     | 13x13x512   | 3x3         | 512               | 1
Conv5    | Convolution     | 13x13x256   | 3x3         | 256               | 1
MaxPool3 | Max Pooling     | 6x6x256     | 3x3         | -                 | 2
Flatten  | -               | 9216        | -           | -                 | -
FC1      | Fully Connected | 4096        | -           | -                 | -
FC2      | Fully Connected | 4096        | -           | -                 | -
FC3      | Softmax         | 1000        | -           | -                 | -
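ZF-Net's best-known change is the first convolution: a 7x7 filter with stride 2 (as in the table above) instead of AlexNet's 11x11 filter with stride 4. A quick plain-Python sketch using the output-size formula shows why the smaller filter and stride keep a finer first-layer feature map (the 224x224 input size is taken from the table; the AlexNet settings are the commonly cited ones):

def conv_output_size(w, f, s, p=0):
    # input size W, filter size F, stride S, padding P
    return (w - f + 2 * p) // s + 1

# AlexNet conv1: 11x11 filters, stride 4   vs   ZF-Net conv1: 7x7 filters, stride 2
print(conv_output_size(224, 11, 4))  # coarser first-layer feature map (large stride)
print(conv_output_size(224, 7, 2))   # finer first-layer feature map, retaining more detail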
VGGNet:
VGGNet (Visual Geometry Group Network) is a convolutional neural
network (CNN) architecture introduced by Karen Simonyan and Andrew
Zisserman in 2014, from the University of Oxford's Visual Geometry Group.
VGGNet became widely known for its simplicity and effectiveness in
deep learning, particularly in image classification tasks.
It was the architecture that performed outstandingly well in the ImageNet
Large Scale Visual Recognition Challenge (ILSVRC) 2014, securing the 1st
and 2nd places in the object localization and classification challenges,
respectively.
VGGNet's key contribution lies in its simplicity and the use of very small
convolutional filters (3x3). Despite its simplicity, it demonstrated that depth is
crucial for learning complex visual representations and that networks can be
more powerful by increasing their depth and using uniform structures.
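A small PyTorch sketch of the idea behind VGGNet's uniform design: stacking two 3x3 convolutions covers the same 5x5 receptive field as a single larger filter while using fewer parameters and adding an extra non-linearity. The channel count of 64 is an illustrative assumption.

import torch.nn as nn

def count_params(module):
    return sum(p.numel() for p in module.parameters())

two_3x3 = nn.Sequential(                                     # VGG-style stack, 5x5 receptive field
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
)
one_5x5 = nn.Conv2d(64, 64, kernel_size=5, padding=2)        # single larger filter

print(count_params(two_3x3), count_params(one_5x5))          # the 3x3 stack has fewer parameters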
GoogLeNet:
GoogLeNet is a deep convolutional neural network architecture
introduced by Szegedy et al. in the 2014 ImageNet Large Scale Visual
Recognition Challenge (ILSVRC). It won the ILSVRC 2014 classification
and detection challenges, achieving state-of-the-art performance.
GoogLeNet's key innovation is the Inception Module, which enables it to
have a deep architecture while keeping computational costs relatively low.
The architecture is designed to be more efficient than earlier models like
VGGNet and AlexNet by allowing the network to learn features at multiple
scales while maintaining a relatively small computational footprint.
This innovation laid the foundation for subsequent architectures like
InceptionV2, InceptionV3, and EfficientNet.
Summary:
GoogLeNet (Inception v1) introduced the Inception module, a powerful
and efficient way of processing images with multiple filter sizes in
parallel.
1x1 convolutions help reduce computational cost by shrinking feature
maps before applying larger convolutions.
The architecture uses global average pooling instead of fully connected
layers to significantly reduce the number of parameters.
GoogLeNet has set the stage for future developments in deep learning
architectures, especially in the realm of efficient and scalable CNN
models.
Summary of GoogLeNet (Inception V1) Architecture:
Innovative Inception modules using various filter sizes (1x1, 3x3, 5x5)
and pooling in parallel.
A global average pooling layer at the end instead of fully connected
layers.
Deep architecture with 22 layers.
Achieved high accuracy on the ImageNet challenge with a relatively
low number of parameters compared to previous architectures like
VGGNet.
GoogLeNet laid the groundwork for Inception V2, V3, and V4 architectures,
which introduced further improvements in terms of optimization, depth, and
accuracy.
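A simplified PyTorch sketch of an Inception-style module: 1x1 convolutions shrink the channel count before the more expensive 3x3 and 5x5 branches, and all branches run in parallel and are concatenated. The specific channel numbers here are illustrative, not the published GoogLeNet settings.

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)      # 1x1 branch
        self.branch3 = nn.Sequential(                           # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 24, kernel_size=1), nn.ReLU(),
            nn.Conv2d(24, 32, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(                           # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(                       # pooling branch + 1x1
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 16, kernel_size=1),
        )

    def forward(self, x):
        # run all branches in parallel and concatenate along the channel axis
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

out = InceptionBlock(64)(torch.randn(1, 64, 28, 28))
print(out.shape)  # torch.Size([1, 96, 28, 28]) -> 32 + 32 + 16 + 16 channels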
ResNet:
ResNet (Residual Networks) is a groundbreaking deep learning
architecture introduced by Kaiming He et al. in 2015 in their paper "Deep
Residual Learning for Image Recognition".
ResNet revolutionized the design of deep convolutional neural networks
(CNNs) by introducing the concept of residual learning, which enables the
training of very deep networks without the problem of vanishing gradients.
ResNet achieved state-of-the-art performance in the ILSVRC 2015
(ImageNet Large Scale Visual Recognition Challenge), winning the competition
with an impressive error rate of 3.57%.
Key Concepts and Innovations in ResNet:
1. Residual Learning:
o The core innovation in ResNet is residual learning, which introduces residual connections (also known as skip connections) between layers. Instead of learning the desired mapping directly, each block learns the residual, i.e., the difference between the desired output and the block's input. Learning this residual mapping makes training deep networks much easier and more efficient.
o Mathematically, if the original function we want to learn is H(x), ResNet learns the residual F(x) = H(x) − x, where x is the input. The actual output of the residual block is F(x) + x. This helps to avoid the degradation problem in very deep networks.
2. Skip Connections:
o A skip connection is a shortcut from one layer to a later layer,
skipping over intermediate layers. The key idea is that the skip
connection allows gradients to flow more easily back through the
network during training, mitigating the vanishing gradient problem
and making it possible to train networks with many more layers
(e.g., 50, 101, or even 152 layers).
o These connections directly add the input of a layer to its output,
allowing the network to learn identity mappings, which helps the
network to learn residuals without degrading performance as the
depth of the network increases.
3. Deeper Networks:
o By using residual connections, ResNet enables the training of very
deep networks. Traditional networks would suffer from the
vanishing gradient problem as the number of layers increased,
making it difficult to train them. However, ResNet allows deeper
architectures by bypassing the intermediate layers, which helps to
maintain the gradient flow and improves convergence during
training.
o ResNet demonstrated that networks could be much deeper (e.g.,
ResNet-152 with 152 layers) and still achieve superior
performance without suffering from overfitting or degradation
issues.
4. Building Blocks of ResNet – Residual Blocks:
o A residual block is the basic unit of ResNet. A typical residual
block contains two or more convolutional layers with a skip
connection that bypasses the intermediate layers.
o For instance, in the simplest form of residual block (a 2-layer
residual block):
Input (x) is passed through two convolutional layers with
activations (e.g., ReLU) and batch normalization.
The input x is then added to the output of the convolutional
layers (this is the residual connection).
The output of the block is F(x)+x, where F(x) is the output
of the stacked layers and x is the original input.
o The structure helps prevent the degradation problem in deeper networks (a minimal code sketch of such a block appears after this list).
5. Batch Normalization (BN):
o Batch normalization is used in ResNet to improve training speed
and stability. BN normalizes the activations of each layer by
adjusting and scaling them, reducing the internal covariate shift.
This helps improve convergence rates and also reduces the
dependence on careful initialization of weights.
6. Global Average Pooling (GAP):
o At the end of the network, global average pooling (GAP) is used
instead of fully connected layers. GAP reduces each feature map to
a single value by averaging over the spatial dimensions, which
significantly reduces the number of parameters in the network. This
makes the network more computationally efficient and helps to
prevent overfitting.
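Based on the residual block described in point 4 above, here is a minimal PyTorch sketch of a basic (identity-shortcut) block. It assumes the input and output have the same number of channels and spatial size, so the input x can be added directly without a projection on the skip path.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # basic 2-layer residual block: output = F(x) + x
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))   # first conv + batch norm + ReLU
        out = self.bn2(self.conv2(out))            # second conv + batch norm (this is F(x))
        return self.relu(out + x)                  # skip connection adds the input x

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56]) -> unchanged shape, so x adds directly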
ResNet Architecture:
ResNet's architecture is built using residual blocks, and the depth of the
network (e.g., ResNet-18, ResNet-50, ResNet-101, ResNet-152) determines
how many residual blocks are stacked. Here is an overview of the architecture
for ResNet-50, which is a common variant of ResNet with 50 layers:
1. Input Layer:
Input Size: 224x224x3 (RGB image)
The input image is resized to 224x224 pixels with 3 colour channels
(RGB).
2. Initial Convolution Layer:
Convolutional Layer (Conv1):
o Filter Size: 7x7
o Number of Filters: 64
o Stride: 2
o Activation: ReLU
o The initial layer uses a large kernel to capture low-level features.
3. Max Pooling Layer:
Max Pooling Layer (MaxPool1):
o Filter Size: 3x3
o Stride: 2
o Pooling helps reduce spatial dimensions and computational cost.
4. Residual Blocks:
The network consists of several residual blocks stacked together. Each
block contains two or three convolutional layers, and the input is added to
the output via a skip connection.
For ResNet-50, the residual blocks are organized into 4 stages:
o Stage 1: 3 residual blocks (with 64 filters)
o Stage 2: 4 residual blocks (with 128 filters)
o Stage 3: 6 residual blocks (with 256 filters)
o Stage 4: 3 residual blocks (with 512 filters)
o Each block contains 1x1, 3x3, and 1x1 convolutions, where the
1x1 convolutions are used for dimensionality reduction and
expansion, and the 3x3 convolutions are used for feature
extraction.
5. Global Average Pooling:
After the last residual block, global average pooling is applied to the
feature maps. This operation reduces each spatial feature map to a single
scalar value by averaging the values over the spatial dimensions.
6. Fully Connected Layer (FC):
Fully Connected Layer:
o After global average pooling, the output is passed through a fully
connected layer to produce the final classification output.
o The final output is a vector of class scores, with a softmax
activation to produce probabilities.
7. Output Layer:
Softmax Activation:
o The output is passed through a Softmax layer, which produces a
probability distribution over the classes in the classification task
(e.g., for 1000 ImageNet classes).
ResNet Variants:
ResNet-18: Contains 18 layers, suitable for smaller and less complex
tasks.
ResNet-34: Contains 34 layers, commonly used in many practical
applications.
ResNet-50: Contains 50 layers, widely used for deeper architectures.
ResNet-101: Contains 101 layers, offering a more expressive model for
complex tasks.
ResNet-152: Contains 152 layers, the deepest ResNet model, providing
state-of-the-art accuracy.
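For experimentation, these variants are available as reference implementations in torchvision (assuming torchvision is installed; loading pretrained ImageNet weights is optional and omitted here):

import torch
from torchvision import models

resnet50 = models.resnet50()                    # 50-layer variant built from bottleneck blocks
# models.resnet18 / resnet34 / resnet101 / resnet152 follow the same interface
logits = resnet50(torch.randn(1, 3, 224, 224))  # a single 224x224 RGB image
print(logits.shape)                             # torch.Size([1, 1000]) -> ImageNet classes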
Region Proposals:
Region proposals in Convolutional Neural Networks (CNNs) are a
fundamental aspect of many object detection algorithms. The purpose of region
proposals is to identify candidate regions in an image that might contain objects,
which are then processed by the network to classify and localize the objects.
These proposals are crucial for reducing the computational cost of object
detection and improving the accuracy of the network.
1. What are Region Proposals?
In the context of object detection, a region proposal is a potential bounding
box in an image where an object might be located. The network generates a set
of these proposals, and each one is then evaluated to determine whether it
contains an object and which object it contains.
Region proposals are used in conjunction with techniques like Region-based
CNNs (R-CNN), Fast R-CNN, Faster R-CNN, and Mask R-CNN to speed up
and improve the accuracy of object detection tasks.
2. Traditional Region Proposal Methods:
Before deep learning, region proposals were generated using traditional
computer vision techniques like selective search and edge boxes.
Selective Search:
Selective search is an algorithm that combines multiple strategies, such as segmenting an image into superpixels and merging these regions iteratively based on similarity criteria (colour, texture, etc.).
It uses a graph-based approach to propose regions that are likely to
contain objects, but it is computationally expensive.
Edge Boxes:
Edge Boxes is another method that relies on detecting edges and
generating candidate regions that are likely to have an object based on
edge strength and compactness.
It is more efficient than selective search but still not fully integrated with
deep learning models.
3. Region Proposal Networks (RPN)
Region Proposal Networks (RPNs) are a type of CNN-based method used
to generate region proposals. They were introduced in the Faster R-CNN
framework, which is one of the most widely known approaches in object
detection. The RPN can predict the likelihood of whether a region contains an
object and, if so, generate bounding boxes for the detected objects.
Key components of RPN:
Sliding Window: The RPN uses a sliding window approach, where the
network moves a small window across the feature map generated by a
backbone CNN (such as ResNet or VGG).
Anchor Boxes: At each sliding window position, the RPN generates
multiple anchor boxes of different aspect ratios and scales. These anchor
boxes are predefined bounding boxes, and they serve as reference
points for generating proposals.
Objectness Score: For each anchor box, the RPN predicts two things:
o Objectness Score: A binary classification score indicating whether
the anchor box contains an object (as opposed to background).
o Bounding Box Refinement: Adjustments (translations and
scalings) to the anchor boxes to make them more precise and match
the object boundaries better.
Region Proposal Layer (RPL): The RPN layer outputs a set of proposals
(bounding boxes), which are ranked based on the objectness score.
Typically, a non-maximum suppression (NMS) technique is applied to
filter out redundant and overlapping proposals.
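Since the RPN relies on non-maximum suppression (NMS) to prune overlapping proposals, a minimal NumPy sketch of greedy NMS is given below. Boxes are assumed to be in [x1, y1, x2, y2] format and the IoU threshold of 0.5 is an illustrative choice.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    # boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) objectness scores
    order = scores.argsort()[::-1]               # highest-scoring proposals first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))                      # keep the best remaining box
        # intersection of the kept box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                    (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou < iou_threshold]   # drop boxes that overlap the kept box too much
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -> the second box heavily overlaps the first and is suppressed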
4. Importance of Region Proposals in Object Detection
Region proposals significantly enhance the efficiency of object detection.
Without proposals, a model would need to scan the entire image (or feature
map) to detect all possible object locations, which is computationally expensive.
Proposals focus the model on high-confidence regions and reduce the
computational burden.
Before RPNs, methods like Selective Search were commonly used for
generating region proposals. These methods were computationally expensive
and not end-to-end trainable. The RPN, by contrast, is part of the CNN and can
be trained end-to-end to optimize both feature extraction and proposal
generation, improving speed and accuracy.
5. Integration with CNNs
Backbone CNN: A backbone CNN (e.g., VGG, ResNet) is used to
extract feature maps from the input image. These feature maps capture
the spatial hierarchies of the image, which are essential for detecting
objects.
RPN Layer: The RPN is typically applied after the backbone CNN. It
takes the feature map from the backbone network and generates region
proposals.
Proposal Refinement: After the region proposals are generated, they
undergo a further stage of classification and refinement (through a second
CNN) to detect specific objects and refine bounding boxes.
R-CNN:
R-CNN (Region-based Convolutional Neural Network) is a pioneering
deep learning-based approach to object detection in computer vision. It was
introduced in 2014 by Ross B. Girshick, along with other researchers, and
represented a significant breakthrough in how computers detect and localize
objects in images.
R-CNN made use of convolutional neural networks (CNNs) for feature
extraction and integrated traditional computer vision techniques (like region
proposals) to detect objects. R-CNN marked the beginning of deep learning's
dominance in object detection tasks and has influenced subsequent
developments in the field.
Key Aspects of R-CNN in Computer Vision
R-CNN is designed to address the problem of object detection, which
involves both localizing objects in an image (i.e., identifying their position) and
classifying them (i.e., determining what object it is).
Prior to R-CNN, traditional methods often relied on hand-crafted features
such as HOG (Histograms of Oriented Gradients), SIFT (Scale-Invariant
Feature Transform), or Haar features, combined with machine learning
classifiers (like SVMs).
However, these methods had limitations in terms of robustness and
performance.
R-CNN changed the landscape by combining CNNs (which automatically
learn features from raw pixel data) with region proposal techniques,
significantly improving the accuracy and performance of object detection
systems.
R-CNN Architecture in Detail
The core idea of R-CNN is to first generate region proposals and then
use a CNN to extract features from those regions, followed by a classification
and bounding box regression step to detect objects in the image. Here's a
breakdown of R-CNN's architecture:
1. Region Proposal Generation
Selective Search is used to generate region proposals.
The idea behind region proposals is to identify parts of an
image that are likely to contain objects. Instead of scanning
the entire image with sliding windows (which is
computationally expensive), R-CNN uses Selective Search,
a method that combines multiple strategies for generating
regions. These strategies include over-segmentation, where
the image is divided into small regions (called super pixels),
and then merging similar regions based on colour, texture,
size, and other criteria.
Selective Search generates around 2,000 region proposals
per image, which are then passed to the next stage. These
proposals are potential areas where objects could be located.
2. Feature Extraction Using a CNN
After generating the region proposals, R-CNN resizes each region
proposal to a fixed size (e.g., 224x224 pixels) and feeds them into a pre-
trained CNN (typically AlexNet in the original paper, though other
architectures like VGGNet or ResNet could also be used).
The CNN processes the region and extracts high-level
features. These features represent important information
about the content of the region, such as texture, shape, and
object parts.
The CNN is a deep network trained to detect features
relevant to object recognition. In the case of R-CNN, the
features from the final layers of the CNN are used to
describe the region proposal.
3. Object Classification Using SVMs
Once the CNN has extracted features from the region proposals,
these features are used for classification.
For each region proposal, a Support Vector Machine
(SVM) is used to classify the object. R-CNN trains an SVM
for each object class (e.g., "car," "dog," "cat") to determine
which object, if any, the region proposal corresponds to.
R-CNN uses a binary classifier for each class to determine
whether the object in the region proposal belongs to that
class or is background. This classification process works by
feeding the CNN features of each region into a trained SVM
for each class.
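To summarize the three stages, here is a structural Python sketch of the R-CNN pipeline. selective_search, cnn_features, and svm_scores are stand-in stub functions written only to show the data flow described above; they are not real library calls, and the random features and weights are placeholders.

import numpy as np

def selective_search(image):
    # stand-in: the real method returns roughly 2,000 candidate [x1, y1, x2, y2] boxes
    return [(0, 0, 50, 50), (10, 10, 120, 90), (30, 40, 200, 180)]

def cnn_features(image, box):
    # stand-in: the real step crops the region, resizes it (e.g., 224x224)
    # and runs a pretrained CNN; here a fixed-length random vector is returned
    rng = np.random.default_rng(abs(hash(box)) % (2**32))
    return rng.standard_normal(4096)

def svm_scores(features, class_weights):
    # one linear SVM per class: score = w . f + b
    return {name: float(w @ features + b) for name, (w, b) in class_weights.items()}

image = np.zeros((224, 224, 3))                       # dummy input image
classes = {"car": (np.random.randn(4096), 0.0),       # placeholder per-class SVM weights
           "dog": (np.random.randn(4096), 0.0)}

for box in selective_search(image):                   # 1. region proposal generation
    features = cnn_features(image, box)               # 2. CNN feature extraction per region
    scores = svm_scores(features, classes)            # 3. per-class SVM classification
    print(box, max(scores, key=scores.get))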
YOLO Architecture:
The YOLO (You Only Look Once) architecture is a deep learning model
for real-time object detection, introduced by Joseph Redmon and colleagues in
2015. YOLO is known for its efficiency and speed, making it highly suitable for
real-time applications like video analysis, autonomous vehicles, and
surveillance systems.
The key idea behind YOLO is to frame object detection as a single
regression problem, which allows the model to predict bounding boxes and
class probabilities directly from the image in one pass, as opposed to previous
methods like R-CNN that required multiple stages.
Key Components of the YOLO Architecture
1. Input Image
YOLO takes an image as input, typically resized to a fixed size
(e.g., 416x416 or 608x608). The image is fed into the neural network, which
outputs bounding boxes, class probabilities, and objectness scores.
2. Grid Division
YOLO divides the input image into a grid of cells. For example, an
image of size 416x416 might be divided into a 13x13 grid, where each cell is
responsible for detecting objects whose centre falls within the cell.
3. Prediction for Each Grid Cell
For each grid cell, YOLO predicts several things:
o Bounding boxes: Each grid cell predicts multiple bounding boxes
(usually 2-5) with associated coordinates (x, y, width, height).
o Objectness score: A probability indicating whether an object is
present in the bounding box.
o Class probabilities: A vector representing the likelihood of each
class for the detected object.
Each bounding box prediction includes:
o (x, y): Coordinates of the centre of the box, relative to the grid cell.
o (w, h): Width and height of the box, relative to the entire image.
o Confidence score: Measures how confident the model is that the
box contains an object and the accuracy of its bounding box.
4. Output Layer
The output of YOLO is a tensor whose size depends on the number of grid cells, the number of bounding boxes per cell, and the number of classes. For example:
Number of grid cells: If the image is divided into S×S grid cells (e.g.,
13x13 for 416x416 image), the model produces a tensor of shape
S×S×B×(5+C), where:
o S is the number of grid cells.
o B is the number of bounding boxes predicted per cell (usually 2-5).
o 5 corresponds to the bounding box parameters: (x, y, w, h,
confidence)
o C is the number of object classes.
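A tiny sketch that computes the output tensor size for the settings mentioned above; S = 13, B = 5, and C = 20 are illustrative values (a 13x13 grid, 5 boxes per cell, 20 object classes).

S, B, C = 13, 5, 20        # grid size, boxes per cell, number of classes
per_box = 5 + C            # x, y, w, h, confidence, plus C class probabilities
print((S, S, B, per_box))  # output tensor shape: (13, 13, 5, 25)
print(S * S * B * per_box) # total number of predicted values per image: 21125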