CNN notes unit-3
Input Layer: This is the layer through which we feed input to the model. In a CNN, the input
is generally an image or a sequence of images. In the running example used here, this layer
holds the raw pixel values of an image with width 32, height 32, and depth 3.
Convolutional Layers: These layers extract features from the input. Each one applies a set of
learnable filters, known as kernels, to the input image. The kernels are small matrices,
usually of shape 2×2, 3×3, or 5×5. Each kernel slides over the input image and computes the
dot product between the kernel weights and the corresponding input image patch. The outputs
of this layer are referred to as feature maps. Suppose we use a total of 12 filters for this
layer: with stride 1 and padding that preserves the spatial size, we get an output volume of
dimension 32 x 32 x 12.
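The 32 x 32 x 12 output above only holds if the spatial size is preserved. A minimal sketch of the standard output-size formula follows. The helper name `conv_output_size`, the 3×3 kernel, and the one pixel of zero padding are assumptions for illustration and are not stated in the notes.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# Assumed settings: 3x3 kernel, stride 1, one pixel of zero padding.
width = conv_output_size(32, kernel_size=3, stride=1, padding=1)
num_filters = 12  # one feature map per filter
output_volume = (width, width, num_filters)
print(output_volume)  # (32, 32, 12)

# Without padding, the same kernel shrinks the feature map.
print(conv_output_size(32, kernel_size=3))  # 30
```

The depth of the output is set only by the number of filters, which is why 12 filters give a depth of 12 regardless of kernel size.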
Activation Layer: Activation layers add nonlinearity to the network by applying an
element-wise activation function to the output of the preceding convolution layer. Some
common activation functions are ReLU: max(0, x), Tanh, and Leaky ReLU. The volume remains
unchanged, hence the output volume will have dimensions 32 x 32 x 12.
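As a rough illustration of "element-wise", the sketch below applies each activation to a short list of values, one value at a time. The helper names and the Leaky ReLU slope of 0.01 are assumptions chosen for the example.

```python
import math

def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Leaky ReLU: pass positives through, scale negatives by a small slope."""
    return x if x > 0 else slope * x

def apply_elementwise(fn, values):
    """Apply fn to every value independently; the length never changes."""
    return [fn(v) for v in values]

feature_values = [-2.0, -0.5, 0.0, 1.5]
print(apply_elementwise(relu, feature_values))  # [0.0, 0.0, 0.0, 1.5]
print(apply_elementwise(leaky_relu, feature_values))
print(apply_elementwise(math.tanh, feature_values))
```

Because each output depends on a single input value, the number of values (and so the volume's dimensions) is unchanged.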
Pooling Layer: This layer is periodically inserted in ConvNets. Its main function is to
reduce the size of the volume, which speeds up computation, reduces memory use, and helps
limit overfitting. Two common types of pooling layers are max pooling and average pooling.
If we use a max pool with 2 x 2 filters and stride 2, the resultant volume will be of
dimension 16 x 16 x 12.
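A minimal sketch of 2 x 2 max pooling with stride 2, written with plain Python lists to stay dependency-free. The function name `max_pool_2x2` and the sample values are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """Max pooling with a 2x2 window and stride 2 on a 2-D list of numbers."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = (feature_map[r][c], feature_map[r][c + 1],
                      feature_map[r + 1][c], feature_map[r + 1][c + 1])
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

small_map = [[1, 3, 2, 0],
             [4, 2, 1, 5],
             [0, 1, 7, 2],
             [3, 2, 1, 0]]
print(max_pool_2x2(small_map))  # [[4, 5], [3, 7]]

# Pooling is applied to each of the 12 feature maps separately,
# so 32 x 32 x 12 becomes 16 x 16 x 12.
channel = [[0] * 32 for _ in range(32)]
pooled_channels = [max_pool_2x2(channel) for _ in range(12)]
print(len(pooled_channels[0]), len(pooled_channels[0][0]), len(pooled_channels))  # 16 16 12
```

Pooling halves the width and height but leaves the depth untouched.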
Flattening: After the convolution and pooling layers, the resulting feature maps are
flattened into a one-dimensional vector so they can be passed into a fully connected layer
for classification or regression.
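To make the flattening step concrete, here is a small sketch that unrolls a nested list into one flat list. The `flatten` helper is hypothetical, and the 16 x 16 x 12 volume of zeros simply stands in for the pooled output of the running example.

```python
def flatten(volume):
    """Flatten a nested list of any depth into one flat list."""
    flat = []
    for item in volume:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(item)
    return flat

tiny = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
print(flatten(tiny))  # [1, 2, 3, 4, 5, 6, 7, 8]

# Stand-in for the pooled 16 x 16 x 12 volume.
pooled = [[[0.0] * 12 for _ in range(16)] for _ in range(16)]
print(len(flatten(pooled)))  # 3072
```

The vector length is just the product of the dimensions: 16 × 16 × 12 = 3072.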
Fully Connected Layers: These take the flattened output of the previous layer and compute
the final classification or regression result.
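One way to picture a fully connected layer is as a weighted sum of every input for each output, plus a bias. The sketch below assumes that standard formulation. The function name and the hand-picked weights are illustrative only; in a real network they would be learned.

```python
def fully_connected(inputs, weights, biases):
    """Each output is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

features = [1.0, 2.0, 3.0]      # a (very short) flattened vector
weights = [[1.0, 0.0, -1.0],    # one row of weights per output
           [0.5, 0.5, 0.5]]
biases = [0.0, 1.0]
print(fully_connected(features, weights, biases))  # [-2.0, 4.0]
```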
Output Layer: For classification tasks, the output of the fully connected layers is fed into
a function such as sigmoid or softmax, which converts the score for each class into a
probability.
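A short sketch of the two functions named above, using only the standard library. The raw scores are made-up values. Subtracting the maximum score inside `softmax` is a common numerical-stability trick rather than something the notes prescribe.

```python
import math

def sigmoid(score):
    """Map one raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))

def softmax(scores):
    """Map a list of raw scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [2.0, 1.0, 0.1]  # hypothetical scores for three classes
probs = softmax(raw_scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
print(sigmoid(0.0))  # 0.5
```

Sigmoid suits a single yes/no output, while softmax is used when exactly one of several classes applies.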
Advantages of CNNs
1. Good at detecting patterns and features in images, videos, and audio signals.
Disadvantages of CNNs
4. Interpretability is limited: it’s hard to understand what the network has learned.
• Convolution layers consist of a set of learnable filters (or kernels) with small widths
and heights and the same depth as the input volume (3 if the input is an RGB
image).
• During the forward pass, we slide each filter across the whole input volume step by
step, where the step size is called the stride (it can have a value of 2, 3, or even 4 for
high-dimensional images), and compute the dot product between the kernel weights
and the patch of the input volume.
• As we slide our filters, we get a 2-D output for each filter. Stacking these outputs
together gives an output volume with a depth equal to the number of
filters. The network learns all the filters.
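The forward pass described in the bullets above can be sketched in plain Python as follows. This is an illustrative, unoptimized version without padding. The function name `convolve`, the 4 x 4 x 1 input, and the two hand-written kernels are assumptions; in a real network the kernel weights are learned.

```python
def convolve(volume, kernels, stride=1):
    """Unpadded convolution of an H x W x C volume with k x k x C kernels.

    Returns an H_out x W_out x num_kernels volume. As in most deep learning
    libraries, the kernel is not flipped (this is cross-correlation).
    """
    height, width = len(volume), len(volume[0])
    k = len(kernels[0])
    out = []
    for r in range(0, height - k + 1, stride):
        out_row = []
        for c in range(0, width - k + 1, stride):
            responses = []
            for kernel in kernels:
                total = 0.0
                # Dot product between the kernel and the input patch.
                for i in range(k):
                    for j in range(k):
                        for d, weight in enumerate(kernel[i][j]):
                            total += weight * volume[r + i][c + j][d]
                responses.append(total)
            out_row.append(responses)  # one value per filter: the depth
        out.append(out_row)
    return out

pixels = [[1, 2, 3, 0],
          [0, 1, 2, 3],
          [3, 0, 1, 2],
          [2, 3, 0, 1]]
image = [[[float(v)] for v in row] for row in pixels]  # depth 1

kernel_a = [[[1.0], [0.0]], [[0.0], [1.0]]]    # sums the main diagonal
kernel_b = [[[1.0], [-1.0]], [[0.0], [0.0]]]   # horizontal difference

result = convolve(image, [kernel_a, kernel_b])
print(len(result), len(result[0]), len(result[0][0]))  # 3 3 2
print(result[0][0])  # [2.0, -1.0]
```

Two filters give an output depth of 2, matching the rule that the output depth equals the number of filters. Calling `convolve(image, [kernel_a, kernel_b], stride=2)` would produce a 2 x 2 x 2 volume instead.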
Using CNNs to Classify Hand-written Digits on MNIST Dataset
The CIFAR-10 dataset consists of 60,000 32 x 32 colour images in 10 classes, with 6,000 images
per class. There are 50,000 training images and 10,000 test images.
The ImageNet dataset has more than 14 million images, hand-labeled across 20,000 categories.
Processing a dataset of this size requires a great amount of computing power in terms of CPU,
GPU, and RAM.
Imagine there’s an image of a bird, and you want to identify whether it’s really a bird or some
other object. The first thing you do is feed the pixels of the image, in the form of arrays, to
the input layer of the neural network (a multi-layer network used to classify things). The
hidden layers carry out feature extraction by performing different calculations and
manipulations. Several kinds of hidden layer, such as the convolution layer, the ReLU layer,
and the pooling layer, perform feature extraction from the image. Finally, there’s a fully
connected layer that identifies the object in the image.
The original image is scanned with multiple convolution and ReLU layers to locate the
features.
Pooling Layer
Pooling is a down-sampling operation that reduces the dimensionality of the feature map. The
rectified feature map goes through a pooling layer to generate a pooled feature map.
Pooling has no learned filters of its own. It keeps the strongest responses in each region, so
the parts of the image detected by the convolution filters (edges, corners, body, feathers,
eyes, and beak) are preserved in a smaller map.
So far, the structure of the convolutional neural network is: input image, convolution layer,
ReLU layer, and pooling layer.
The next step in the process is called flattening. Flattening converts all the resultant
2-dimensional arrays from the pooled feature maps into a single long, continuous linear vector.
The flattened vector is fed as input to the fully connected layer to classify the image.
How exactly a CNN recognizes a bird:
• The pixels from the image are fed to the convolutional layer, which performs the
convolution operation
• This results in a convolved map
• A ReLU function is applied to the convolved map to generate a rectified feature map
• The image is processed with multiple convolution and ReLU layers to locate the
features
• Pooling layers down-sample the rectified feature maps, keeping the strongest responses
for specific parts of the image
• The pooled feature map is flattened and fed to a fully connected layer to get the final
output
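The steps listed above can be chained into a toy end-to-end forward pass. This is only a sketch under strong assumptions: a single-channel square image, one hand-written edge kernel, and hand-picked fully connected weights for two hypothetical classes ("edge" and "no edge"). A trained CNN would learn all of these values.

```python
import math

def conv2d(image, kernel):
    """Unpadded, stride-1 convolution of a square 2-D image with one kernel."""
    k = len(kernel)
    size = len(image) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(size)]
            for r in range(size)]

def relu_map(fmap):
    """Element-wise ReLU over a 2-D feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Max pooling with a size x size window and stride equal to size."""
    n = len(fmap) - len(fmap) % size  # drop any ragged edge
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, n, size)]
            for r in range(0, n, size)]

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(image, kernel, weights, biases):
    convolved = conv2d(image, kernel)            # convolved map
    rectified = relu_map(convolved)              # rectified feature map
    pooled = max_pool(rectified)                 # pooled feature map
    flat = [v for row in pooled for v in row]    # flattening
    scores = [sum(w * x for w, x in zip(row, flat)) + b
              for row, b in zip(weights, biases)]  # fully connected layer
    return softmax(scores)

# A 5x5 image with a vertical edge between the second and third columns.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
edge_kernel = [[-1.0, 1.0], [-1.0, 1.0]]   # responds to dark-to-bright edges
weights = [[1.0, 1.0, 1.0, 1.0],           # hypothetical "edge" class
           [0.0, 0.0, 0.0, 0.0]]           # hypothetical "no edge" class
biases = [0.0, 1.0]

probs = classify(image, edge_kernel, weights, biases)
print(probs)
```

With these values the first class receives the higher probability, because the edge kernel responds strongly and the pooled map carries that response through to the fully connected layer.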
• Activation Layer
The activation layer introduces nonlinearity into the network by applying an activation function
to the output of the previous layer. This is crucial for the network to learn complex patterns.
Common activation functions, such as ReLU, Tanh, and Leaky ReLU, transform the input
while keeping the output size unchanged.
• Flattening
After the convolution and pooling operations, the feature maps still exist in a multi-dimensional
format. Flattening converts these feature maps into a one-dimensional vector. This process is
essential because it prepares the data to be passed into fully connected layers for classification
or regression tasks.
• Output Layer
In the output layer, the final result from the fully connected layers is processed through a
function such as sigmoid or softmax. These functions convert the raw scores into
probabilities, enabling the model to predict the most likely class label.
CNN Evaluation
Several key metrics are used to evaluate your Convolutional Neural Network after its training
process is complete:
• Accuracy
Accuracy tells you the overall percentage of test images that the CNN correctly classifies. It’s
a straightforward measure of how often the model gets the right label.
• Precision
Precision focuses on how precise the CNN is when it predicts a particular class. It measures
the percentage of test images that were predicted as a specific class and actually belong to that
class. High precision means that when the CNN predicts a class, it’s likely correct.
• Recall
Recall looks at how well the CNN identifies all instances of a particular class. It measures the
percentage of test images that are of a certain class and were correctly identified as that class
by the CNN. High recall indicates that the CNN is good at finding all relevant examples of a
class.
• F1 Score
The F1 Score combines precision and recall into a single metric by calculating their harmonic
mean. This is particularly useful for evaluating the CNN’s performance on classes where there’s
an imbalance, meaning some classes are much more common than others. The F1 Score
provides a balanced measure that considers both false positives and false negatives, offering a
more comprehensive view of the CNN’s performance.
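The four metrics above can be computed directly from the true and predicted labels. The sketch below uses made-up labels for six test images. The function names are hypothetical, and returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of test images whose predicted label is correct."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

def per_class_metrics(true_labels, predicted_labels, target):
    """Precision, recall, and F1 score for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return precision, recall, f1

# Made-up labels for six test images.
true_labels = ["bird", "bird", "bird", "cat", "cat", "dog"]
predicted = ["bird", "bird", "cat", "cat", "bird", "dog"]

print(accuracy(true_labels, predicted))
print(per_class_metrics(true_labels, predicted, "bird"))
```

For the "bird" class there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all come out to 2/3, while overall accuracy is 4/6.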
Types of Convolutional Neural Networks
• LeNet : One of the earliest CNN architectures, designed for handwritten digit
recognition. LeNet achieved high accuracy on the MNIST dataset and laid the
groundwork for modern CNNs.
• AlexNet : Its architecture includes five convolutional layers and three fully connected
layers, with innovations like ReLU activation and dropout. AlexNet demonstrated the
power of deep learning, leading to the development of even deeper networks.
• ResNet : Residual Networks introduced the concept of residual connections, allowing
the training of very deep networks without the vanishing-gradient and accuracy
degradation problems that otherwise appear as depth grows. Its architecture uses skip
connections to help gradients flow through the network effectively.
• GoogLeNet: Also known as InceptionNet, it is known for its efficiency and high
performance in image classification. It introduces the Inception module, which allows
the network to process features at multiple scales simultaneously.
• MobileNet: MobileNets are designed for mobile and embedded devices, offering a
balance of high accuracy and computational efficiency. MobileNets reduce the model
size and computational demand while maintaining strong performance in image
classification and keypoint detection.
• VGG : popular in various image recognition tasks, including object detection in self-
driving cars.
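The skip connection mentioned in the ResNet entry above can be sketched in a few lines: the block's input is added back to the output of its layers. This is a simplified illustration on plain lists. `tiny_layer` is a hypothetical stand-in for the convolution layers inside a real residual block.

```python
def residual_block(x, transform):
    """Return transform(x) + x, i.e. a transformation plus a skip connection."""
    fx = transform(x)
    return [f + xi for f, xi in zip(fx, x)]

def tiny_layer(x):
    # Hypothetical stand-in for a stack of conv layers: scale, then ReLU.
    return [max(0.0, 0.1 * v) for v in x]

def zero_layer(x):
    # A layer that has learned nothing useful and outputs zeros.
    return [0.0] * len(x)

x = [1.0, -2.0, 3.0]
print(residual_block(x, tiny_layer))
print(residual_block(x, zero_layer))  # [1.0, -2.0, 3.0]
```

When the inner layers output zeros, the block simply passes its input through unchanged, which is one intuition for why stacking many such blocks does not have to hurt the network.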
Applications of CNN
• Image Classification
CNNs excel at image classification, which involves sorting images into predefined
categories. They can effectively identify whether an image depicts a cat, dog, car, or
flower, making them indispensable for tasks that require sorting and labeling large volumes
of visual data.
• Object Detection
CNNs are particularly skilled in object detection, allowing them to identify and pinpoint
specific items within an image. Whether it's recognizing people, cars, or buildings, CNNs can
locate these objects and highlight their positions, which is crucial for applications needing
accurate object placement and identification.
• Image Segmentation
CNNs are highly effective for tasks that involve breaking down an image into distinct parts.
Image segmentation allows CNNs to distinguish and label different objects or regions within
an image. This capability is essential in fields like medical imaging, where detailed analysis of
structures is required, and in robotics, where intricate scenes need to be understood.
• Video Analysis
CNNs are also adept at video analysis, where they can track objects and detect events over
time. This makes them valuable for applications like surveillance and traffic monitoring, where
continuously analyzing dynamic scenes helps in understanding and managing real-time
activities.
Advantages of CNN
• High Accuracy
Convolutional Neural Networks are known for their exceptional accuracy in image recognition
tasks. They perform impressively in areas like classifying images, detecting objects, and
segmenting visuals, setting a high benchmark for performance in these fields.
• Efficiency
These networks are particularly efficient when used with specialized hardware such as GPUs.
This efficiency allows CNNs to process large amounts of data quickly, which is crucial for
applications that require heavy computational power.
• Robustness
Convolutional Neural Networks handle noisy or inconsistent input data with impressive
resilience. Their ability to maintain performance despite data imperfections makes them
dependable for real-world applications where conditions can vary.
• Flexibility
Another key advantage of Convolutional Neural Networks is their adaptability. They can be
tailored to different tasks simply by altering their architecture. This makes them versatile tools
that can be easily repurposed for diverse applications, from medical imaging to autonomous
vehicles.
Disadvantages of Convolutional Neural Networks (CNNs)
Although Convolutional Neural Networks (CNNs) are powerful, they come with their own
set of challenges:
• Complexity and Training Difficulty
CNNs are intricate, and this complexity can make them difficult to train, especially when
working with large datasets. Managing and fine-tuning the layers requires a deep understanding
of the architecture, making it challenging even for seasoned professionals.
• High Computational Demands
Another significant disadvantage is the high computational power required to train and deploy
CNNs effectively. Advanced hardware, such as GPUs, is often necessary, which increases costs
and limits access for those without these resources. This makes it difficult for smaller
organizations to utilize CNNs efficiently.
• Large Data Requirements
CNNs need a large amount of labeled data to perform well. Gathering and labeling data is time-
consuming and expensive. For more complex applications, such as medical imaging, the
precision needed in data labeling further increases the cost and effort involved.
• Lack of Interpretability
One of the most notable challenges with CNNs is their black-box nature. It’s often difficult to
understand why a CNN makes a certain prediction, which can be a significant issue in areas
where decision-making transparency is important. This lack of interpretability can limit the
trust placed in CNN-based systems, especially in critical applications like healthcare.