
Machine Learning for Computer Vision

Convolutional Neural Networks

Mehdi Zakroum

International University of Rabat

Acknowledgments for slides: Courtesy of Prof. Mounir Ghogho.


Outline

1. Introduction

2. Limitations of MLPs for Computer Vision

3. The Convolution Operation

4. Convolutional Neural Networks

1. Introduction

Image Classification Using Computer Vision

Assume we would like to develop a classifier to detect a swan in images.

Image analysis: the swan has certain characteristics that can be used to help determine whether a swan is present or not, such as its long neck, its white color, etc.


Image Classification is Challenging!

For some images, it may be more difficult to determine whether a swan is present:

The features are still present in the above image, but it is more difficult for us to pick out these characteristic features.


Image Classification is Challenging!

Extreme cases of swan classification:


Image Classification is Challenging!

Worst case scenarios:


Image Classification: Representation Learning

▶ Classical classifiers were based on manual feature engineering. Researchers built many computer vision techniques to cope with the hard detection cases: SIFT, FAST, SURF, BRIEF, etc. But these classifiers end up either too general or too over-engineered.
▶ We need a system that can do Representation Learning (or Feature Learning): a technique that allows a system to automatically find the relevant features for a given task. It replaces manual feature engineering.
▶ There are several techniques for Feature Learning: supervised (neural networks!) and unsupervised (K-means, PCA, ...).

2. Limitations of MLPs for Computer Vision

Image Classification Using Multilayer Perceptron (MLP)

A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function.

An MLP has no sharing of weights: every connection carries its own parameter...


Limitations of MLP: High Computation Cost

▶ Images used in computer vision are often 224 × 224 or larger, so an MLP to classify color (RGB) images would have 224 × 224 × 3 = 150528 input features! If the hidden layer has 1024 nodes, the ANN would have 150528 × 1024 ≈ 154 million weights in the first layer alone. So, a deep MLP would be nearly impossible to train.
▶ So, we need to find a way to reduce the number of weights while keeping the network deep.
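To make the cost concrete, a quick sketch of the arithmetic using the layer sizes from this slide:

```python
# Weight count for a fully connected first layer on a 224x224 RGB image.
inputs = 224 * 224 * 3        # 150528 input features
hidden = 1024                 # nodes in the first hidden layer
weights = inputs * hidden     # weights in the first layer alone (biases excluded)
print(f"{weights:,}")         # -> 154,140,672, i.e. 150+ million
```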


Limitations of MLP: Object’s Position Changes

▶ An ANN trained to detect cats should be able to do so regardless of where the cat appears in the image. Imagine training an MLP that works well on a certain cat image; when testing it on a slightly shifted version of the same image, the cat would not activate the same neurons, so the network's output may be very different!
▶ So, we need to find a way to make the ANN shift-invariant.


Limitations of MLP: Object’s Position Changes

Why does an MLP work for the MNIST dataset?

Each image in the MNIST dataset is 28 × 28 and contains a centered, grayscale digit. An MLP works well only because the MNIST images are small and centered, so there are no issues with size or shifting. However, most real-world image classification problems aren't this easy!


Limitations of MLP: Variability of Image Size

▶ An MLP trained to detect suricates in images of a fixed size will not work on images of a different size, because the suricate would not activate the same neurons, so the network's output may be very different!
▶ Therefore, we need to find a way to make the ANN work with inputs
of variable size.


Limitations of MLP

Convolutional Neural Networks (CNN) address these 3 limitations!

3. The Convolution Operation

Discrete 1D and 2D Convolutions

▶ 1-dimensional convolution:

$$s(n) = (x \star k)(n) = \sum_{i=-\infty}^{+\infty} x(n-i)\, k(i)$$

▶ 2-dimensional convolution:

$$S(m,n) = (X \star K)(m,n) = \sum_{i=-\infty}^{+\infty} \sum_{j=-\infty}^{+\infty} X(m-i,\, n-j)\, K(i,j)$$

▶ 1D convolutions are used to process time series and 2D convolutions are used to process images.
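As a minimal illustration, NumPy's built-in convolve implements the 1D case directly (the signal and kernel values below are made up for the example):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # a short time series
k = np.array([0.25, 0.5, 0.25])          # a small smoothing kernel
s = np.convolve(x, k, mode="same")       # s(n) = sum_i x(n-i) k(i)
print(s)                                 # -> [1.  2.  3.  4.  3.5]
```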


Discrete 1D and 2D Convolutions

▶ 1-dimensional convolution of a signal with a finite-support kernel (or filter) k(n):

$$s(n) = (x \star k)(n) = \sum_{i=-I}^{I} x(n-i)\, k(i)$$

▶ 2-dimensional convolution of a 2D signal (e.g. an image) with a finite-support 2D kernel (or filter) K(m,n):

$$S(m,n) = (X \star K)(m,n) = \sum_{i=-I}^{I} \sum_{j=-J}^{J} X(m-i,\, n-j)\, K(i,j)$$
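A from-scratch sketch of the finite-support 2D case, written with explicit loops for clarity rather than speed (in practice one would call a library routine such as scipy.signal.convolve2d):

```python
import numpy as np

def conv2d(X, K):
    # 'Valid' 2D convolution: the kernel is flipped so the code matches the
    # definition S(m, n) = sum_ij X(m-i, n-j) K(i, j) above.
    Kf = K[::-1, ::-1]            # flipping turns correlation into convolution
    kh, kw = Kf.shape
    oh = X.shape[0] - kh + 1      # output height (no padding)
    ow = X.shape[1] - kw + 1      # output width
    S = np.zeros((oh, ow))
    for m in range(oh):
        for n in range(ow):
            S[m, n] = np.sum(X[m:m + kh, n:n + kw] * Kf)
    return S
```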


Example of 2D Convolution

Figure 1: Image processing example; the Sobel Gx filter is used for edge detection (in the horizontal direction).


Example of 2D Convolutions

▶ Sobel kernel for horizontal edge detection:

▶ A filter to sharpen an image:
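The slide's kernel values are in its (missing) figures; as a sketch, here are the standard Sobel Gx kernel and a common 3×3 sharpening kernel, applied with SciPy:

```python
import numpy as np
from scipy.signal import convolve2d

sobel_gx = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]])   # responds to intensity changes in the horizontal direction

sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])  # boosts each pixel relative to its neighbors

image = np.random.rand(64, 64)      # stand-in for a grayscale image
edges = convolve2d(image, sobel_gx, mode="same", boundary="fill")  # zero-padded borders
sharp = convolve2d(image, sharpen, mode="same", boundary="fill")
```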


Convolution at the Edge: Padding


▶ What happens at the edges of the image? How to avoid the border effect?
▶ What to do to get the convolution output to have the same size as the input image? The answer is padding (with zeros).
▶ Two padding variants:
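(The slide's figure is not included; the two variants are commonly known as 'valid', with no padding, and 'same', with enough zero-padding to preserve the input size.) A minimal sketch of zero-padding with NumPy, sized for a 3×3 kernel:

```python
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)
padded = np.pad(image, pad_width=1)   # one ring of zeros: 5x5 -> 7x7
# A 'valid' 3x3 convolution of the padded image yields a 5x5 output,
# i.e. the same size as the original image ('same' padding).
print(padded.shape)                   # -> (7, 7)
```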


Example of Padding

4. Convolutional Neural Networks

Main Motivation for CNN


▶ In natural images, we know pixels are most useful in the context of their neighbors. Objects in
images are made up of small, localized features, like the circular iris of an eye. Doesn’t it seem
wasteful for every node in the first hidden layer to look at every pixel?
▶ This spatial information property can be leveraged to reduce the number of weights in a deep neural
network.
▶ One of the main problems with MLP is that spatial information is lost when the image is flattened into
an input vector.


Main Ideas for CNN: Sparse Interactions

Sparse interactions: the influence of nearby pixels on a pixel can be measured by weighted combinations of that pixel with its nearby pixels. In a CNN, this operation is the convolution, and the weights of the combinations are the weights of the kernels/filters. A filter of a specific size (often 3×3 or 5×5) is moved across the image from top left to bottom right.

Figure 2: Sparse Interactions in CNN
Figure 3: Dense Interactions in CNN


Main Ideas for CNN: Parameter Sharing

▶ Parameter Sharing: in addition to sparsity, the same parameters are used to analyze different parts of the input. This further reduces the number of weights to learn. (In an MLP, each weight is used exactly once when computing the output of a layer.)
▶ In a CNN, a few dozen kernels/filters are used in order to detect different features. Example: for human face detection, one filter could be associated with seeing noses; the nose filter would give an indication of how strongly a nose seems to appear in the image, and how many times and in what locations it occurs.
▶ With a CNN, we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.


Convolution Layer


Convolution Layer Example


Feature Maps: The Convolution Layer Output

▶ Each filter of the convolution layer is convolved with the entirety of the 3D input cube but generates a 2D feature map.
▶ Since we have multiple filters, we obtain multiple feature maps, which are represented as a 3D output: each slice is a 2D feature map, and the depth of the 3D representation is given by the number of filters.
▶ Convolving the image with a filter produces a feature map that highlights the presence of a given feature in the image.
▶ In a convolution layer, multiple filters are applied to the image to extract different features. But most importantly, we are learning those filters!
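A shape-level sketch of a convolution layer in PyTorch (the filter count, kernel size, and image size here are illustrative, not taken from the slides):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(1, 3, 224, 224)   # one RGB image
fmaps = conv(x)                   # each of the 32 filters spans all 3 input channels
print(fmaps.shape)                # but yields one 2D map -> torch.Size([1, 32, 224, 224])
```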


Feature Maps Example

Figure 4: Visualization of conv1 filters from AlexNet
Figure 5: Visualization of feature maps at different layers from AlexNet


The Convolution Stride

▶ The convolution stride specifies how much we move the filter window
at each step.
▶ In the previous examples, the stride was equal to 1, i.e. the filter
convolves around the input volume by shifting one unit at a time.
▶ In practice, the stride may be different from 1.

Figure 6: stride = 1
Figure 7: stride = 2


Relationship Between Different Dimensions

$$O = \frac{H - K + 2P}{S} + 1$$

▶ O: output height/length
▶ H : input height/length
▶ K : the filter size
▶ P : the padding
▶ S : the stride.
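A worked check of the formula (the sizes are arbitrary; integer division plays the role of the floor when the stride does not divide evenly):

```python
def conv_output_size(H, K, P, S):
    # O = (H - K + 2P) / S + 1, floored for non-divisible strides
    return (H - K + 2 * P) // S + 1

assert conv_output_size(H=5, K=3, P=1, S=2) == 3      # 5x5 input -> 3x3 output
assert conv_output_size(H=224, K=3, P=1, S=1) == 224  # 'same' padding at stride 1
```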


Non-linearity

▶ For any kind of ANN to be powerful, it needs to contain non-linearity, and a CNN is no different. We pass the result of the convolution operation through a non-linear activation function (often a ReLU).
▶ So the values in the final feature maps are not actually the results of the convolutions, but the outputs of the activation function.
▶ The activation function helps to decide if the neuron would fire or not;
firing would indicate the presence of the feature that the feature map
aims to highlight.
▶ The non-linearity is often omitted in the CNN figures for simplicity.
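In code, the ReLU is just an elementwise clamp applied to each feature map, e.g.:

```python
import numpy as np

def relu(feature_map):
    # Negative responses are zeroed out; positive ones ("firing") pass through.
    return np.maximum(0.0, feature_map)
```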


Pooling

▶ After a convolution operation and non-linearity, we usually perform pooling to reduce the dimensionality.
▶ Pooling reduces the number of elements in each feature map, shrinking the height and width but keeping the depth intact.
▶ Pooling layers downsample each feature map independently.
▶ The purpose of pooling is to be able to perform convolutions at different scales.
▶ Pooling reduces the computational complexity of the model and helps combat overfitting.


Pooling Hyperparameters

▶ The pooling operation is characterized by 3 hyperparameters:
- The pooling window size.
- The pooling stride, which specifies how much we move the pooling window at each step.
- The resizing operation; in practice, we often use max pooling, which consists of choosing the largest value in each region. An alternative is average pooling.
▶ Typically, max pooling with a 2×2 window and stride 2 is used; this shrinks each side of the feature map by a factor of 2, and its area by a factor of 4. A sketch follows below.
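A minimal NumPy sketch of 2×2 max pooling with stride 2 (assumes the feature map has even height and width):

```python
import numpy as np

def max_pool_2x2(fm):
    # Group pixels into non-overlapping 2x2 blocks, then keep each block's maximum.
    h, w = fm.shape
    return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fm))   # -> [[ 5.  7.]
                          #     [13. 15.]]
```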


Example of Deep CNN: AlexNet

AlexNet (2012): one of the first deep CNNs to achieve considerable accuracy on the 2012 ILSVRC challenge, with a top-5 accuracy of 84.7% compared to 73.8% for the second-best entry.
ILSVRC: ImageNet Large Scale Visual Recognition Challenge
