Lecture 3 Updated
Convolutional networks

Why Convolutional Networks?
The same pattern, such as a bird's "beak", can appear in different places in an image. Rather than training many separate "small" detectors, one for each position (an "upper-left beak" detector, a "middle beak" detector, and so on), we can share one detector and let it "move around" the image. The shared detectors can be compressed into far fewer parameters.
Convolutional Networks
• Convolutional networks are also known as convolutional neural networks or CNNs.
• The name "convolutional neural network" indicates that the network employs a mathematical operation called convolution.
• Convolution is a specialized kind of linear operation.
• Convolutional networks are neural networks that use convolution instead of general matrix multiplication in at least one layer.
• A convolutional layer has a number of filters, each of which performs a convolution operation.
Convolutional Networks
• The main idea of CNNs is to use kernels or filters
Convolution Kernels
• A kernel is a small 2D matrix whose contents depend on the operation to be performed.
• A kernel slides over the input image; at each position, elementwise multiplication followed by summation produces one output value. The output is of lower dimensions and therefore easier to work with.
• For input images with 3 or more channels, such as RGB, a filter is applied.
• Filters are one dimension higher than kernels and can be seen as multiple kernels stacked on top of each other, where each kernel handles a particular channel.
A Convolution Operation
Input: a 5 x 5 grey-scale image, with first rows
21 19 17 25 28
71 76 73 68 59
A Convolution Operation
A 3 x 3 kernel is slid over the 5 x 5 image:
(n x n) * (f x f) = (n-f+1) x (n-f+1), so the output is 3 x 3.

21*(-1) + 71*(-1) + 153*(-1) + 19*(-1) + 76*8 + 164*(-1) + 17*(-1) + 73*(-1) + 164*(-1) = -74
19*(-1) + 76*(-1) + 164*(-1) + 17*(-1) + 73*8 + 164*(-1) + 25*(-1) + 68*(-1) + 157*(-1) = -106

➢ The convolution operation is responsible for detecting edges and features in images.
The kernel above (all -1 with an 8 at the centre) is an example used for sharpening an image (enhancing the depth of edges) and for edge detection.
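To check the arithmetic above, here is a minimal NumPy sketch. The three rows of pixel values are exactly the ones used in the two computations above, and the kernel is the all -1 / centre-8 edge-detection kernel:

```python
import numpy as np

# First three rows (first four columns) of the 5 x 5 grey-scale image above
patch = np.array([[ 21,  19,  17,  25],
                  [ 71,  76,  73,  68],
                  [153, 164, 164, 157]])

# Edge-detection kernel: -1 everywhere, 8 in the centre
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Slide the 3 x 3 kernel across the patch: elementwise multiply, then sum
for col in range(patch.shape[1] - 2):
    window = patch[:, col:col + 3]
    print((window * kernel).sum())   # prints -74, then -106
```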
Convolution with two filters
These are the network parameters to be learned.

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Each filter detects a small pattern (3 x 3).
Filter 1 (stride = 1):
 1 -1 -1
-1  1 -1
-1 -1  1

Sliding Filter 1 over the 6 x 6 image one pixel at a time: the dot product with the top-left 3 x 3 patch is 3; moving one pixel to the right gives -1.

Stride is a parameter of the convolution operation that refers to the number of pixels by which the filter matrix moves across the input matrix.
If stride = 2, Filter 1 jumps two pixels at a time over the 6 x 6 image: the first two outputs along the top row are 3 and -3, and fewer positions are visited, so the output shrinks to 2 x 2.
Back at stride = 1, applying Filter 1 at every position of the 6 x 6 image gives a 4 x 4 output:

 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1
Repeat this for each filter (stride = 1). Each filter produces its own 4 x 4 feature map:

Feature map from Filter 1:    Feature map from Filter 2:
 3 -1 -3 -1                   -1 -1 -1 -1
-3  1  0 -3                   -1 -1 -2  1
-3 -3  0  1                   -1 -1 -2  1
 3 -2 -2 -1                   -1  0 -4  3

The two 4 x 4 images form a 4 x 4 x 2 matrix.
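These feature maps can be reproduced with a short sketch. Note that the "convolution" used in CNNs is actually cross-correlation (the kernel is not flipped), hence scipy's correlate2d:

```python
import numpy as np
from scipy.signal import correlate2d

# The 6 x 6 binary image and the two 3 x 3 filters from the slides
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])

filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

filter2 = np.array([[-1,  1, -1],
                    [-1,  1, -1],
                    [-1,  1, -1]])

# 'valid' mode, stride 1: output is (6-3+1) x (6-3+1) = 4 x 4 per filter
print(correlate2d(image, filter1, mode='valid'))
print(correlate2d(image, filter2, mode='valid'))
```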
Color image: RGB 3 channels
(n x n x 3) * (f x f x 3) = (n-f+1) x (n-f+1) x 1
Padding in Convolutional Neural Networks
• Padding is the process of adding additional layers of pixels around the border of an image.
• When we perform a convolution operation, we slide a filter across the image. If we only slide the filter across the original pixels of the image, the resulting output will be smaller than the input image, i.e. (n x n) * (f x f) = (n-f+1) x (n-f+1).
• In some cases it is beneficial to keep the spatial dimensions (i.e., the width and the height) of the output the same as the input.
• By adding a layer of zeros around the border of the image, we can apply the filter at more positions, preserving the spatial dimensions of the output.
Padding in Convolutional Neural Networks
There are two main types of padding:
1. Valid Padding (no padding): the filter is applied only at valid positions inside the image, never going beyond the border. This results in smaller output dimensions.
2. Same Padding: the image is padded with enough zeros around the border so that the output dimensions after the convolution operation are the same as the input dimensions.

(n x n) → (n+2p) x (n+2p), where p is the number of padding pixels.

With stride s, the output size is
(n x n) * (f x f) = (⌊(n-f)/s⌋ + 1) x (⌊(n-f)/s⌋ + 1)
Floor function ⌊·⌋ – greatest integer which is less than or EQUAL TO the given number
n x n – image size
f x f – filter size
s – stride length
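The output-size formula translates directly into code. A minimal helper (the function name is illustrative), checked against the examples above:

```python
import math

def conv_output_size(n: int, f: int, s: int = 1, p: int = 0) -> int:
    """Output side length for an n x n input, f x f filter, stride s, padding p."""
    return math.floor((n - f + 2 * p) / s) + 1

print(conv_output_size(6, 3))        # 4  (valid padding, stride 1)
print(conv_output_size(6, 3, s=2))   # 2  (stride 2, as in the earlier slide)
print(conv_output_size(6, 3, p=1))   # 6  (same padding: p = (f-1)/2 for odd f)
```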
Stride in Convolutional Neural Networks
The stride’s value impacts the model in a few ways:
1. Dimensionality Reduction: Larger stride values result in smaller output dimensions, effectively
performing a form of dimensionality reduction.
2. Computation Speed: Larger strides can also speed up computation, as the filter needs to be
applied fewer times.
3. Model Capacity: However, larger strides may result in the model losing some detailed information
because the filter doesn’t cover every single pixel; it’s skipping over some. This could potentially
reduce model accuracy, particularly in tasks that require capturing fine-grained details.
The whole CNN
A typical CNN pipeline: Convolution → Max Pooling → Convolution → Max Pooling (this pair can repeat many times) → Flattened → Fully Connected Feedforward network → output (e.g. cat, dog, ...).
Pooling layer
• In practice, (max) pooling layers are placed after convolutional
layers in a CNN.
• After a convolutional layer extracts features from the input image,
the max pooling layer reduces the spatial size of the convolved
feature map, keeping only the most salient information.
• This process is repeated for multiple convolutional and pooling
layers, allowing the network to learn a hierarchy of features at
various levels of abstraction.
Why Pooling
1. Pooling layers are used to reduce the dimensions of the feature maps, which reduces the number of parameters to learn and the amount of computation performed in the network.
The 6 x 6 image is convolved and then max-pooled, producing a new image, but smaller:

Filter 1 channel:   Filter 2 channel:
3 0                 -1 1
3 1                  0 3

Each filter is a channel, so the result is a 2 x 2 image with 2 channels.
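A minimal sketch of 2 x 2 max pooling with stride 2, applied to the Filter 1 feature map from the earlier slides:

```python
import numpy as np

# 4 x 4 feature map produced by Filter 1 above
fmap = np.array([[ 3, -1, -3, -1],
                 [-3,  1,  0, -3],
                 [-3, -3,  0,  1],
                 [ 3, -2, -2, -1]])

# 2 x 2 max pooling, stride 2: split into 2 x 2 blocks and take each block's max
pooled = fmap.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[3 0]
                #  [3 1]]
```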
Why Pooling
2. Enhances Features
• Types of Pooling:
1. Max Pooling
• Max pooling is a pooling operation that selects the maximum element from the region of the feature map covered by the filter. Thus, the output after the max-pooling layer is a feature map containing the most prominent features of the previous feature map.
Filter 1:           Filter 2:
 1 -1 -1            -1  1 -1
-1  1 -1            -1  1 -1
-1 -1  1            -1  1 -1

Feature maps (4 x 4):
 3 -1 -3 -1         -1 -1 -1 -1
-3  1  0 -3         -1 -1 -2  1
-3 -3  0  1         -1 -1 -2  1
 3 -2 -2 -1         -1  0 -4  3

After 2 x 2 max pooling:
3 0                 -1  1
3 1                  0  3
2. Average Pooling
• Average pooling computes the average of the elements in the region of the feature map covered by the filter. Thus, while max pooling gives the most prominent feature in a particular patch of the feature map, average pooling gives the average of the features present in that patch.
3. Global Pooling
• Global pooling reduces each channel in the feature map to a single value. Thus, an nh x nw x nc feature map is reduced to a 1 x 1 x nc feature map.
• This is equivalent to using a filter of dimensions nh x nw, i.e. the dimensions of the feature map.
• Further, it can be either global max pooling or global average pooling.
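A minimal NumPy sketch of global pooling (the feature-map shape here is an arbitrary example):

```python
import numpy as np

# A feature map of shape (nh, nw, nc) = (4, 4, 2)
fmaps = np.random.randn(4, 4, 2)

gap = fmaps.mean(axis=(0, 1))   # global average pooling -> one value per channel
gmp = fmaps.max(axis=(0, 1))    # global max pooling     -> one value per channel
print(gap.shape, gmp.shape)     # (2,) (2,), i.e. 1 x 1 x nc
```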
Flattening
The pooled output (two 2 x 2 channels):

3 0     -1 1
3 1      0 3

is flattened into a single 8-dimensional vector, which is fed into a fully connected feedforward network.
Fully connected (FC) layers (Dense layer)
• In a fully connected layer, each neuron (node) is connected to every neuron of the previous layer.
• FC layers are typically found towards the end of a neural network architecture and are responsible for producing the final output predictions.
Fully connected (FC) layers (Dense layer)
• Key Features:
• In CNNs, FC layers often follow the convolutional and pooling layers. They are
used to flatten the 2D spatial structure of the data into a 1D vector and process it
for tasks like classification.
• The weights and biases in FC layers are learned during the training process,
making them adapt to the specific problem at hand.
• The number of neurons in the final FC layer usually matches the number of
output classes in a classification problem. For instance, in a 10-class digit
classification problem, there would be 10 neurons in the final FC layer, each
outputting a score for one class.
Convolution vs. Fully Connected

Convolution: the 6 x 6 image is convolved directly with the 3 x 3 filters (Filter 1 and Filter 2 above), so each output value depends only on a small 3 x 3 patch, and the filter weights are shared across positions.

Fully connected: the 6 x 6 image is first flattened into a 36-dimensional vector (x1, x2, ..., x36), and every input xi is connected to every neuron of the layer.
• The flattened array is used as input to the fully connected layer.
• Every neuron of the layer is connected to all the neurons in the previous layer and the next layer; hence the name "Fully Connected Layer".
• The final FC layer (output layer) has as many neurons as there are class labels.
• In the output layer, a softmax activation is used to classify the image; for binary classification, a sigmoid activation is used, as in the sketch below.
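A minimal Keras sketch of these two output-layer choices (layer creation only; the sizes follow the text above):

```python
from tensorflow.keras.layers import Dense

# 10-class classification: 10 output neurons with softmax over the scores
multiclass_output = Dense(10, activation='softmax')

# Binary classification: a single output neuron with sigmoid activation
binary_output = Dense(1, activation='sigmoid')
```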
The whole CNN
Convolution followed by Max Pooling produces a new image that is smaller than the original; the number of filters determines the number of channels. The Convolution + Max Pooling pair can repeat many times before the result is flattened.
CNN in Keras
Input_shape = (28, 28, 1): 28 x 28 pixels, 1 channel (1: black/white, 3: RGB).
The input goes through a convolutional layer with 25 filters of size 3 x 3, followed by a max pooling layer.
CNN in Keras
Input: 28 x 28 x 1
Convolution → Max Pooling → Convolution → 11 x 11 x 50
How many parameters for each filter in the second convolution? 25 x 9 = 225.
Max Pooling → 5 x 5 x 50
CNN in Keras
Input: 28 x 28 x 1
Convolution → 26 x 26 x 25
Max Pooling → 13 x 13 x 25
Convolution → 11 x 11 x 50
Max Pooling → 5 x 5 x 50
Flattened → 1250
Fully connected feedforward network → Output
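A Keras sketch of this architecture. The convolution and pooling sizes follow the slide; the hidden Dense width of 100 and the activations are illustrative assumptions:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

model = Sequential([
    Input(shape=(28, 28, 1)),
    Conv2D(25, (3, 3), activation='relu'),  # -> 26 x 26 x 25
    MaxPooling2D((2, 2)),                   # -> 13 x 13 x 25
    Conv2D(50, (3, 3), activation='relu'),  # -> 11 x 11 x 50
    MaxPooling2D((2, 2)),                   # -> 5 x 5 x 50
    Flatten(),                              # -> 1250
    Dense(100, activation='relu'),          # hidden FC layer (width assumed)
    Dense(10, activation='softmax'),        # 10-class output
])
model.summary()
```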
Image classification using CNN

Speech recognition using CNN: the spectrogram of the audio (an image with time along one axis) is treated as an image.

Text classification using CNN
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
The popular CNN
• LeNet, 1998
• AlexNet, 2012
• VGGNet, 2014
• ResNet, 2015
VGGNet
• 16 layers
• Only 3 x 3 convolutions
• 138 million parameters

ResNet
• 152 layers
• ResNet50
Computational complexity
• The memory bottleneck: a GPU typically has only a few GB of memory.
Question - 1
The CNN architecture has 3 convolutional layers, a flattened layer and 3 fully connected layers. 32, 64 and 128 filters of size 3 x 3 are used in the 1st, 2nd and 3rd convolutional layers. The pooling layers have a 2 x 2 filter. 1024, 512 and 10 neurons are adopted in the fully connected layers. This CNN architecture is used to classify an RGB image of size 256 x 256. Unless specified, assume no padding and stride 1 where appropriate.

(n x n) * (f x f) = (⌊(n-f+2P)/s⌋ + 1) x (⌊(n-f+2P)/s⌋ + 1)
Floor function ⌊·⌋ – greatest integer which is less than or EQUAL TO the given number
n x n – image size; f x f – filter size; s – stride length; P – padding
Layer     | Activation Volume Dimensions | Number of parameters
Input     | 256 x 256 x 3   | 0
CONV3-32  | 254 x 254 x 32  | (3 x 3 x 3 + 1) x 32 = 896
ReLU      | 254 x 254 x 32  | 0
POOL-2    | 127 x 127 x 32  | 0
CONV3-64  | 125 x 125 x 64  | (3 x 3 x 32 + 1) x 64 = 18,496
ReLU      | 125 x 125 x 64  | 0
POOL-2    | 62 x 62 x 64    | 0
CONV3-128 | 60 x 60 x 128   | (3 x 3 x 64 + 1) x 128 = 73,856
ReLU      | 60 x 60 x 128   | 0
POOL-2    | 30 x 30 x 128   | 0
FLATTEN   | 115200          | 0
FC-1024   | 1024            | (115200 + 1) x 1024 = 117,965,824
FC-512    | 512             | (1024 + 1) x 512 = 524,800
FC-10     | 10              | (512 + 1) x 10 = 5,130

The "+1" in each count is the bias term for each filter or neuron.

Adding up all parameters: 896 + 18,496 + 73,856 + 117,965,824 + 524,800 + 5,130 = 118,589,002. This is the total number of weights and biases in the model.
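These numbers can be reproduced with a short Keras sketch of the Question 1 architecture (activation functions are assumed; they do not affect the parameter counts):

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

model = Sequential([
    Input(shape=(256, 256, 3)),
    Conv2D(32, 3, activation='relu'),   # 254 x 254 x 32,  896 params
    MaxPooling2D(2),                    # 127 x 127 x 32
    Conv2D(64, 3, activation='relu'),   # 125 x 125 x 64,  18,496 params
    MaxPooling2D(2),                    # 62 x 62 x 64
    Conv2D(128, 3, activation='relu'),  # 60 x 60 x 128,   73,856 params
    MaxPooling2D(2),                    # 30 x 30 x 128
    Flatten(),                          # 115200
    Dense(1024, activation='relu'),     # 117,965,824 params
    Dense(512, activation='relu'),      # 524,800 params
    Dense(10, activation='softmax'),    # 5,130 params
])
model.summary()   # Total params: 118,589,002
```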
Question - 2
Consider the convolutional neural network defined by the layers in the left column below. Fill in the
shape of the output volume and the number of parameters at each layer. You can write the
activation shapes in the format (H, W, C), where H, W, C are the height, width and channel
dimensions, respectively. Unless specified, assume padding 1 and stride 1 where appropriate.
Layer     | Activation Volume Dimensions | Number of parameters
Input     | 32 x 32 x 3  | 0
CONV3-8   | 32 x 32 x 8  | (3 x 3 x 3 + 1) x 8 = 224
ReLU      | 32 x 32 x 8  | 0
POOL-2    | 16 x 16 x 8  | 0
BATCHNORM | 16 x 16 x 8  | 8 scales (γ) + 8 shifts (β) + 8 means + 8 variances = 32
CONV3-16  | 16 x 16 x 16 | (3 x 3 x 8 + 1) x 16 = 1,168
ReLU      | 16 x 16 x 16 | 0
POOL-2    | 8 x 8 x 16   | 0
FLATTEN   | 1024         | 0
FC-10     | 10           | (1024 + 1) x 10 = 10,250

Adding up all parameters: 224 + 32 + 1,168 + 10,250 = 11,674. This is the total number of weights and biases in the model.
Question - 3
Assume padding 1 and stride 2 where appropriate.

Layer          | Activation Volume Dimensions | Number of parameters
Input          | 32 x 32 x 3 | 0
CONV3-8        | 16 x 16 x 8 | (3 x 3 x 3 + 1) x 8 = 224
ReLU           | 16 x 16 x 8 | 0
POOL-2 (s = 4) | 4 x 4 x 8   | 0
BATCHNORM      | 4 x 4 x 8   | 8 scales (γ) + 8 shifts (β) + 8 means + 8 variances = 32
CONV3-16       | 2 x 2 x 16  | (3 x 3 x 8 + 1) x 16 = 1,168
ReLU           | 2 x 2 x 16  | 0
POOL-2 (s = 4) | 1 x 1 x 16  | 0
FLATTEN        | 16          | 0
FC-512         | 512         | (16 + 1) x 512 = 8,704
FC-256         | 256         | (512 + 1) x 256 = 131,328
FC-10          | 10          | (256 + 1) x 10 = 2,570

Total parameters calculation:
• Convolutional + BatchNorm layers: 224 + 32 + 1,168 = 1,424
• Fully connected layers: 8,704 + 131,328 + 2,570 = 142,602
• Total parameters: 144,026
• Trainable parameters: 144,010
• Non-trainable parameters: 16 (the batch-normalization means and variances)
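Similarly, a Keras sketch of the Question 3 architecture; padding='same' with stride 2 reproduces p = 1 for these 3 x 3 convolutions, and activations are assumed:

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Flatten, MaxPooling2D)

model = Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(8, 3, strides=2, padding='same', activation='relu'),   # 16 x 16 x 8
    MaxPooling2D(pool_size=2, strides=4),                         # 4 x 4 x 8
    BatchNormalization(),                                         # 32 params, 16 non-trainable
    Conv2D(16, 3, strides=2, padding='same', activation='relu'),  # 2 x 2 x 16
    MaxPooling2D(pool_size=2, strides=4),                         # 1 x 1 x 16
    Flatten(),                                                    # 16
    Dense(512, activation='relu'),                                # 8,704 params
    Dense(256, activation='relu'),                                # 131,328 params
    Dense(10, activation='softmax'),                              # 2,570 params
])
model.summary()   # Total: 144,026 | trainable: 144,010 | non-trainable: 16
```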
References
1. Goodfellow, I., Bengio, Y., and Courville, A., "Deep Learning," MIT Press, 2016.
2. Slides: 6.S191, Dana Erlich, Param Vir Singh, David Gifford, Alexander Amini, Ava Soleimany.
3. https://d2l.ai/chapter_convolutional-neural-networks/index.html
Thank you!