465-Lecture 5-6

The document discusses Convolutional Neural Networks (CNNs) and their application in image categorization by using spatial features and filters for feature extraction. It highlights the importance of convolution operations, pooling layers, and 1x1 convolutions in reducing dimensionality and computational complexity. Additionally, it mentions the training of CNNs through backpropagation and their applications in fields like self-driving cars and image captioning.

CSE 465

Lecture 5 & 6
CNN – Convolutional Neural Network
CNN in a nutshell
Images and videos are just matrices of numbers
What do we want to do?
How to categorize images?

• Use features, for example:
  • Rectangular shape, has bevels, there is a logo of Apple, white color
  • Long thick lines, hanging ropes, structure hanging over poles
  • Has a nose, has long ears, has eyes
Manual feature detection

• Must have domain knowledge
• Must have previous experience with best practices
• Define features
  • Reuse previously known features
  • Generate new features
• Use the features to classify
  • The classifier operates on the features
Manual feature detection: Problems

• Occlusion
  • Objects partially blocked
• Different illumination
  • Changes in the amount of light/brightness
• Scale variation/deformation
• Viewpoint variation
How can we use machine to learn features?
Implementation so far

• Use a fully connected feed-forward neural network with many layers
• Hopefully each layer will learn some important features
• And finally we will be able to represent the correct function
• However, there is no spatial information
• And many, many features
• Input
  • A flattened 1-D vector of the numbers representing the image
  • However, the important spatial information is gone!
What can we do?

• We need to use the spatial features
• How?
  • We can use filters to detect visual features like lines/segments, etc.
  • Do not use fully connected layers for the entire input
    • Image sizes can be more than 256×256 nowadays
    • If we use 1000 neurons for the first hidden layer, we need to learn around 200 million parameters for that layer alone
    • Anything bigger than that is impossible to fit inside a single computer's memory
  • Use an architecture that reduces the images into features
    • Learn the features first
    • Then use the features to classify/recognize
First: Use spatial feature
And a match made in heaven - convolution
What does it do?
In practice: same filter sliding window algorithm
Feature extraction with convolution

• Filter of size 4x4: 16 different weights
• Apply this same filter to 4x4 patches in the input
• Shift by 1 or 2 pixels (the stride) for the next patch
• Apply a set of weights – a filter – to extract local features
• Use multiple filters to extract different features
• Spatially share the parameters of each filter
Convolution Operation
Convolution operation (2)
Filter sliding
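The sliding-window operation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the mechanics (technically cross-correlation, which is what CNN libraries actually compute), not how real frameworks implement it — they use vectorized or FFT-based routines:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Slide a 2-D filter over a 2-D image: at each position,
    multiply the patch by the filter elementwise and sum."""
    if padding:
        image = np.pad(image, padding)
    kh, kw = kernel.shape
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A vertical-edge filter applied to a tiny 5x5 image whose left
# half is bright (10) and right half is dark (0).
img = np.array([[10, 10, 0, 0, 0]] * 5, dtype=float)
edge = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]], dtype=float)
print(conv2d(img, edge))  # strong response where the vertical edge sits
```

The filter responds strongly exactly where the brightness changes, which is how a single shared set of 9 weights detects the same local feature everywhere in the image.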
Padding

• Padding preserves the spatial size of the input image/volume
  • So the input and output width and height remain the same
• This is important for building deeper networks
  • Otherwise, the height/width would shrink as we go to deeper layers
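A quick sketch of what zero-padding does to the spatial size (illustrative values):

```python
import numpy as np

feature_map = np.arange(16, dtype=float).reshape(4, 4)

# Zero-pad by 1 pixel on every side: a 4x4 map becomes 6x6.
# A 3x3 filter sliding over the 6x6 padded map with stride 1
# then produces a 4x4 output — the same size as the input.
padded = np.pad(feature_map, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)               # (6, 6)
print((padded.shape[0] - 3) + 1)  # 4: output height matches the input
```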
Stride

• The number of pixels the filter slides over the image is called the stride
• For example, to slide the convolution filter one pixel at a time, the stride is 1
• To jump two pixels at a time, the stride is 2
• Strides of 3 or more are rare in practice
• Jumping pixels produces spatially smaller output volumes
• A stride of 1 keeps the output roughly the same height and width as the input, while a stride of 2 makes the output roughly half the input size
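The standard output-size formula makes the halving effect concrete. For input size n, filter size k, padding p, and stride s, the output size is floor((n + 2p − k) / s) + 1 (the 224 input size below is just an illustrative choice):

```python
def output_size(n, k, p, s):
    """Spatial output size of a conv layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# 224x224 input, 3x3 filter, padding 1 ("same" padding for a 3x3 filter)
print(output_size(224, 3, 1, 1))  # 224: stride 1 preserves the size
print(output_size(224, 3, 1, 2))  # 112: stride 2 roughly halves it
```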
Pooling

• The goal of the pooling layer is to downsample the feature maps produced by the convolutional layer, reducing the number of values passed on and thus the computational complexity
• Pooling filters have no weights or values to learn
  • They simply slide over the feature map created by the previous convolutional layer and select one pixel value to pass along to the next layer, ignoring the remaining values
Pooling

• Max pooling: selects the maximum of the values in the window
• Average pooling: selects the average of the values in the window
• Global average pooling: selects the average of all the pixels in the feature map
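The three variants above can be sketched directly; note that, unlike a convolution filter, none of them has weights to learn:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling: keep the largest value in each window, drop the rest."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size,
                          j*stride:j*stride+size].max()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 0],
                 [7, 2, 9, 8],
                 [4, 1, 3, 5]], dtype=float)
print(max_pool2d(fmap))  # 4x4 map reduced to 2x2
print(fmap.mean())       # global average pooling: one number per feature map
```

Average pooling is the same loop with `.mean()` in place of `.max()`.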
Convolution blocks
The complete network
Parameter Count for CNN

• Parameters
  • Values inside the filters
  • W matrices
  • b vectors
• Number of operations
  • Number of multiplications for the convolution operations
  • How this changes for different values of the padding or stride
  • Number of additions for the convolution operations
• Number of filters
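The parameter count of a conv layer follows directly from the filter shape: each filter has kh × kw × (input channels) weights plus one bias. A small sketch (the 3→64 channel layer is an illustrative example):

```python
def conv_params(kh, kw, in_channels, num_filters):
    """Weights per filter = kh*kw*in_channels; one bias per filter."""
    weights = kh * kw * in_channels * num_filters
    biases = num_filters
    return weights + biases

# A 3x3 conv layer mapping a 3-channel input to 64 filters:
print(conv_params(3, 3, 3, 64))  # 1792 = 3*3*3*64 + 64
```

Note the count is independent of the input's height and width — that is the parameter sharing that makes CNNs so much smaller than fully connected networks.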
1X1 Convolution
Why 1X1 Convolution

• The simplest explanation is that 1x1 convolution leads to dimension reduction
• For example, an image of 200 x 200 with 50 feature channels, convolved with 20 filters of 1x1, results in a volume of size 200 x 200 x 20
• Is this the best way to do dimensionality reduction in a convolutional neural network? What about efficacy vs. efficiency?
Why 1X1 Convolution

• Although 1x1 convolution is a 'feature pooling' technique, there is more to it than just sum pooling of features across the channels/feature maps of a given layer
• 1x1 convolution acts like a coordinate-dependent transformation in filter space
  • This transformation is strictly linear, but in most applications the 1x1 convolution is followed by a non-linear activation layer like ReLU
  • The transformation is learned through (stochastic) gradient descent
• An important distinction is that it suffers less from over-fitting due to the small kernel size (1x1)
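The 200 x 200 x 50 → 200 x 200 x 20 example from the slide can be reproduced in NumPy. A 1x1 convolution is just a per-pixel linear map across channels, so it reduces to one matrix multiply applied independently at every spatial position:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 200, 50))  # H x W x C feature volume
w = rng.normal(size=(50, 20))        # 20 filters of shape 1x1x50

# Every pixel's 50-channel vector is mapped to 20 channels;
# spatial positions are transformed independently of each other.
y = np.einsum("hwc,cd->hwd", x, w)
print(y.shape)        # (200, 200, 20): channels reduced from 50 to 20
y = np.maximum(y, 0)  # in practice followed by a non-linearity like ReLU
```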
1X1 Convolution
The feature extractor

[Architecture diagram: input image → Convolution → Max Pooling → Convolution → Max Pooling → Flattened → Fully Connected Feedforward network → class scores (cat, dog, …). The convolution and max-pooling blocks form the feature extractor; the fully connected layers perform the classification.]
A CNN compresses a fully connected network

• Reduces the number of connections
• Weights are shared across edges
• Max pooling further reduces the complexity
Training CNN

• Learn weights for convolutional filters and fully connected layers using
backpropagation and the log loss (cross-entropy loss) function
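The log loss (cross-entropy) that backpropagation minimizes can be sketched for a single example; the three-class logits below are illustrative values, not from any real network:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, true_class):
    """Log loss for one example: -log p(true class)."""
    p = softmax(logits)
    return -np.log(p[true_class])

logits = np.array([2.0, 0.5, -1.0])  # raw network outputs for 3 classes
print(cross_entropy(logits, 0))      # small loss: class 0 already favored
print(cross_entropy(logits, 2))      # large loss: class 2 is unlikely
```

Backpropagation pushes the filter weights and fully connected weights in the direction that shrinks this loss over the training set.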
CNN Application: Self driving cars/drones

• Convolution, then de-convolution
• Produces a pixel-level map of the scene
Self driving cars/Drones
Generate image caption/description

• Use a CNN to detect the image features
• Then, instead of using the fully connected network,
• Use an RNN to generate the description
