Convolutional Neural Network
Dense neural network and Convolutional neural network
Grayscale vs. color image
Convolutional kernel

Filter 1 (3 x 3):
 1 -1 -1
-1  1 -1
-1 -1  1

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Convolution with stride = 1: the filter slides over the image one pixel at a time; the dot product at the first position is 3, and at the next position it is -1.

Convolution with stride = 2: the filter moves two pixels at a time; the first two dot products are 3 and -3.

Importance of stride:
• Less overlap between the image pixels and the filter mask
• Smaller output volume
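As an illustration (not part of the original slides), here is a minimal NumPy sketch that slides the 3 x 3 filter over the 6 x 6 image with stride 1 and stride 2 and reproduces the dot products 3, -1 and 3, -3 shown above.

import numpy as np

# 6 x 6 image and Filter 1 from the example above
image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter1 = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def conv2d(img, kernel, stride=1):
    """Valid convolution as used in CNNs (kernel is not flipped)."""
    k = kernel.shape[0]
    out_size = (img.shape[0] - k) // stride + 1
    out = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            patch = img[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * kernel)
    return out

print(conv2d(image, filter1, stride=1))  # 4 x 4 output; first row starts 3, -1, ...
print(conv2d(image, filter1, stride=2))  # 2 x 2 output; first row is 3, -3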
A convolutional layer
A CNN is a neural network with some convolutional layers (and some other layers).
A convolutional layer has a number of filters that perform the convolution operation.
Convolutional kernel
A filter (kernel): these are the network parameters to be learned.

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1

…

Each filter is convolved with the 6 x 6 image and detects a small pattern (3 x 3).
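A minimal PyTorch sketch of such a layer (an illustration, not from the slides): a Conv2d layer holds its filters as learnable parameters; here the two filters above are written into the weights by hand so the layer produces one feature map per filter.

import torch
import torch.nn as nn

# A convolutional layer with two 3 x 3 filters; the filter weights are the
# parameters training would learn (set by hand here for illustration).
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3, stride=1, bias=False)

filter1 = torch.tensor([[ 1., -1., -1.],
                        [-1.,  1., -1.],
                        [-1., -1.,  1.]])
filter2 = torch.tensor([[-1.,  1., -1.],
                        [-1.,  1., -1.],
                        [-1.,  1., -1.]])
with torch.no_grad():
    conv.weight[0, 0] = filter1
    conv.weight[1, 0] = filter2

image = torch.tensor([[1, 0, 0, 0, 0, 1],
                      [0, 1, 0, 0, 1, 0],
                      [0, 0, 1, 1, 0, 0],
                      [1, 0, 0, 0, 1, 0],
                      [0, 1, 0, 0, 1, 0],
                      [0, 0, 1, 0, 1, 0]], dtype=torch.float32)

# Input shape (batch, channels, height, width) -> two 4 x 4 feature maps, one per filter.
feature_maps = conv(image.unsqueeze(0).unsqueeze(0))
print(feature_maps.shape)  # torch.Size([1, 2, 4, 4])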
Consider learning an image:
• Some patterns are much smaller than the whole image,
e.g. a "beak" detector only needs to look at a small region of the image.
• In the filters, 1 = edge and -1 = not edge.
Rectified linear unit (ReLU)
ReLU is applied to the feature maps after convolution, replacing negative values with zero: f(x) = max(0, x).
Pooling:
(Figure: the 6 x 6 image is convolved with Filter 1 and Filter 2; the final image after the set of convolution operations is flattened into a vector x1, x2, …, x36 and fed to a fully-connected layer.)
Fully-connected
Softmax unit:
The softmax function is often used as the last activation function of a neural network to
normalize the output of the network to a probability distribution over the predicted output classes.
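A minimal NumPy sketch of the softmax function (an illustration using hypothetical class scores, not from the slides):

import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1.0, 2.0, 0.5, 4.0])   # hypothetical class scores
probs = softmax(logits)
print(probs, probs.sum())                  # a probability distribution; sums to 1
print(np.argmax(probs))                    # 3: the class that will be activated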
Final block diagram of CNN
(Figure: the softmax layer outputs one score per class, e.g. 0.001, 0.11, 0.12, 0.101, 0.80, 0.13, 0.113, 0.14, 0.15, 0.12; the class with the highest score, here class 4 with 0.80, will be activated.)
To be covered:
Sparsely connected image matrix
Padding
In a convolutional layer, we observe that the pixels located on the corners
and the edges are used much less than those in the middle.
•
Example:
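A minimal NumPy sketch of zero padding (an illustration, not from the slides): adding one ring of zeros around a 6 x 6 image gives an 8 x 8 input, so a 3 x 3 filter with stride 1 produces a 6 x 6 output and the corner pixels now appear in more filter positions.

import numpy as np

image = np.ones((6, 6))                     # any 6 x 6 input
padded = np.pad(image, pad_width=1)         # one ring of zeros -> 8 x 8

kernel_size, stride = 3, 1
out_valid = (image.shape[0] - kernel_size) // stride + 1    # 4: without padding
out_same = (padded.shape[0] - kernel_size) // stride + 1    # 6: with padding
print(padded.shape, out_valid, out_same)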
Dilated convolution
• Another way to get a larger output size is to spread out the input image by inserting
padding both around and between the input elements.
• This is called dilated convolution.
• Let's say we have a 3 x 3 input image.
• Rather than using the 3 x 3 image as a whole, we split the image into individual pixels and
add padding between the pixels as well as along the boundaries.
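A minimal NumPy sketch of the idea described above (an illustration, not from the slides): the 3 x 3 input is spread out by inserting zeros between its pixels and along the boundaries before the convolution is applied.

import numpy as np

x = np.arange(1, 10).reshape(3, 3)       # a 3 x 3 input image

# Insert one zero between neighbouring pixels (spread out the input) ...
spread = np.zeros((5, 5), dtype=x.dtype)
spread[::2, ::2] = x

# ... and add zero padding along the boundaries as well.
spread = np.pad(spread, pad_width=1)     # 7 x 7 input for the convolution
print(spread)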
Convolution algorithms:
1. Strided convolution
2. Padded convolution
3. Transposed convolution
4. Dilated convolution
Parameters
• Besides these, other parameters such as the learning rate, loss function, batch
size, and initial weights also have to be chosen according to the problem.
Other Architectural considerations
Efficient Architectures in
Neural Networks
Major Architectures
• All Convolutional Net:
no pooling layers, just use strided convolution to shrink representation size
• Inception:
complicated architecture designed to achieve high accuracy with low computational
cost
• ResNet:
blocks of layers with the same spatial size, with each layer's output added to the same
buffer that is repeatedly updated. Very many updates = a very deep net, but without
vanishing gradients (see the sketch below).
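A minimal PyTorch sketch of a ResNet-style block (an illustration with assumed layer sizes, not from the slides): the block's input is added back to its output, so the same "buffer" is repeatedly updated by small corrections while the spatial size stays unchanged.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3 x 3 convolutions whose output is added to the block's input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)   # skip connection: the input is added back ("same buffer")

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32]): spatial size unchanged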
Convolution and Pooling as an
Infinitely Strong Prior
Prior Parameter Distribution
• What is a prior?
A probability distribution over the parameters of the model that encodes our belief about what
models are reasonable.
• What is a weight prior?
Assumptions about the weights (before learning) in terms of acceptable values and range are
encoded into the prior distribution of the weights.
These assumptions are based on the types of operations performed by convolution and pooling
layers, which impose specific characteristics or expectations on the data.
• Prior parameter distribution
The role of a prior probability distribution over the parameters of a model is to
encode our belief as to what models are reasonable before seeing the data.
For example:
• The "prior" enforced by convolution is
Important patterns or features are likely to be found in local regions of the input data.
• The "prior" enforced by pooling is that
Reducing the spatial dimensions of the data while preserving important information can
be beneficial for recognition tasks.
• Convolution and Pooling as an Infinitely Strong Prior means that
These operations provide a very strong and effective set of assumptions or constraints
that guide the neural network's learning process and make it better at tasks like image
recognition.
Weak and Strong Priors
• A weak prior
• A distribution with high entropy
• e.g., Gaussian with high variance
• A weak prior has a high variance and shows that there is low confidence in the initial
value of the weight.
• Data can move parameters freely
• A strong prior
• It has very low entropy
• E.g., a Gaussian with low variance
• A strong prior in turn shows a narrow range of values about which we are confident
before learning begins.
• Such a prior plays a more active role in determining where the parameters end up
Infinitely strong prior
• An infinitely strong prior places zero probability on some parameters
• It says that some parameter values are forbidden regardless of support from data
• With an infinitely strong prior, the constraint cannot be changed, irrespective of the data.
Convolutional Network
• Convolutional networks are simply neural
networks that use convolution in place of
general matrix multiplication in at least
one of their layers.
Convolution as infinitely strong prior
• A convolutional net is similar to a fully connected net but with an infinitely strong prior over its weights.
• This prior says that the weights for one hidden unit must be identical to the weights of its neighbour, but
shifted in space.
• The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive
field assigned to that hidden unit (see the sketch after this list).
• Convolution introduces an infinitely strong prior probability distribution over the parameters of a layer
• This prior says that the function the layer should learn contains only local interactions and is
equivariant to translation
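A minimal NumPy sketch of this prior (an illustration with a hypothetical 1-D kernel, not from the slides): writing a 1-D convolution as a fully connected weight matrix makes the constraints visible; every row contains the same kernel, shifted in space, with zeros outside the receptive field.

import numpy as np

kernel = np.array([1.0, -2.0, 1.0])      # hypothetical 3-tap kernel
n_in = 8
n_out = n_in - len(kernel) + 1           # valid convolution output size

# Equivalent fully connected weight matrix: identical weights shifted per row,
# and zero outside each unit's small receptive field.
W = np.zeros((n_out, n_in))
for i in range(n_out):
    W[i, i:i + len(kernel)] = kernel

x = np.arange(n_in, dtype=float)
conv_out = np.array([np.dot(kernel, x[i:i + len(kernel)]) for i in range(n_out)])
print(np.allclose(W @ x, conv_out))      # True: the dense layer under this prior equals the convolution
print(W)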
Convolution as infinitely strong prior
• In CNNs, convolution involves sliding a small filter (a matrix of weights) over the
input data (e.g., an image).
• At each step, the filter multiplies its values with the corresponding values in the
input and then sums them up.
• This helps detect patterns or features in different parts of the input.
• The key idea is that it enforces locality – meaning it looks for patterns in small,
nearby regions of the input.
• Example:
Imagine you want to detect edges in a black-and-white image.
Convolutional layers will help the network focus on local patterns like edges,
corners, or textures by sliding a small filter over the image to detect these
features.
Pooling as infinitely strong prior
• After convolution, pooling is often applied.
• Pooling reduces the spatial dimensions of the data by selecting the most important
information from a group of neighboring values.
• Max pooling, for instance, takes the maximum value from a group of values, which
helps preserve the most significant features while reducing the amount of data.
• Example:
Suppose you have an image with a cat, and you want to recognize it.
After convolution, pooling helps focus on the most important parts of the image
like the cat's ears, eyes, and nose, while reducing less important details.
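A minimal NumPy sketch of 2 x 2 max pooling (an illustration, not from the slides), applied to the 4 x 4 feature map that Filter 1 yields on the 6 x 6 image from earlier in this section:

import numpy as np

def max_pool_2x2(x):
    """2 x 2 max pooling with stride 2: keep the largest value in each block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# 4 x 4 feature map produced by Filter 1 on the 6 x 6 image
feature_map = np.array([
    [ 3, -1, -3, -1],
    [-3,  1,  0, -3],
    [-3, -3,  0,  1],
    [ 3, -2, -2, -1],
])
print(max_pool_2x2(feature_map))
# [[3 0]
#  [3 1]]  -> the strongest responses survive while the map shrinks to 2 x 2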
Why is this a "strong prior"?
• Locality:
Convolution enforces the idea that important features are found in local regions of the data.
This is a strong prior knowledge because, in many real-world scenarios, objects or patterns have
local characteristics.
For example, in an image, edges or textures are typically found in small regions.
• Hierarchy of Features:
By using multiple layers of convolution and pooling, CNNs build a hierarchy of features.
Early layers detect basic features like edges, while deeper layers combine them to detect more
complex patterns.
This hierarchy is a strong prior because it mimics how our brains perceive and recognize objects –
from simple features to complex objects.
Efficient Convolution Algorithms
• How to speed up convolution?
Parallel Computation Resources
Selecting Appropriate Algorithms
Fourier transform:
o Convert the input and the kernel into frequency space.
o Perform point-wise multiplication.
o Convert the result back to the time (spatial) domain using an inverse Fourier transform.
When a d-dimensional kernel can be expressed as the outer product of d vectors, the kernel is called separable.
o Composing d 1-D convolutions, one with each of these vectors, is significantly faster than performing one d-dimensional
convolution with their outer product.
o The naive approach requires O(w^d) runtime and parameter storage space; the separable approach
requires O(w × d) runtime and storage space (where w is the kernel width).
Even techniques that improve the efficiency of only forward propagation are useful because, in
commercial settings, it is typical to devote more resources to the deployment of a network than to its training.
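A minimal NumPy/SciPy sketch of the two ideas above (an illustration, not from the slides): FFT-based convolution matches direct convolution, and a separable 2-D kernel can be applied as two cheaper 1-D convolutions.

import numpy as np
from scipy.signal import convolve2d

image = np.random.rand(64, 64)

# 1) Fourier-transform approach: multiply in frequency space, then invert.
kernel = np.random.rand(5, 5)
out_direct = convolve2d(image, kernel, mode='full')
shape = out_direct.shape
out_fft = np.real(np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape)))
print(np.allclose(out_direct, out_fft))           # True

# 2) Separable kernel: the outer product of two 1-D vectors, so two 1-D
#    convolutions replace one 2-D convolution.
v, h = np.array([1.0, 2.0, 1.0]), np.array([1.0, 0.0, -1.0])
sep_kernel = np.outer(v, h)                       # a 3 x 3 separable kernel
out_2d = convolve2d(image, sep_kernel, mode='full')
out_sep = convolve2d(convolve2d(image, v[:, None], mode='full'), h[None, :], mode='full')
print(np.allclose(out_2d, out_sep))               # True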
Random or Unsupervised Features
• Typically, the most expensive part of conv network training is learning the features. There are 3
basic strategies for obtaining convolution kernels without supervised training.
1. Simply initialize convolutional kernels randomly:
Random filters often work well in convolutional networks; this is an inexpensive way to choose the
architecture of a convolutional network (a sketch follows below).
2. Design them by hand
3. Learn the kernels with an unsupervised method:
Learning the features with an unsupervised method allows them to be determined separately
from the classifier layer at the top of the architecture.
• An intermediate approach is greedy layer-wise pretraining, e.g. the Convolutional Deep Belief Network.
• Instead of training an entire convolutional layer at a time, we can train a model on a small patch and
use the parameters from this patch-based model to define the kernels of a convolutional layer.
• Today, most convolutional networks are trained in a purely supervised fashion, using full forward
and back-propagation through the entire network on each training iteration.
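A minimal PyTorch sketch of strategy 1 (an illustration with assumed sizes, not from the slides): randomly initialized convolutional kernels are kept frozen as the feature extractor, and only the classifier on top would be trained.

import torch
import torch.nn as nn

# Randomly initialized convolutional features, kept frozen (strategy 1);
# only the linear classifier on top is trainable.
features = nn.Sequential(nn.Conv2d(1, 8, 3), nn.ReLU(), nn.MaxPool2d(2))
for p in features.parameters():
    p.requires_grad = False        # the random kernels are never updated

classifier = nn.Linear(8 * 13 * 13, 10)            # for 28 x 28 inputs (MNIST-sized)
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.1)   # updates classifier only

x = torch.randn(4, 1, 28, 28)      # a dummy batch of images
h = features(x).flatten(1)         # fixed random features
logits = classifier(h)
print(logits.shape)                # torch.Size([4, 10])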
Popular Architectures in
Neural Networks