Module 3
CNN
Convolutional Networks
Convolution supports three important ideas that can help improve a machine
learning system:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
Sparse interactions
• Convolutional networks typically have sparse interactions (also referred to as sparse connectivity or sparse weights).
This is accomplished by making the kernel smaller than the input.
• For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect
small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need
to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency.
• It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large.
• If there are m inputs and n outputs, then matrix multiplication requires m × n parameters, and the algorithms used in practice have O(m × n) runtime (per example).
• If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n
parameters and O(k × n) runtime.
• For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k
several orders of magnitude smaller than m.
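As a rough numerical illustration (the sizes below are hypothetical, not from the text), comparing the parameter counts of a dense layer and a sparsely connected one:

```python
# Hypothetical sizes: m inputs, n outputs, k connections per output.
m, n, k = 1_000_000, 1_000_000, 100

dense_params = m * n    # fully connected: every output sees every input
sparse_params = k * n   # sparse: each output sees only k inputs

print(dense_params)     # 1000000000000  (10^12 parameters / operations)
print(sparse_params)    # 100000000      (10^8: four orders of magnitude fewer)
```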
Parameter sharing
• Parameter sharing refers to using the same parameter for more than one function in a model.
• In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer: it is multiplied by one element of the input and then never revisited.
• As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.
• The parameter sharing used by the convolution operation means that rather than learning a separate
set of parameters for every location, we learn only one set.
• This does not affect the runtime of forward propagation—it is still O(k × n)—but it does further
reduce the storage requirements of the model to k parameters.
• Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the
memory requirements and statistical efficiency.
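A minimal 1-D sketch of parameter sharing (the helper name and values are illustrative, not from the text): the same k = 3 kernel weights are reused at every output position, so the layer stores only 3 parameters no matter how long the input is.

```python
import numpy as np

def conv1d_shared(x, kernel):
    """Apply the same small kernel at every valid position of x."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

kernel = np.array([1.0, 0.0, -1.0])   # the layer's only 3 parameters
x = np.arange(10, dtype=float)        # 10 inputs -> 8 outputs, still just 3 weights
print(conv1d_shared(x, kernel))       # [-2. -2. -2. -2. -2. -2. -2. -2.]
```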
Equivariant representations
• In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called equivariance to translation.
• To say a function is equivariant means that if the input changes, the output
changes in the same way.
• Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)).
• In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g.
• If we move the object in the input, its representation will move the same amount
in the output. This is useful for when we know that some function of a small
number of neighboring pixels is useful when applied to multiple input locations.
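A quick NumPy check of f(g(x)) = g(f(x)) for convolution and translation. This is only a sketch: the circular (wrap-around) boundary is an assumption made so that shifting and convolving commute exactly.

```python
import numpy as np

def circular_corr(x, kernel):
    """Cross-correlation with wrap-around boundaries (output same length as input)."""
    n, k = len(x), len(kernel)
    return np.array([sum(x[(i + j) % n] * kernel[j] for j in range(k))
                     for i in range(n)])

def g(x, shift=2):
    """Translate (shift) the input."""
    return np.roll(x, shift)

x = np.array([0., 1., 3., 7., 2., 5., 4., 6.])
kernel = np.array([1., -2., 1.])

# f(g(x)) == g(f(x)): shifting then convolving equals convolving then shifting.
print(np.allclose(circular_corr(g(x), kernel), g(circular_corr(x, kernel))))  # True
```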
CNN - components
Example: a beak detector is a small filter.
Convolution
The filter values are the network parameters to be learned.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1 (3 x 3):
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2 (3 x 3):
-1  1 -1
-1  1 -1
-1  1 -1

Each filter detects a small pattern (3 x 3).
Convolution, stride = 1
Place Filter 1 on the top-left 3 x 3 patch of the 6 x 6 image and take the dot product: the result is 3. Slide the filter one pixel to the right and repeat: the next dot product is -1.
Convolution, stride = 2
If stride = 2, the filter jumps two pixels at a time, so the first two dot products along the top row of the 6 x 6 image are 3 and -3.
Convolution, stride = 1 (full output)
Sliding Filter 1 over every position of the 6 x 6 image with stride = 1 produces a 4 x 4 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
Convolution with Filter 2, stride = 1
Repeat this for each filter. Filter 2 produces a second 4 x 4 output:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3
Together the two 4 x 4 outputs form the feature map: two 4 x 4 images, i.e. a 2 x 4 x 4 matrix.
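The 4 x 4 outputs above can be reproduced with a short NumPy sketch: a sliding-window dot product (cross-correlation, which is what CNN "convolution" computes in practice), using the 6 x 6 image and the two filters from the slides. The helper below is illustrative, not part of the slides.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` and take a dot product at each position
    (cross-correlation: no kernel flip, no zero padding)."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])
filter2 = np.array([[-1,  1, -1],
                    [-1,  1, -1],
                    [-1,  1, -1]])

feature_map = np.stack([conv2d(image, filter1),   # the 4 x 4 map shown above
                        conv2d(image, filter2)])  # the second 4 x 4 map
print(feature_map.shape)                 # (2, 4, 4)
print(conv2d(image, filter1, stride=2))  # 2 x 2 output when stride = 2
```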
Color image: RGB 3 channels
A colour image is a stack of three 6 x 6 matrices, one per channel (R, G, B). Each filter is then a 3 x 3 x 3 tensor: a 3 x 3 slice for every input channel, and the dot product is taken over all three channels at once.
Convolution v.s. Fully Connected
The same 6 x 6 image can either be convolved with the 3 x 3 filters, or flattened into 36 inputs x1, x2, ..., x36 and fed to a fully connected layer. Convolution is therefore a fully connected layer with most connections removed: each output value is connected to only the 9 inputs covered by the filter, and those 9 weights are shared by every output position.
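As a sanity check of this comparison, a minimal NumPy sketch (assuming the 6 x 6 image and Filter 1 from the earlier slides) builds the 16 x 36 weight matrix of the fully connected view, in which each row contains only the 9 shared filter values, and verifies that it reproduces the 4 x 4 convolution output.

```python
import numpy as np

# 6 x 6 image and Filter 1 from the earlier slides.
image = np.array([[1, 0, 0, 0, 0, 1],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0, 0],
                  [1, 0, 0, 0, 1, 0],
                  [0, 1, 0, 0, 1, 0],
                  [0, 0, 1, 0, 1, 0]])
filter1 = np.array([[ 1, -1, -1],
                    [-1,  1, -1],
                    [-1, -1,  1]])

# Fully connected view: a 16 x 36 weight matrix whose row for output (i, j)
# holds the 9 filter values at the inputs covered by that 3 x 3 patch,
# and zeros everywhere else (sparse connections, shared weights).
W = np.zeros((16, 36))
for i in range(4):
    for j in range(4):
        for di in range(3):
            for dj in range(3):
                W[i * 4 + j, (i + di) * 6 + (j + dj)] = filter1[di, dj]

fc_output = (W @ image.flatten()).reshape(4, 4)
print(fc_output)   # identical to the 4 x 4 map obtained by sliding Filter 1
```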
Flattening
The small maps produced by max pooling (one 2 x 2 map per filter, e.g. [3 0; 3 1] and [-1 1; 0 3]) are flattened into a single vector, which is then fed into a fully connected feedforward network.
The whole CNN
image → Convolution → Max Pooling → ... (Convolution and Max Pooling can be repeated many times) → Flattening → Fully Connected Feedforward network → output (e.g. cat, dog, ...)
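A hedged sketch of this pipeline in PyTorch; the channel counts, the 32 x 32 input size, and the two-class (cat vs. dog) output are illustrative choices, not values from the slides.

```python
import torch
import torch.nn as nn

# Convolution -> Max Pooling, repeated, then Flatten -> Fully Connected network.
cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3),    # convolution
    nn.MaxPool2d(kernel_size=2),                                # max pooling
    nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3),   # repeat ...
    nn.MaxPool2d(kernel_size=2),                                # ... many times
    nn.Flatten(),                                               # flattening
    nn.Linear(16 * 6 * 6, 2),                                   # fully connected: e.g. cat vs dog
)

x = torch.randn(1, 3, 32, 32)   # one RGB image, 32 x 32 (illustrative size)
print(cnn(x).shape)             # torch.Size([1, 2])
```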
Max Pooling
Max pooling is applied to the two 4 x 4 maps produced by Filter 1 and Filter 2:

Filter 1 output:        Filter 2 output:
 3 -1 -3 -1             -1 -1 -1 -1
-3  1  0 -3             -1 -1 -2  1
-3 -3  0  1             -1 -1 -2  1
 3 -2 -2 -1             -1  0 -4  3
Within each non-overlapping 2 x 2 region, only the maximum value is kept. After convolution and max pooling, the 6 x 6 image becomes a new but smaller image: a 2 x 2 map per filter, and each filter is a channel.

Filter 1 channel:   Filter 2 channel:
3 0                 -1 1
3 1                  0 3
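A minimal NumPy sketch of 2 x 2 max pooling (non-overlapping regions, i.e. stride 2) applied to the two 4 x 4 outputs above; the helper name is illustrative.

```python
import numpy as np

def max_pool_2x2(fmap):
    """Keep the maximum of every non-overlapping 2 x 2 region."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

filter1_out = np.array([[ 3, -1, -3, -1],
                        [-3,  1,  0, -3],
                        [-3, -3,  0,  1],
                        [ 3, -2, -2, -1]])
filter2_out = np.array([[-1, -1, -1, -1],
                        [-1, -1, -2,  1],
                        [-1, -1, -2,  1],
                        [-1,  0, -4,  3]])

print(max_pool_2x2(filter1_out))   # [[3 0]
                                   #  [3 1]]
print(max_pool_2x2(filter2_out))   # [[-1 1]
                                   #  [ 0 3]]
```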
EXAMPLE
Figures (i), (ii), and (iii) show max pooling applied to each of the 3 colour channels of an example input volume, using a stride of [2, 2]. Figure (iv) shows the same operation with a stride of [1, 1], which produces a 3 × 3 output matrix; here the pooling regions overlap.
Variants of the Basic Convolution Function
1. Valid
2. Same
3. Full
Zero padding - VALID
• (Top figure) In this convolutional network, we do not use any implicit zero padding. This causes the representation to shrink by five pixels at each layer: the kernel is six pixels wide, and valid convolution loses kernel width minus one pixels per layer. Starting from an input of sixteen pixels, we are only able to have three convolutional layers, and the last layer never moves the kernel, so arguably only two of the layers are truly convolutional. The rate of shrinking can be mitigated by using smaller kernels, but smaller kernels are less expressive and some shrinking is inevitable in this kind of architecture.
• (Bottom figure) By adding five implicit zeros to each layer, we prevent the representation from shrinking with depth. This allows us to make an arbitrarily deep convolutional network.
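A quick way to see the three variants is SciPy's convolve2d modes; the shapes below use the 6 x 6 image and 3 x 3 filter sizes from the earlier example (convolve2d flips the kernel, so only the output shapes are the point here).

```python
import numpy as np
from scipy.signal import convolve2d

image = np.ones((6, 6))     # same size as the 6 x 6 example image
kernel = np.ones((3, 3))    # same size as the 3 x 3 filters

for mode in ("valid", "same", "full"):
    print(mode, convolve2d(image, kernel, mode=mode).shape)
# valid (4, 4)  -> no zero padding; the output shrinks by kernel size - 1
# same  (6, 6)  -> just enough zero padding to keep the input size
# full  (8, 8)  -> output at every position where kernel and image overlap at all
```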
Tiled convolution