
MODULE 3

CNN
Convolutional Networks

• Convolutional networks, also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology.
• Examples
  • Time-series data, which can be thought of as a 1D grid taking samples at regular time intervals.
  • Image data, which can be thought of as a 2D grid of pixels.

• Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
Convolution Operation
Convolution is an operation on two functions of a real-valued argument.
•The convolution operation is typically denoted with an asterisk:
s(t) = (x ∗ w)(t) = ∫ x(a) w(t − a) da
•First argument (here, the function x) to the convolution is often referred to
as the input and the second argument (the function w) as the kernel. The
output is sometimes referred to as the feature map.
•In machine learning applications, the input is usually a multidimensional
array of data and the kernel is usually a multidimensional array of
parameters that are adapted by the learning algorithm.
•We will refer to these multidimensional arrays as tensors.
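For discrete, finite inputs, the operation can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours). Note that the sketch computes the unflipped form, i.e. cross-correlation, which is what most machine learning libraries implement under the name "convolution":

```python
def conv1d_valid(x, w):
    """Discrete 1-D convolution at valid positions: s(t) = sum_a x(t+a) * w(a).
    Strictly this is cross-correlation (the kernel is not flipped), but deep
    learning libraries typically implement this form and call it convolution."""
    k = len(w)
    return [sum(x[t + a] * w[a] for a in range(k))
            for t in range(len(x) - k + 1)]

# A moving average is a convolution with a uniform kernel:
signal = [1, 2, 3, 4, 5]
kernel = [0.5, 0.5]
print(conv1d_valid(signal, kernel))  # [1.5, 2.5, 3.5, 4.5]
```

Here the kernel has width 2, so a width-5 input yields 5 − 2 + 1 = 4 outputs.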
An example of 2-D convolution
Motivation

Convolution supports three important ideas that can help improve a machine
learning system:
1. Sparse interactions
2. Parameter sharing
3. Equivariant representations
Sparse interactions

• Convolutional networks typically have sparse interactions (also referred to as sparse connectivity or sparse weights).
This is accomplished by making the kernel smaller than the input.
• For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need
to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency.
• It also means that computing the output requires fewer operations. These improvements in efficiency are usually quite large.
• If there are m inputs and n outputs, then matrix multiplication requires m×n parameters and the algorithms used in practice
have O(m × n) runtime (per example).
• If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n
parameters and O(k × n) runtime.
• For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k
several orders of magnitude smaller than m.
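A quick back-of-the-envelope sketch of this saving (the sizes below are hypothetical, chosen only for illustration):

```python
# Hypothetical sizes: a 1-megapixel input producing a same-size output.
m = 1_000_000   # number of inputs
n = 1_000_000   # number of outputs
k = 9           # connections per output, e.g. a 3 x 3 kernel

dense_params  = m * n   # fully connected: O(m x n) parameters and runtime
sparse_params = k * n   # sparse connectivity: O(k x n)

print(dense_params // sparse_params)  # dense needs ~111,111x more parameters
```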
Parameter sharing

• Parameter sharing refers to using the same parameter for more than one function in a model.
• In a traditional neural net, each element of the weight matrix is used exactly once when computing
the output of a layer: it is multiplied by one element of the input and then never revisited.
• A network with parameter sharing is said to have tied weights, because the value of the weight applied to one input is
tied to the value of a weight applied elsewhere.
• The parameter sharing used by the convolution operation means that rather than learning a separate
set of parameters for every location, we learn only one set.
• This does not affect the runtime of forward propagation—it is still O(k × n)—but it does further
reduce the storage requirements of the model to k parameters.
• Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the
memory requirements and statistical efficiency.
Equivariant representations
• In the case of convolution, the particular form of parameter sharing causes the
layer to have a property called equivariance to translation.
• To say a function is equivariant means that if the input changes, the output
changes in the same way.
• Specifically, a function f (x) is equivariant to a function g if f(g(x)) = g(f (x)).
• In the case of convolution, if we let g be any function that translates the
input, i.e., shifts it, then the convolution function is equivariant to g.
• If we move the object in the input, its representation will move the same amount
in the output. This is useful for when we know that some function of a small
number of neighboring pixels is useful when applied to multiple input locations.
CNN - components

A typical layer of a convolutional network consists of three stages.


1. In the first stage, the layer performs several convolutions in parallel to
produce a set of linear activations.
2. In the second stage, each linear activation is run through a nonlinear
activation function, such as the rectified linear activation function. This stage is
sometimes called the detector stage.
3. In the third stage, we use a pooling function to modify the output of the
layer further.
The components of a typical convolutional neural network layer
Pooling
• A pooling function replaces the output of the net at a certain location with a
summary statistic of the nearby outputs.
• Max pooling operation reports the maximum output within a rectangular
neighborhood.
• Other popular pooling functions include the average of a rectangular neighborhood,
the L2 norm of a rectangular neighborhood, or a weighted average based on the
distance from the central pixel.
• In all cases, pooling helps to make the representation become approximately
invariant to small translations of the input. Invariance to translation means that if we
translate the input by a small amount, the values of most of the pooled outputs do not
change.
• Invariance to local translation can be a very useful property if we care more about whether
some feature is present than exactly where it is.
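A tiny 1-D sketch of this invariance (the values and names below are ours, for illustration): pooling over width-3 neighborhoods, most pooled outputs are unchanged when the input shifts by one position:

```python
# Max pooling over width-3 neighborhoods of a 1-D signal, evaluated before
# and after translating the input one position to the right.
def max_pool1d(x, size=3):
    return [max(x[i:i + size]) for i in range(len(x) - size + 1)]

x       = [0.1, 1.0, 0.2, 0.1, 0.0]
shifted = [0.0, 0.1, 1.0, 0.2, 0.1]   # x translated right by one position

print(max_pool1d(x))        # [1.0, 1.0, 0.2]
print(max_pool1d(shifted))  # [1.0, 1.0, 1.0] -- two of three outputs unchanged
```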
Pooling
• Pooling summarizes the responses over a whole neighborhood.
• This improves the computational efficiency of the network because the
next layer has fewer inputs to process.
• When the number of parameters in the next layer is a function of its
input size this reduction in the input size can also result in improved
statistical efficiency and reduced memory requirements for storing the
parameters.
• For many tasks, pooling is essential for handling inputs of varying size.
• For example, if we want to classify images of variable size, the input to
the classification layer must have a fixed size. This is usually
accomplished by varying the size of an offset between pooling regions
so that the classification layer always receives the same number of
summary statistics regardless of the input size.
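A minimal 1-D sketch of this idea (our own simplified version, in the spirit of the "adaptive" pooling layers found in modern libraries; the function name is ours): the pooling-region boundaries scale with the input length, so the number of summary statistics is fixed:

```python
def adaptive_max_pool1d(x, out_size):
    """Pool a 1-D input of any length down to exactly out_size values by
    varying the pooling-region boundaries with the input size."""
    n = len(x)
    return [max(x[(i * n) // out_size:((i + 1) * n) // out_size])
            for i in range(out_size)]

# Inputs of different lengths all produce 3 summary statistics:
print(adaptive_max_pool1d([1, 5, 2, 4, 3, 0], 3))        # [5, 4, 3]
print(adaptive_max_pool1d([2, 7, 1, 0, 3, 9, 4, 4], 3))  # [7, 3, 9]
```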
A convolutional layer
A CNN is a neural network with some convolutional layers
(and some other layers). A convolutional layer has a number
of filters that perform the convolution operation.

E.g., a beak detector.

Convolution: the filters are the network parameters to be learned. Each filter detects a small (3 x 3) pattern.

6 x 6 image:
1 0 0 0 0 1
0 1 0 0 1 0
0 0 1 1 0 0
1 0 0 0 1 0
0 1 0 0 1 0
0 0 1 0 1 0

Filter 1:
 1 -1 -1
-1  1 -1
-1 -1  1

Filter 2:
-1  1 -1
-1  1 -1
-1  1 -1
Convolution with Filter 1, stride = 1

Take the dot product of the 3 x 3 filter with each 3 x 3 patch of the 6 x 6 image, moving one pixel at a time. The top-left patch gives 3; sliding one pixel to the right gives -1.
Convolution with Filter 1, stride = 2

With a stride of 2, the filter moves two pixels at a time; the first two outputs in the top row of the 6 x 6 image are 3 and -3.
Convolution with Filter 1, stride = 1, over the whole 6 x 6 image gives a 4 x 4 output:
 3 -1 -3 -1
-3  1  0 -3
-3 -3  0  1
 3 -2 -2 -1
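The feature map above can be reproduced with a short Python sketch (a minimal implementation of strided 2-D cross-correlation, which is what CNN layers actually compute; the names are ours):

```python
def conv2d(image, kernel, stride=1):
    """2-D cross-correlation with a square kernel."""
    k = len(kernel)
    out = (len(image) - k) // stride + 1
    return [[sum(image[r * stride + i][c * stride + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out)]
            for r in range(out)]

image = [[1, 0, 0, 0, 0, 1],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 1, 0, 0],
         [1, 0, 0, 0, 1, 0],
         [0, 1, 0, 0, 1, 0],
         [0, 0, 1, 0, 1, 0]]
filter1 = [[ 1, -1, -1],
           [-1,  1, -1],
           [-1, -1,  1]]

for row in conv2d(image, filter1):
    print(row)
# First row: [3, -1, -3, -1]. With stride=2 the output shrinks to 2 x 2.
```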
Convolution with Filter 2, stride = 1

Repeating this for Filter 2 gives another 4 x 4 feature map:
-1 -1 -1 -1
-1 -1 -2  1
-1 -1 -2  1
-1  0 -4  3

Together the two filters produce two 4 x 4 images, forming a 2 x 4 x 4 feature map.
Color image: RGB, 3 channels

For a color image the input is a 6 x 6 x 3 tensor (one 6 x 6 grid per channel), and each filter is correspondingly a 3 x 3 x 3 tensor; the convolution sums over all three channels at each spatial position.
Convolution vs. Fully Connected

Convolution can be viewed as a fully connected layer with most connections removed and the remaining weights shared. Flattening the 6 x 6 image into inputs x1, ..., x36, each output of the convolution connects to only 9 of the 36 inputs (the pixels under the 3 x 3 filter), and the same 9 weights are reused at every position.
Flattening

The pooled feature maps are flattened into a single vector, which is then fed into a fully connected feedforward network for classification.
The whole CNN

Input image → Convolution → Max Pooling → ... (the Convolution + Max Pooling pair can repeat many times) → Flatten → Fully connected feedforward network → output (e.g., cat, dog, ...).
Max Pooling

The two filters (stride = 1) produce two 4 x 4 feature maps:

Filter 1:             Filter 2:
 3 -1 -3 -1           -1 -1 -1 -1
-3  1  0 -3           -1 -1 -2  1
-3 -3  0  1           -1 -1 -2  1
 3 -2 -2 -1           -1  0 -4  3
Max Pooling

After convolution and 2 x 2 max pooling, the 6 x 6 image becomes a new but smaller 2 x 2 image per filter:

Filter 1:  3 0        Filter 2:  -1 1
           3 1                    0 3

Each filter produces one channel of the new image.
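The pooled 2 x 2 results can be checked with a short sketch (a minimal 2 x 2 max pooling routine; the names are ours):

```python
def max_pool2x2(fmap):
    """2 x 2 max pooling with stride 2 over a 2-D feature map."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]), 2)]
            for r in range(0, len(fmap), 2)]

# The 4 x 4 feature maps produced by the two filters in the example above:
fmap1 = [[ 3, -1, -3, -1],
         [-3,  1,  0, -3],
         [-3, -3,  0,  1],
         [ 3, -2, -2, -1]]
fmap2 = [[-1, -1, -1, -1],
         [-1, -1, -2,  1],
         [-1, -1, -2,  1],
         [-1,  0, -4,  3]]

print(max_pool2x2(fmap1))  # [[3, 0], [3, 1]]
print(max_pool2x2(fmap2))  # [[-1, 1], [0, 3]]
```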
EXAMPLE
Figures (i), (ii) and (iii) show max pooling applied to the 3 colour channels of an example input volume
for the pooling layer. The operation uses a stride value of [2, 2].

Figure (iv) shows the operation applied with a stride value of [1, 1], resulting in a 3 × 3
matrix. Here we observe overlap between pooling regions.
Variants of the Basic Convolution Function

• When we refer to convolution in the context of neural networks,
we actually mean an operation that consists of many applications
of convolution in parallel.
• This is because convolution with a single kernel can only extract
one kind of feature, though at many spatial locations.
• Usually we want each layer of our network to extract many kinds
of features, at many locations.
• The input is a grid of vector-valued observations.
• For example, a color image has a red, green and blue intensity at
each pixel.
• In a multilayer convolutional network, the input to the second
layer is the output of the first layer, which usually has the output of
many different convolutions at each position.
• When working with images, we usually think of the input and
output of the convolution as being 3-D tensors, with one index into
the different channels and two indices into the spatial coordinates
of each channel.
Zero padding
• One essential feature of any convolutional network implementation is the ability
to implicitly zero-pad the input in order to make it wider.
• Without this feature, the width of the representation shrinks by one pixel less
than the kernel width at each layer.
• Zero padding the input allows us to control the kernel width and the size of the
output independently.
• Zero-padding refers to the process of symmetrically adding zeroes to the input
matrix.
• It’s a commonly used modification that allows the size of the input to be
adjusted to our requirement. It is mostly used in designing the CNN layers when
the dimensions of the input volume need to be preserved in the output volume.
Why Zero padding?

• Without zero padding, we are forced to choose between shrinking the spatial
extent of the network rapidly and using small kernels.
• Both scenarios significantly limit the expressive power of the network.
Zero padding - Types

1.Valid
2.Same
3.Full
Zero padding -VALID

• The extreme case in which no zero-padding is used
whatsoever, and the convolution kernel is only allowed to
visit positions where the entire kernel is contained entirely
within the image.
• In this case, all pixels in the output are a function of the
same number of pixels in the input
• The size of the output shrinks at each layer.
• If the input image has width m and the kernel has width k, the
output will be of width m − k + 1.
• The rate of this shrinkage can be dramatic if the kernels used are
large. Since the shrinkage is greater than 0, it limits the number of
convolutional layers that can be included in the network.
• Valid padding is used when it is desired to reduce the size of the
output feature map in order to reduce the number of parameters in
the model and improve its computational efficiency.
Zero padding - SAME
• It is another special case of the zero-padding setting where
just enough zero-padding is added to keep the size of the
output equal to the size of the input.
• This is achieved by adding rows and columns of pixels with a
value of zero around the edges of the input data before the
convolution operation.
• In this case, the network can contain as many convolutional
layers as the available hardware can support, since the
operation of convolution does not modify the architectural
possibilities available to the next layer
Zero padding - FULL

• In this case, enough zeroes are added for every pixel to be
visited k times in each direction, resulting in an output
image of width m + k − 1, where m is the width of the image
and k is the width of the kernel.
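The three padding regimes give the following output widths for a 1-D example (a small sketch; the function name and example sizes are ours):

```python
def output_width(m, k, padding):
    """Output width of a 1-D convolution of an m-wide input with a k-wide
    kernel, under the three zero-padding regimes described above."""
    if padding == "valid":
        return m - k + 1   # kernel must fit entirely within the input
    if padding == "same":
        return m           # just enough padding to preserve the input size
    if padding == "full":
        return m + k - 1   # every pixel is visited k times in each direction
    raise ValueError(padding)

m, k = 16, 6
for p in ("valid", "same", "full"):
    print(p, output_width(m, k, p))  # valid 11, same 16, full 21
```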
The effect of zero padding on network size

• (Top figure)In this convolutional network, we do not use any implicit zero
padding. This causes the representation to shrink by five pixels at each
layer. Starting from an input of sixteen pixels, we are only able to have
three convolutional layers, and the last layer does not ever move the kernel,
so arguably only two of the layers are truly convolutional. The rate of
shrinking can be mitigated by using smaller kernels, but smaller kernels are
less expressive and some shrinking is inevitable in this kind of architecture.
• (Bottom figure)By adding five implicit zeroes to each layer, we prevent the
representation from shrinking with depth. This allows us to make an
arbitrarily deep convolutional network.
Tiled convolution

• Tiled convolution offers a compromise between a convolutional
layer and a locally connected layer.
• Rather than learning a separate set of weights at every spatial
location, we learn a set of kernels that we rotate through as we
move through space. This means that immediately neighboring
locations will have different filters, like in a locally connected
layer, but the memory requirements for storing the parameters will
increase only by a factor of the size of this set of kernels, rather
than the size of the entire output feature map.
A comparison of locally connected layers, tiled convolution, and
standard convolution.

A locally connected layer has no sharing at all. We indicate that each
connection has its own weight by labeling each connection with a unique
letter.
Tiled convolution has a set of t different kernels. Here we illustrate the case of t = 2. One of
these kernels has edges labeled “a” and “b,” while the other has edges labeled “c” and “d.”
Each time we move one pixel to the right in the output, we move on to using a different kernel.
This means that, like the locally connected layer, neighboring units in the output have different
parameters.
Unlike the locally connected layer, after we have gone through all t available kernels, we cycle
back to the first kernel.
Traditional convolution is equivalent to tiled convolution with t = 1. There is
only one kernel and it is applied everywhere, as indicated in the diagram by
using the kernel with weights labeled “a” and “b” everywhere
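A 1-D sketch of tiled convolution (our own minimal version; the names are ours), cycling through the t kernels as we move across the input:

```python
def tiled_conv1d(x, kernels):
    """1-D tiled convolution: output position i uses kernel i mod t, so
    neighboring outputs have different parameters. t = 1 recovers standard
    convolution; t = number of outputs recovers a locally connected layer."""
    t = len(kernels)
    k = len(kernels[0])
    return [sum(x[i + j] * kernels[i % t][j] for j in range(k))
            for i in range(len(x) - k + 1)]

x = [1, 2, 3, 4, 5]
kernels = [[1, 0], [0, 1]]       # t = 2: outputs alternate between kernels
print(tiled_conv1d(x, kernels))  # [1, 3, 3, 5]
print(tiled_conv1d(x, [[1, 0]])) # [1, 2, 3, 4] -- t = 1 is plain convolution
```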
Structured Outputs
• Convolutional networks can be used to output a high-dimensional,
structured object, rather than just predicting a class label for a
classification task or a real value for a regression task.
• Typically this object is just a tensor, emitted by a standard
convolutional layer. For example, the model might emit a tensor S,
where Si,j,k is the probability that pixel (j, k ) of the input to the
network belongs to class i.
• This allows the model to label every pixel in an image and draw
precise masks that follow the outlines of individual objects.
• One issue that often comes up is that the output plane can be
smaller than the input plane.
• In the kinds of architectures typically used for classification of a
single object in an image, the greatest reduction in the spatial
dimensions of the network comes from using pooling layers with
large stride.
• In order to produce an output map of similar size as the input, one can
avoid pooling altogether.
• Another strategy is to simply emit a lower-resolution grid of labels .
• Finally, in principle, one could use a pooling operator with unit stride.
• One strategy for pixel-wise labeling of images is to produce an initial guess
of the image labels, then refine this initial guess using the interactions
between neighboring pixels. Repeating this refinement step several times
corresponds to using the same convolutions at each stage, sharing weights
between the last layers of the deep net.
• Once a prediction for each pixel is made, various methods can be used to
further process these predictions in order to obtain a segmentation of the
image into regions
Data Types

• The data used with a convolutional network usually consists of several
channels, each channel being the observation of a different quantity at some
point in space or time.
• One advantage to convolutional networks is that they can also process inputs
with varying spatial extents. These kinds of input simply cannot be
represented by traditional, matrix multiplication-based neural networks.
• This provides a compelling reason to use convolutional networks even when
computational cost and overfitting are not significant issues
Examples of different formats of data that can be used with
convolutional networks.

      Single channel                            Multichannel

1D    Audio waveform                            Skeleton animation data

2D    Audio data that has been preprocessed     Color image data
      with a Fourier transform

3D    Volumetric data                           Color video data
END
