
Object Detection

Convolutional neural networks (CNNs) are extensively applied in object detection, localization, video, and text processing, leveraging engineered features for various multidimensional applications. The document discusses techniques for content-based image retrieval, object localization, object detection, and their integration with recurrent neural networks for video classification. It highlights the hierarchical learning of features in CNNs and their competitive performance in text processing compared to traditional recurrent networks.

Uploaded by

swati.dbit


Figure 8.18: Example of image classification/localization in which the class “fish” is identified
together with its bounding box. The image is illustrative only.

be performed by deriving the weights from a trained deep-belief convolutional network [285].
This is analogous to the approach in traditional neural networks, where stacked Boltzmann
machines were among the earliest models used for pretraining.

8.6 Applications of Convolutional Networks


Convolutional neural networks have several applications in object detection, localization,
video, and text processing. Many of these applications work on the basic principle of using
convolutional neural networks to provide engineered features, on top of which multidimensional
applications can be constructed. The success of convolutional neural networks remains
unmatched by almost any class of neural networks. In recent years, competitive methods
have even been proposed for sequence-to-sequence learning, which has traditionally been
the domain of recurrent networks.

8.6.1 Content-Based Image Retrieval


In content-based image retrieval, each image is first converted into a multidimensional feature
vector by using a pretrained classifier like AlexNet. The pretraining is typically done up front
on a large data set like ImageNet. A large number of such pretrained classifiers are available
at [586]. The features from the fully connected layers of the classifier can be used to create
a multidimensional representation of the images, which can be used in conjunction with any
multidimensional retrieval system to provide results of high quality. The use of neural codes
for image retrieval is discussed in [16]. This approach works because the features extracted
from AlexNet have semantic significance to the different types of shapes present in the data.
As a result, the quality of the retrieval is generally quite high when working with these
features.
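As a minimal sketch of the retrieval step, assume that each image has already been mapped to a feature vector (in practice, the activations of a fully connected layer of a pretrained classifier). The toy 4-dimensional vectors, the image names, and the choice of cosine similarity are illustrative assumptions rather than details from the text; any multidimensional retrieval method could be substituted.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query, database, k=2):
    """Return the names of the k database images most similar to the query."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query, database[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical features; real ones would have thousands of dimensions.
database = {
    "fish_1": [0.9, 0.1, 0.0, 0.2],
    "fish_2": [0.8, 0.2, 0.1, 0.1],
    "bucket_1": [0.1, 0.9, 0.3, 0.0],
    "seat_1": [0.0, 0.2, 0.9, 0.4],
}
query = [0.85, 0.15, 0.05, 0.15]  # features of a hypothetical query image
print(retrieve(query, database, k=2))
```

Because the ranking depends only on the feature vectors, the same code works unchanged if the features come from a different pretrained classifier.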

[Figure 8.19 diagram: the convolution layers (weights fixed for both classification and
regression) feed two heads. Classification head (train for classification): fully connected,
fully connected, softmax, producing class probabilities. Regression head (train for regression):
fully connected, fully connected, linear layer, producing the bounding box (four numbers).]

Figure 8.19: The broad framework of classification and localization

8.6.2 Object Localization


In object localization, we have a fixed set of objects in an image, and we would like to
identify the rectangular regions in the image in which each object occurs. The basic idea is
to take an image with a fixed number of objects and encase each of them in a bounding
box. For simplicity, we consider the case in which a single object exists in the image.
Image localization is usually integrated with the classification problem, in which we wish
to classify the object in the image and also draw a bounding box around it. We have shown
an example of image classification and localization in Figure 8.18, in which the class “fish”
is identified, and a bounding box is drawn around the portion of the image that delineates
that class.

The bounding box of an object can be uniquely identified with four numbers. A common
choice is to identify the top-left corner of the bounding box, and the two dimensions of
the box. Predicting these four numbers is a regression problem with multiple targets. Here,
the key is to understand that one can train almost the same model for both classification
and regression, which vary only in terms of the final two fully connected layers. This is
because the semantic features extracted from the convolution network are often highly
generalizable across a wide variety of tasks. Therefore, one can use the following approach:

1. First, we train a neural network classifier like AlexNet or use a pretrained version of
this classifier. In the first phase, it suffices to train the classifier only with image-class
pairs. One can even use an off-the-shelf pretrained version of the classifier, which was
trained on ImageNet.

2. The last two fully connected layers and the softmax layer are removed. This removed
set of layers is referred to as the classification head. A new set of two fully connected
layers and a linear regression layer is attached. Only these layers are then trained with
training data containing images and their bounding boxes. This new set of layers is
referred to as the regression head. Note that the weights of the convolution layers are
fixed, and are not changed. Both the classification and regression heads are shown
in Figure 8.19. Since the classification and regression heads are not connected to one
another in any way, the two heads can be trained independently. The convolution
layers play the role of creating visual features for both classification and regression.

3. One can optionally fine-tune the convolution layers to be sensitive to both classification
and regression (since they were originally trained only for classification). In such a
case, both classification and regression heads are attached, and the training data for
images, their classes, and bounding boxes are shown to the network. Backpropagation
is used to fine-tune all layers. This full architecture is shown in Figure 8.19.

4. The entire network (with both classification and regression heads attached) is then
used on the test images. The outputs of the classification head provide the class
probabilities, whereas the outputs of the regression head provide the bounding boxes.

Figure 8.20: Example of object detection. Here, four objects are identified together with
their bounding boxes. The four objects are “fish,” “girl,” “bucket,” and “seat.” The image
is illustrative only.
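The test-time use of the two heads in the last step can be sketched as follows. This is a simplified illustration under stated assumptions: the shared feature vector stands in for the output of the frozen convolution layers, each head is reduced to a single fully connected layer (the text uses two per head), and all weights are hypothetical values chosen by hand rather than learned.

```python
import math

def linear(x, weights, bias):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(scores):
    """Convert raw scores into class probabilities."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, cls_head, reg_head):
    """Apply both heads to the same shared features."""
    cls_w, cls_b = cls_head
    reg_w, reg_b = reg_head
    probs = softmax(linear(features, cls_w, cls_b))  # classification head
    box = linear(features, reg_w, reg_b)             # regression head
    return probs, box

# Hypothetical shared features and head weights.
features = [1.0, 2.0]
cls_head = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])      # two classes
reg_head = ([[10.0, 0.0], [0.0, 5.0], [20.0, 10.0], [5.0, 15.0]],
            [0.0, 0.0, 0.0, 0.0])                      # x, y, width, height

probs, box = predict(features, cls_head, reg_head)
print(probs, box)
```

The two heads never read each other's outputs, which is why they can be trained independently on top of fixed convolution weights.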

One can obtain results of superior quality by using a sliding-window approach. The basic
idea in the sliding-window approach is to perform the localization at multiple locations in
the image with the use of a sliding window, and then integrate the results of the different
runs. An example of this approach is the Overfeat method [441]. Refer to the bibliographic
notes for pointers to other localization methods.

8.6.3 Object Detection


Object detection is very similar to object localization, except that there is a variable number
of objects of different classes in the image. In this case, one wishes to identify all the objects
in the image together with their classes. We have shown an example of object detection
in Figure 8.20, in which there are four objects corresponding to the classes “fish,” “girl,”
“bucket,” and “seat.” The bounding boxes of these classes are also shown in the figure.

Object detection is generally a more difficult problem than that of localization because of
the variable number of outputs. In fact, one does not even know a priori how many objects
there are in the image. For example, one cannot use the architecture of the previous section,
because it is not clear how many classification or regression heads one might attach to the
convolutional layers.

The simplest approach to this problem is to use a sliding-window approach. In the
sliding-window approach, one tries all possible bounding boxes in the image, and the object
localization approach is applied to each of them to detect a single object. As a result, one
might detect different objects in different bounding boxes, or the same object in overlapping
bounding boxes. The detections from the different bounding boxes can then be integrated in
order to provide the final result. Unfortunately, the approach can be rather expensive. For an
image of size L × L, the number of possible bounding boxes is of the order of L^4. Note that
one would have to perform the classification/regression for each of these L^4 possibilities for
each image at test time. This is a problem, because one generally expects the testing times to
be modest enough to provide real-time responses.
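The fourth-power growth can be checked with a short count. The sketch below assumes one particular convention that the text does not spell out: a box is any axis-aligned rectangle whose edges lie on pixel boundaries and whose width and height are at least one pixel. Under that assumption the exact count is (L(L+1)/2)^2, which grows as roughly L^4/4.

```python
def count_boxes_brute_force(L):
    """Enumerate every (top, bottom, left, right) box in an L x L image."""
    count = 0
    for top in range(L):
        for bottom in range(top + 1, L + 1):
            for left in range(L):
                for right in range(left + 1, L + 1):
                    count += 1
    return count

def count_boxes_closed_form(L):
    """Pick 2 of the L + 1 horizontal and 2 of the L + 1 vertical grid lines."""
    pairs = L * (L + 1) // 2
    return pairs * pairs

# The two counts agree on small images.
for size in (1, 2, 5, 8):
    assert count_boxes_brute_force(size) == count_boxes_closed_form(size)

# For a 224 x 224 image, the count is already in the hundreds of millions.
print(count_boxes_closed_form(224))
```

Even with the factor of roughly one quarter, running a classifier and a regressor on every candidate box is far too slow for real-time use, which motivates restricting the candidate set.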

In order to address this issue, region proposal methods were advanced. The basic idea
of a region proposal method is that it can serve as a general-purpose object detector that
merges regions with similar pixels together to create larger regions. Therefore, the region
proposal methods are used to first create a set of candidate bounding boxes, and then the
object classification/localization method is run in each of them. Note that some candidate
regions might not have valid objects, and others might have overlapping objects. The results
on the candidate regions are then integrated in order to identify all the objects in the image.
This broader approach has been used in various techniques like MCG [172], EdgeBoxes [568],
and SelectiveSearch [501].
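The text does not say how the detections from overlapping candidate boxes are integrated. One widely used option, shown here purely as an assumption and not as the method of the cited techniques, is non-maximum suppression: among boxes that overlap heavily, only the highest-scoring one is kept. Boxes are written as (x, y, width, height) with (x, y) the top-left corner, and the scores and the 0.5 overlap threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(detections, threshold=0.5):
    """Keep the best-scoring box in each group of heavily overlapping boxes.

    detections: list of (score, box) pairs for a single class.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep the box only if it does not overlap a better box too much.
        if all(iou(box, kept_box) <= threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Two candidate boxes cover the same object; a third covers another object.
detections = [
    (0.9, (10, 10, 50, 50)),
    (0.8, (12, 12, 50, 50)),
    (0.7, (100, 100, 40, 40)),
]
print(non_max_suppression(detections))
```

In a multi-class setting, this filtering would typically be applied separately to the detections of each class.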

8.6.4 Natural Language and Sequence Learning


While the preferred way of machine learning with text sequences is that of recurrent neural
networks, the use of convolutional neural networks has become increasingly popular in
recent years. At first sight, convolutional neural networks do not seem like a natural fit for
text-mining tasks. First, image shapes are interpreted in the same way, irrespective of where
they are in the image. This is not quite the case for text, where the position of a word in a
sentence seems to matter quite a bit. Second, issues such as position translation and shift
cannot be treated in the same way in text data. Neighboring pixels in an image are usually
very similar, whereas neighboring words in text are almost never the same. In spite of these
differences, the systems based on convolutional networks have shown improved performance
in recent years.

Just as an image is represented as a 2-dimensional object with an additional depth
dimension defined by the number of color channels, a text sequence is represented as a
1-dimensional object with depth defined by its dimensionality of representation. The
dimensionality of representation of a text sentence is equal to the lexicon size for the case of
one-hot encoding. Therefore, instead of 3-dimensional boxes with a spatial extent and a
depth (color channels/feature maps), the filters for text data are 2-dimensional boxes with
a window (sequence) length for sliding along the sentence and a depth defined by the
lexicon size. In later layers of the convolutional network, the depth is defined by the number
of feature maps rather than the lexicon size. Furthermore, the number of filters in a given layer
defines the number of feature maps in the next layer (as in image data). In image data,
one performs convolutions at all 2-dimensional locations, whereas in text data one performs
convolutions at all 1-dimensional points in the sentence with the same filter. One challenge
with this approach is that the use of one-hot encoding increases the number of channels,
and therefore blows up the number of parameters in the filters in the first layer. The
lexicon size of a typical corpus may often be of the order of 10^6. Therefore, various types of
pretrained embeddings of words, such as word2vec or GloVe [371], are used (cf. Chapter 2)
in lieu of the one-hot encodings of the individual words. Such word encodings are semantically
rich, and the dimensionality of the representation can be reduced to a few thousand
(from a hundred-thousand). This approach can provide an order of magnitude reduction
in the number of parameters in the first layer, in addition to providing a semantically rich
representation. All other operations (like max-pooling or convolutions) in the case of text
data are similar to those of image data.
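A small sketch may make the 1-dimensional convolution and the parameter count concrete. It assumes no padding and a stride of one, and it ignores biases. The 2-dimensional word vectors, the filter weights, the window length of 3, the 64 filters, and the two depths compared at the end are all hypothetical values chosen for illustration.

```python
def conv1d(sequence, filt):
    """Slide one filter along a sentence and return one activation per position.

    sequence: list of word vectors, each of length `depth`.
    filt: list of `window` weight vectors, each of length `depth`.
    """
    window = len(filt)
    outputs = []
    for start in range(len(sequence) - window + 1):
        total = 0.0
        for offset in range(window):
            word_vec = sequence[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        outputs.append(total)
    return outputs

def first_layer_parameters(window, depth, num_filters):
    """Number of filter weights in the first layer (biases ignored)."""
    return window * depth * num_filters

# A 5-word sentence with hypothetical 2-dimensional word vectors.
sequence = [[1, 0], [0, 1], [1, 1], [0, 0], [2, 1]]
filt = [[1, 0], [0, 1], [1, 1]]  # window length 3, depth 2
print(conv1d(sequence, filt))

# Assumed sizes: one-hot depth of 100,000 versus an embedding depth of 2,000.
print(first_layer_parameters(3, 100_000, 64))
print(first_layer_parameters(3, 2_000, 64))
```

The parameter count is linear in the depth, so any reduction in the dimensionality of the word representation carries over directly to the first-layer filters.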

8.6.5 Video Classification

Videos can be considered generalizations of image data in which a temporal component
is inherent to a sequence of images. This type of data can be considered spatio-temporal
data, which requires us to generalize the 2-dimensional spatial convolutions to 3-dimensional
spatio-temporal convolutions. Each frame in a video can be considered an image, and one
therefore receives a sequence of images in time. Consider a situation in which each image
is of size 224 × 224 × 3, and a total of 10 frames are received. Therefore, the size of the
video segment is 224 × 224 × 10 × 3. Instead of performing spatial convolutions with a
2-dimensional spatial filter (with an additional depth dimension capturing 3 color channels),
we perform spatiotemporal convolutions with a 3-dimensional spatiotemporal filter (and
a depth dimension capturing the color channels). Here, it is interesting to note that the
nature of the filter depends on the data set at hand. A purely sequential data set (e.g., text)
requires 1-dimensional convolutions with windows, an image data set requires 2-dimensional
convolutions, and a video data set requires 3-dimensional convolutions. We refer to the
bibliographic notes for pointers to several papers that use 3-dimensional convolutions for
video classification.
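For the 224 × 224 × 10 × 3 video segment above, the shape arithmetic of a 3-dimensional convolution can be sketched as follows. The 3 × 3 × 3 filter size, the 64 filters, the stride of one, and the absence of padding are assumptions made for illustration; the text does not fix these values.

```python
def valid_output_size(input_size, filter_size, stride=1):
    """Number of filter positions along one axis without padding."""
    return (input_size - filter_size) // stride + 1

def conv3d_output_shape(video_shape, filter_shape, num_filters):
    """video_shape: (height, width, frames, channels); filter_shape: (fh, fw, ft)."""
    height, width, frames, _channels = video_shape
    fh, fw, ft = filter_shape
    return (
        valid_output_size(height, fh),
        valid_output_size(width, fw),
        valid_output_size(frames, ft),
        num_filters,  # one feature map per filter
    )

def conv3d_parameters(filter_shape, channels, num_filters):
    """Filter weights only; each filter spans the full input depth."""
    fh, fw, ft = filter_shape
    return fh * fw * ft * channels * num_filters

video_shape = (224, 224, 10, 3)
print(conv3d_output_shape(video_shape, (3, 3, 3), 64))
print(conv3d_parameters((3, 3, 3), 3, 64))  # spatiotemporal filters
print(3 * 3 * 3 * 64)                       # comparable 2-dimensional filters
```

With these assumed sizes, extending the filter by three frames in time triples the number of weights per filter relative to the purely spatial case.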

An interesting observation is that 3-dimensional convolutions add only a limited amount
to what one can achieve by averaging the classifications of individual frames by image
classifiers. A part of the problem is that motion adds only a limited amount to the information
that is available in the individual frames for classification purposes. Furthermore,
sufficiently large video data sets are hard to come by. For example, even a data set containing a
million videos is often not sufficient because the amount of data required for 3-dimensional
convolutions is much larger than that required for 2-dimensional convolutions. Finally,
3-dimensional convolutional neural networks are good for relatively short segments of video
(e.g., half a second), but they might not be so good for longer videos.
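The frame-averaging baseline mentioned above is simple enough to sketch directly. The per-frame class probabilities below are hypothetical stand-ins for the outputs of an image classifier applied to each frame separately.

```python
def average_frame_predictions(frame_probs):
    """Average per-frame class probabilities into one video-level prediction."""
    num_frames = len(frame_probs)
    num_classes = len(frame_probs[0])
    return [
        sum(frame[c] for frame in frame_probs) / num_frames
        for c in range(num_classes)
    ]

# Hypothetical outputs of an image classifier on three frames, two classes.
frame_probs = [
    [0.5, 0.5],
    [0.75, 0.25],
    [1.0, 0.0],
]
video_probs = average_frame_predictions(frame_probs)
predicted_class = max(range(len(video_probs)), key=lambda c: video_probs[c])
print(video_probs, predicted_class)
```

Note that this baseline ignores the order of the frames entirely, so it cannot use motion at all.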

For the case of longer videos, it makes sense to combine recurrent neural networks
(or LSTMs) with convolutional neural networks. For example, we can use 2-dimensional
convolutions over individual frames, but a recurrent network is used to carry over states
from one frame to the next. One can also use 3-dimensional convolutional neural networks
over short segments of video, and then hook them up with recurrent units. Such an approach
helps in identifying actions over longer time horizons. Refer to the bibliographic notes for
pointers to methods that combine convolutional and recurrent neural networks.
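The idea of carrying a state from one frame to the next can be sketched with a plain recurrent update. This is a simplification under stated assumptions: the per-frame feature vectors stand in for the output of a 2-dimensional convolutional network, a basic tanh recurrence is used instead of an LSTM, and all weights are hypothetical.

```python
import math

def rnn_step(state, features, w_state, w_input):
    """One update: new_state = tanh(W_state * state + W_input * features)."""
    new_state = []
    for i in range(len(state)):
        total = sum(w * s for w, s in zip(w_state[i], state))
        total += sum(w * f for w, f in zip(w_input[i], features))
        new_state.append(math.tanh(total))
    return new_state

def summarize_video(frame_features, w_state, w_input, state_size):
    """Carry a hidden state across the frames and return the final state."""
    state = [0.0] * state_size
    for features in frame_features:
        state = rnn_step(state, features, w_state, w_input)
    return state

# Hypothetical per-frame features and recurrent weights.
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_state = [[0.5, 0.0], [0.0, 0.5]]
w_input = [[1.0, 0.0], [0.0, 1.0]]

forward = summarize_video(frame_features, w_state, w_input, state_size=2)
backward = summarize_video(frame_features[::-1], w_state, w_input, state_size=2)
print(forward, backward)
```

Unlike averaging, the final state depends on the order in which the frames arrive, which is what allows such a model to capture actions that unfold over longer time horizons.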

8.7 Summary

This chapter discusses the use of convolutional neural networks with a primary focus on
image processing. These networks are biologically inspired and are among the earliest
success stories of the power of neural networks. An important focus of this chapter is the
classification problem, although these methods can be used for additional applications such
as unsupervised feature learning, object detection, and localization. Convolutional neural
networks typically learn hierarchical features in different layers, where the earlier layers
learn primitive shapes, whereas the later layers learn more complex shapes. The
backpropagation methods for convolutional neural networks are closely related to the problems of
deconvolution and visualization. Recently, convolutional neural networks have also been
used for text processing, where they have shown competitive performance with recurrent
neural networks.

8.8 Bibliographic Notes


The earliest inspiration for convolutional neural networks came from Hubel and Wiesel’s
experiments with the cat’s visual cortex [212]. Based on many of these principles, the notion
of the neocognitron was proposed in early work. These ideas were then generalized to the
first convolutional network, which was referred to as LeNet-5 [279]. An early discussion on
the best practices and principles of convolutional neural networks may be found in [452].
An excellent overview of convolutional neural networks may be found in [236]. A tutorial on
convolution arithmetic is available in [109]. A brief discussion of applications may be found
in [283].

The earliest data set that was used popularly for training convolutional neural
networks was the MNIST database of handwritten digits [281]. Later, larger datasets like
ImageNet [581] became more popular. Competitions such as the ImageNet challenge
(ILSVRC) [582] have served as sources of some of the best algorithms over the last five
years. Examples of neural networks that have done well at various competitions include
AlexNet [255], ZFNet [556], VGG [454], GoogLeNet [485], and ResNet [184]. The ResNet
is closely related to highway networks [505], and it provides an iterative view of feature
engineering. A useful precursor to GoogLeNet was the Network-in-Network (NiN)
architecture [297], which illustrated some useful design principles of the inception module (such
as the use of bottleneck operations). Several explanations of why ResNet works well are
provided in [185, 505]. The use of inception modules between skip connections is proposed
in [537]. The use of stochastic depth in combination with residual networks is discussed
in [210]. Wide residual networks are proposed in [549]. A related architecture, referred to
as FractalNet [268], uses both short and long paths in the network, but does not use skip
connections. Training is done by dropping subpaths in the network, although prediction is
done on the full network.

Off-the-shelf feature extraction methods with pretrained models are discussed in [223,
390, 585]. In cases where the nature of the application is very different from ImageNet data,
it might make sense to extract features only from the lower layers of the pretrained model.
This is because lower layers often encode more generic/primitive features like edges and basic
shapes, which tend to work across an array of settings. The local-response normalization
approach is closely related to the contrast normalization discussed in [221].

The work in [466] proposes that it makes sense to replace the max-pooling layer with
a convolutional layer with increased stride. Not using a max-pooling layer is an advantage
