Object Detection
Figure 8.18: Example of image classification/localization in which the class “fish” is identi-
fied together with its bounding box. The image is illustrative only.
be performed by deriving the weights from a trained deep-belief convolutional network [285].
This is analogous to the approach in traditional neural networks, where stacked Boltzmann
machines were among the earliest models used for pretraining.
Figure 8.19: Classification and regression heads for localization. The convolution layers (weights fixed for both classification and regression) feed two heads: a classification head with two fully connected layers and a softmax producing class probabilities (trained for classification), and a regression head with two fully connected layers and a linear layer producing the four bounding-box numbers (trained for regression).
1. First, we train a neural network classifier such as AlexNet using image-class pairs, or we simply use an off-the-shelf version of such a classifier that has been pretrained on ImageNet. In this first phase, it suffices to have only image-class pairs as training data.
2. The last two fully connected layers and the softmax layer are removed. This removed set of layers is referred to as the classification head. A new set of two fully connected layers and a linear regression layer is attached. Only these new layers are then trained with data containing images and their bounding boxes. This new set of layers is referred to as the regression head. Note that the weights of the convolution layers are fixed and are not changed. Both the classification and regression heads are shown in Figure 8.19. Since the classification and regression heads are not connected to one another in any way, they can be trained independently. The convolution layers play the role of creating visual features for both classification and regression.

Figure 8.20: Example of object detection. Here, four objects are identified together with their bounding boxes. The four objects are “fish,” “girl,” “bucket,” and “seat.” The image is illustrative only.
3. One can optionally fine-tune the convolution layers to be sensitive to both classification and regression (since they were originally trained only for classification). In this case, both the classification and regression heads are attached, and training data containing images together with their classes and bounding boxes is shown to the network. Backpropagation is then used to fine-tune all layers. This full architecture is shown in Figure 8.19.
4. The entire network (with both classification and regression heads attached) is then
used on the test images. The outputs of the classification head provide the class
probabilities, whereas the outputs of the regression head provide the bounding boxes.
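The four-step recipe above can be sketched in code. The following is a minimal, illustrative sketch assuming PyTorch; the toy backbone and the layer sizes (4096-unit hidden layers, 10 classes) are assumptions for illustration, not the exact configuration described in the text:

```python
import torch
import torch.nn as nn

class LocalizationNet(nn.Module):
    """Frozen convolutional backbone with separate classification and
    regression heads, following the two-phase training recipe above."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # step 2: conv weights are fixed
            p.requires_grad = False
        # Classification head: two fully connected layers; the softmax is
        # typically folded into the cross-entropy loss.
        self.cls_head = nn.Sequential(
            nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )
        # Regression head: two fully connected layers plus a linear layer
        # emitting the four bounding-box numbers.
        self.reg_head = nn.Sequential(
            nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 4),
        )

    def forward(self, x):
        feats = torch.flatten(self.backbone(x), 1)
        return self.cls_head(feats), self.reg_head(feats)

# Tiny stand-in for the convolutional part of a classifier like AlexNet.
conv = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                     nn.AdaptiveAvgPool2d(4))
model = LocalizationNet(conv, feat_dim=8 * 4 * 4, num_classes=10)
class_scores, boxes = model(torch.randn(2, 3, 32, 32))
print(class_scores.shape, boxes.shape)
```

Because the two heads share only the frozen backbone, each head can be trained independently on its own loss (cross-entropy for the classification head, a regression loss such as mean squared error for the bounding-box head), exactly as the steps above describe.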
One can obtain results of superior quality by using a sliding-window approach. The basic
idea in the sliding-window approach is to perform the localization at multiple locations in
the image with the use of a sliding window, and then integrate the results of the different
runs. An example of this approach is the OverFeat method [441]. Refer to the bibliographic
notes for pointers to other localization methods.
Object detection is generally a more difficult problem than localization because of the variable number of outputs; one does not even know a priori how many objects there are in the image. For example, one cannot use the architecture of the previous section, because it is not clear how many classification or regression heads one would need to attach to the convolutional layers.
The simplest approach to this problem is to use a sliding window approach. In the sliding
window approach, one tries all possible bounding boxes in the image, on which the object
localization approach is applied to detect a single object. As a result, one might detect
different objects in different bounding boxes, or the same object in overlapping bounding
boxes. The detections from the different bounding boxes can then be integrated in order
to provide the final result. Unfortunately, this approach can be rather expensive. For an image of size L × L, the number of possible bounding boxes is O(L^4), and one would have to perform the classification/regression for each of these possibilities for each image at test time. This is a problem, because one generally expects testing times to be modest enough to provide real-time responses.
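To see where the O(L^4) figure comes from: an axis-aligned box is determined by two horizontal coordinates and two vertical coordinates, each chosen from L positions. A small self-contained check in pure Python:

```python
def num_boxes(L):
    """Count axis-aligned bounding boxes in an L x L image, where a box is
    determined by choosing x1 <= x2 and y1 <= y2 from L positions each."""
    per_axis = L * (L + 1) // 2          # ordered pairs (a, b) with a <= b
    return per_axis ** 2                 # grows as L^4 / 4

def num_boxes_bruteforce(L):
    """Direct enumeration, used to sanity-check the closed form."""
    return sum(1
               for x1 in range(L) for x2 in range(x1, L)
               for y1 in range(L) for y2 in range(y1, L))

assert num_boxes(6) == num_boxes_bruteforce(6)
print(num_boxes(224))   # hundreds of millions of candidates for a 224 x 224 image
```

Running a full classification/regression pass on each of these candidates is clearly infeasible, which motivates the region proposal methods discussed next.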
In order to address this issue, region proposal methods were developed. The basic idea of a region proposal method is to serve as a general-purpose generator of candidate objects by merging regions of similar pixels together to create larger regions. Region proposal methods are therefore used to first create a set of candidate bounding boxes, and the object classification/localization method is then run inside each of them. Note that some candidate regions might not contain valid objects, and others might contain overlapping objects. The resulting detections are then integrated in order to identify all the objects in the image. This broad approach has been used in various techniques like MCG [172], EdgeBoxes [568], and SelectiveSearch [501].
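A common way to integrate overlapping detections of the same object (one the text does not spell out) is greedy non-maximum suppression: keep the highest-scoring box, discard any remaining box that overlaps it too strongly, and repeat. A minimal pure-Python sketch, with detections represented as (x1, y1, x2, y2, score) tuples and an illustrative overlap threshold:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2, score) tuples."""
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-scoring surviving box
        kept.append(best)
        remaining = [d for d in remaining
                     if iou(best[:4], d[:4]) < iou_threshold]
    return kept

dets = [(10, 10, 50, 50, 0.9),          # two heavily overlapping detections
        (12, 12, 52, 52, 0.8),          # of (presumably) the same object
        (100, 100, 140, 140, 0.7)]      # a separate, non-overlapping object
print(nms(dets))  # keeps the 0.9 box and the 0.7 box
```

The same idea applies whether the candidate boxes come from a sliding window or from a region proposal method: only the locally best detection for each object survives the integration step.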
A problem with this approach is that the use of one-hot encoding increases the number of channels, and therefore blows up the number of parameters in the filters of the first layer. The lexicon size of a typical corpus is often of the order of 10^6. Therefore, various types of pretrained word embeddings, such as word2vec or GloVe [371], are used (cf. Chapter 2) in lieu of one-hot encodings of the individual words. Such word encodings are semantically rich, and the dimensionality of the representation can be reduced to a few thousand (from a hundred thousand). This approach can provide an order-of-magnitude reduction in the number of parameters in the first layer, in addition to providing a semantically rich representation. All other operations (like max-pooling or convolutions) on text data are similar to those on image data.
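The parameter saving in the first layer can be made concrete: a one-dimensional text filter spanning w word positions has w × d weights per output channel, where d is the input depth, so shrinking d from the lexicon size to the embedding dimension shrinks the filter proportionally. A small pure-Python illustration; the lexicon size, embedding dimension, filter width, and filter count below are assumed typical values, not figures from the text:

```python
def first_layer_params(depth, filter_width, num_filters):
    """Parameters in a 1-D convolution over text: each filter spans
    filter_width word positions across the full input depth (+ one bias)."""
    return num_filters * (filter_width * depth + 1)

LEXICON = 10**6    # one-hot depth: one channel per word in the lexicon
EMBED_DIM = 300    # assumed pretrained embedding size (e.g. word2vec)

one_hot = first_layer_params(depth=LEXICON, filter_width=3, num_filters=100)
embedded = first_layer_params(depth=EMBED_DIM, filter_width=3, num_filters=100)
print(one_hot, embedded, one_hot // embedded)
```

With these assumed values, the one-hot first layer needs hundreds of millions of parameters while the embedding-based one needs only tens of thousands, which is the reduction the passage above describes.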
8.7 Summary
This chapter discusses the use of convolutional neural networks with a primary focus on
image processing. These networks are biologically inspired and are among the earliest suc-
cess stories of the power of neural networks. An important focus of this chapter is the
classification problem, although these methods can be used for additional applications such
as unsupervised feature learning, object detection, and localization. Convolutional neural
networks typically learn hierarchical features in different layers, where the earlier layers
learn primitive shapes, whereas the later layers learn more complex shapes. The backprop-
agation methods for convolutional neural networks are closely related to the problems of
deconvolution and visualization. Recently, convolutional neural networks have also been
used for text processing, where they have shown competitive performance with recurrent
neural networks.