
Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks


arXiv:1809.03193v2 [cs.CV] 20 Aug 2019

Shivang Agarwal(∗,1) , Jean Ogier du Terrail(∗,1,2) , Frédéric Jurie(1)


(∗) equal contribution
(1) Normandie Univ, UNICAEN, ENSICAEN, CNRS
(2) Safran Electronics and Defense

August 21, 2019

Abstract

Object detection, the computer vision task dealing with detecting instances of objects of a certain class (e.g., 'car', 'plane', etc.) in images, has attracted a lot of attention from the community during the last six years. This strong interest can be explained not only by the importance this task has for many applications, but also by the phenomenal advances in this area since the arrival of deep convolutional neural networks (DCNNs). This article reviews the recent literature on object detection with deep CNNs in a comprehensive way. It covers not only the design decisions made in modern deep CNN object detectors, but also provides an in-depth perspective on the set of challenges currently faced by the computer vision community, as well as some complementary and new directions on how to overcome them. In its last part it goes on to show how object detection can be extended to other modalities and conducted under different constraints. The appendix also reviews the public datasets and the associated state-of-the-art algorithms.

Contents

1 Introduction 4
  1.1 What is object detection in images? How to evaluate detector performance? 6
    1.1.1 Problem definition 6
    1.1.2 Performance evaluation 6
    1.1.3 Other detection tasks 8
  1.2 From Hand-crafted to Data Driven Detectors 9
  1.3 Overview of Recent Detectors 10

2 On the Design of Modern Deep Detectors 11
  2.1 Architecture of the Networks 11
    2.1.1 Backbone Networks 11
    2.1.2 Single Stage Detectors 14
    2.1.3 Double Stage Detectors 17
    2.1.4 Cascades 21
    2.1.5 Parts-Based Models 23
  2.2 Model Training 24
    2.2.1 Losses 24
    2.2.2 Hyper-Parameters 26
    2.2.3 Pre-Training 27
    2.2.4 Data Augmentation 27
  2.3 Inference 29
  2.4 Concluding Remarks 30

3 Going Forward in Object Detection 31
  3.1 Major Challenges 31
    3.1.1 Scale Variance 32
    3.1.2 Rotational Variance 33
    3.1.3 Domain Adaptation 34
    3.1.4 Object Localization 36
    3.1.5 Occlusions 37
    3.1.6 Detecting Small Objects 38
  3.2 Complementary New Ideas in Object Detection 39
    3.2.1 Graph Networks 39
    3.2.2 Adversarial Trainings 41
    3.2.3 Use of Contextual Information 42
  3.3 Concluding Remarks 43

4 Extending Object Detection 43
  4.1 Detecting Objects in Other Modalities 43
    4.1.1 Object Detection in Videos 44
    4.1.2 Object Detection in 3D Point Clouds 45
  4.2 Detecting Objects Under Constraints 46
    4.2.1 Weakly Supervised Detection 46
    4.2.2 Few-shot Detection 48
    4.2.3 Zero-shot Detection 49
    4.2.4 Fast and Low Power Detection 49
  4.3 Towards Versatile Object Detectors 50
    4.3.1 Interpretability and Robustness 52
    4.3.2 Universal Detector, Lifelong Learning 52
  4.4 Concluding Remarks 53

5 Conclusions 53

Appendix A Datasets and Results 91
  A.1 Classical Datasets with Common Objects 91
    A.1.1 Pascal-VOC 91
    A.1.2 MS COCO 92
    A.1.3 ImageNet Detection Task 93
    A.1.4 VisualGenome 93
    A.1.5 OpenImages 93
  A.2 Specialized datasets 95
    A.2.1 Aerial Imagery 95
    A.2.2 Text Detection in Images 96
    A.2.3 Face Detection 97
    A.2.4 Pedestrian Detection 98
    A.2.5 Logo Detection 100
    A.2.6 Traffic Signs Detection 101
    A.2.7 Other Datasets 102
  A.3 3D Datasets 102
  A.4 Video Datasets 103
  A.5 Concluding Remarks 103
1 Introduction

The task of automatically recognizing and locating objects in images and videos is important in order to make computers able to understand or interact with their surroundings. For humans, it is one of the primary tasks, in the paradigm of visual intelligence, in order to survive, work and communicate. If one wants machines to work for us or with us, they will need to make sense of their environment as well as humans, or in some cases even better than humans. Solving the problem of object detection, with all the challenges it presents, has been identified as a major precursor to solving the problem of semantic understanding of the surrounding environment.

A large number of academic as well as industry researchers have already shown their interest in it by focusing on applications such as autonomous driving, surveillance, relief and rescue operations, deploying robots in factories, pedestrian and face detection, brand recognition, visual effects in images, digitizing texts, understanding aerial images, etc., which all have object detection as a major challenge at their core.

The Semantic Gap, defined by Smeulders et al. [349] as the lack of coincidence between the information one can extract from some visual data and its interpretation by a user in a given situation, is one of the main challenges object detection must deal with. There is indeed a difference of nature between the raw pixel intensities contained in images and the semantic information depicting objects.

Object detection is a natural extension of the classification problem. The added challenge is to correctly detect the presence of, and accurately locate, the object instance(s) in the image (Figure 1). It is (usually) a supervised learning problem in which, given a set of training images, one has to design an algorithm that can accurately locate and correctly classify as many object instances as possible in a rectangular box, while avoiding false detections of background or multiple detections of the same instance. The images can contain object instances from the same class, from different classes, or no instances at all. The object categories in the training and testing sets are then supposed to be statistically similar. An instance can occupy very few pixels (0.01% to 0.25%) as well as the majority of the pixels (80% to 90%) of an image. Apart from the variation in size, the variation can be in lighting, rotation, appearance, background, etc. There may not be enough data to cover all these variations well enough. Small objects, in particular, are detected with low performance because the information needed to detect them is present but compressed and hard to decode without some prior knowledge or context. Some object instances can also be occluded.

An additional difficulty is that real-world applications like video object detection demand this problem to be solved in real time. With the current state-of-the-art detectors that is often not the case: the fastest detectors are usually worse than the best performing ones (e.g., heavy ensembles).

We present this review to connect the dots between the various deep learning and data driven techniques proposed in recent years, as they have brought about huge improvements in performance, even though the recently introduced object detection datasets are much more challenging. We intend to study what makes them work and what their shortcomings are. We discuss the seminal works in the field and the incremental works which are more application oriented. We also examine how they try to overcome each of the challenges. The earlier methods based on hand-crafted features are outside the scope of this review. Problems related to object detection, such as semantic segmentation, are also outside its scope, except when they are used to bring contextual information to detectors. Salient object detection, being related to semantic segmentation, will also not be treated in this survey.

Several surveys related to object detection have been written in the past, addressing specific tasks such as pedestrian detection [84], moving objects in surveillance systems [161], object detection in remote sensing images [53], face detection [126, 453], and facial landmark detection [420], to cite only some illustrative examples. In contrast with this article, the aforementioned surveys do not cover the latest advances obtained with deep neural networks. Recently, four non peer-reviewed surveys appeared on arXiv that also treat the subject of object detection using deep learning methods. This article shares the same motivations as [470] and [35], but covers the topic more comprehensively and extensively than these two surveys, which only cover the backbones and flagship articles associated with modern object detection. This work investigates more thoroughly papers that one would not necessarily call mainstream, like boosting methods or true cascades, and studies related topics like weakly supervised learning and approaches that carry promise but have yet to become widely used by the community (graph networks and generative methods). Concurrently to this article, the paper by [220] goes into many details about modern object detectors. We wanted this survey to be more than just an inventory of existing methods, and to provide the reader with a complete tool-set to fully understand how the state of the art came to be and what the potential leads to advance it further are, by studying surrounding topics such as interpretability, lifelong detectors, few-shot learning and domain adaptation (in addition to delving into the non-mainstream methods already mentioned).

Figure 1: Visualization of sample examples from different kinds of datasets for the detection task: (a) generic object detection [88], (b) text detection [112], (c) pedestrian detection [399], (d) traffic-sign detection [490], (e) face detection [432] and (f) objects in aerial images detection [421].

The following subsections give an overview of the problem, present some of the seminal works in the field (hand-crafted as well as data driven) and describe the task and evaluation methodology. Section 2 goes into the details of the design of the current state-of-the-art models. Section 3 presents recent methodological advances as well as the main challenges modern detectors have to face. Section 4 shows how to extend the presented detectors to different detection tasks (video, 3D) or how to make them perform under different constraints (energy efficiency, training data, etc.). Finally, Section 5 concludes the review. We also list a wide variety of datasets and the associated state-of-the-art performances in the Appendix.
1.1 What is object detection in images? How to evaluate detector performance?

1.1.1 Problem definition

Object detection is one of the various tasks related to the inference of high-level information from images. Even if there is no universally accepted definition of it in the literature, it is usually defined as the task of locating all the instances of a given category (e.g., 'car' instances in the case of car detection) while avoiding raising alarms when/where no instances are present. The localization can be provided as the center of the object in the image, as a bounding box containing the object, or even as the list of the pixels belonging to the object. In some rare cases, only the presence/absence of at least one instance of the category is sought, without any localization.

Object detection is always defined with respect to a dataset containing images associated with a list of descriptions (position, scale, etc.) of the objects each image contains. Let us denote by I an image and by O(I) the set of its N_I^* object descriptions, with

O(I) = \{(Y_1^*, Z_1^*), \ldots, (Y_i^*, Z_i^*), \ldots, (Y_{N_I^*}^*, Z_{N_I^*}^*)\}

where Y_i^* ∈ 𝒴 represents the category of the i-th object and Z_i^* ∈ 𝒵 a representation of its location/scale/geometry in the image. 𝒴 is the set of possible categories, which can be hierarchical or not. 𝒵 is the space of possible locations/scales/geometries of objects in images. It can be the position of the center of the object (x_c, y_c) ∈ R^2, a bounding box (x_min, y_min, x_max, y_max) ∈ R^4 encompassing the object, a mask, etc.

Using these notations, object detection can be defined as the function associating an image with a set of detections

D(I, λ) = \{(Y_1, Z_1), \ldots, (Y_i, Z_i), \ldots, (Y_{N_I(λ)}, Z_{N_I(λ)})\}.

The operating point λ allows one to fix a trade-off between false alarms and missed detections.

Object detection is related to, but different from, object segmentation, which aims at grouping the pixels belonging to the same object into a single region, and from semantic segmentation, which is similar to object segmentation except that the classes may also refer to varied backgrounds or 'stuff' (e.g., 'sky', 'grass', 'water' categories). It is also different from Object Recognition, which is usually defined as recognizing (i.e., giving the name of the category of) an object contained in an image or a bounding box, assuming there is only one object in the image; for some authors, Object Recognition involves detecting all the objects in an image. Instance object detection is more restricted than object detection, as the detector is focused on a single object (e.g., a particular car model) and not on any object of a given category. In the case of videos, the object detection task is to detect the objects on each frame of the video.

1.1.2 Performance evaluation

Evaluating a detector on a given image I is done by comparing the actual list of object locations O(I) (the so-called ground truth) of a given category with the detections D(I, λ) provided by the detector. Such a comparison is possible only once the two following definitions are given:

1. A geometric compatibility function

G : (Z, Z^*) ∈ 𝒵^2 → \{0, 1\}

defining the conditions that must be met for considering two locations as equivalent (a concrete example based on intersection over union is sketched after this list).

2. An association matrix A ∈ \{0, 1\}^{N(I,λ) × N^*(I)} defining a bipartite graph between the detected objects \{Z_1, \ldots, Z_{N(I,λ)}\} and the ground-truth objects \{Z_1^*, \ldots, Z_{N^*(I)}^*\}, with:

\sum_{j=1}^{N^*(I)} A(i, j) ≤ 1 for every detection i,

\sum_{i=1}^{N(I,λ)} A(i, j) ≤ 1 for every ground-truth object j,

G(Z_i, Z_j^*) = 0 ⟹ A(i, j) = 0.
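In practice, the geometric compatibility function G is almost always an intersection-over-union (IoU) test between two axis-aligned boxes, with the 0.50 threshold used by Pascal VOC (see below). The following is a minimal sketch written for this survey rather than code from any of the cited works; it assumes boxes are given as (xmin, ymin, xmax, ymax) tuples.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width/height of the intersection rectangle (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def G(detected_box, ground_truth_box, threshold=0.5):
    """Pascal VOC-style compatibility: 1 if the IoU exceeds the threshold, 0 otherwise."""
    return 1 if iou(detected_box, ground_truth_box) >= threshold else 0
```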
Figure 2: An illustration of predicted boxes being marked as True Positive (TP) or False Positive (FP). The blue boxes are ground-truths for the class "dog". A predicted box is marked as TP if the predicted class is correct and the overlap with the ground truth is greater than a threshold. It is marked as FP if its overlap is less than that threshold, if the same object instance is detected again, if it is misclassified, or if a background region is predicted as an object instance (legend: Ground Truth, True Positive, FP: Localization, FP: Double detection, FP: Misclassification, FP: Background). The left dog is marked as False Negative (FN). Best viewed in color.

With such definitions, the number of correct detections is given by

TP(I, λ) = \sum_{i,j} A(i, j).

If several association matrices A satisfy the previous constraints, the one maximizing the number of correct detections is chosen. An illustration of TP and False Positives (FP) in an image is shown in Figure 2.

Such a definition can be viewed as the size of the maximal matching in a bipartite graph. Nodes are locations (ground truth on the one hand, detections on the other hand). Edges are based on the acceptance criterion G and on the constraints stating that a ground-truth object and a detected object can each be associated only once.

It is possible to average the correct detections at the level of a test set T through the two following ratios:

Precision(λ) = \frac{\sum_{I \in T} TP(I, λ)}{\sum_{I \in T} N(I, λ)}, \qquad Recall(λ) = \frac{\sum_{I \in T} TP(I, λ)}{\sum_{I \in T} N^*(I)}.

The Precision/Recall curve is obtained by varying the operating point λ. The Mean Average Precision can be computed by averaging the Precision for several Recall values (typically 11 equally spaced values).

The definition of G can vary from dataset to dataset. However, only a few definitions reflect most of the current research. One of the most common ones comes from the Pascal VOC challenge [88]. It assumes ground truths are defined by non-rotated rectangular bounding boxes containing object instances, associated with class labels. The diversity of the methods to be evaluated prevents the use of ROC (Receiver Operating Characteristic) or DET (Detection Error Trade-off) curves, commonly used for face detection, as they would assume all the methods use the same window extraction scheme (such as the sliding window mechanism), which is not always the case. In the Pascal VOC challenge, object detection is evaluated by one separate AP score per category. For a given category, the Precision/Recall curve is computed from the ranked outputs (bounding boxes) of the method to be evaluated. Recall is the proportion of positive examples ranked above a given rank, while precision is the proportion of the boxes above that rank which are positive. The AP summarizes the Precision/Recall curve and is defined as the mean (interpolated) precision over a set of eleven equally spaced recall levels. Output bounding boxes are judged as true positives (correct detections) if the overlap ratio (intersection over union, or IoU) exceeds 0.50. Detection outputs are assigned to ground truths in the order given by decreasing confidence scores. Duplicate detections of the same object are considered as false detections. The performance over the whole dataset is computed by averaging the APs across all the categories.

The recent and popular MS COCO challenge [214] relies on the same principles. The main difference is that the overall performance (mAP) is obtained by averaging the AP obtained with 10 different IoU thresholds between 0.50 and 0.95. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) also has a detection task in which algorithms have to produce triplets of class labels, bounding boxes and confidence scores. Each image has mostly one dominant object in it. Missing object detections are penalized in the same way as duplicate detections, and the winner of the detection challenge is the one who achieves first-place AP on most of the object categories. The challenge also has an Object Localization task, with a slightly different definition. The motivation is not to penalize algorithms if one of the detected objects is actually present while not included in the ground-truth annotations, which is not rare due to the size of the dataset and the number of categories (1000). Algorithms are expected to produce 5 class labels (in decreasing order of confidence) and 5 bounding boxes (one for each class label). The error of an algorithm on an image is 0 if one of the 5 bounding boxes is a true positive (correct class label and correct localization according to IoU), and 1 otherwise. The error is averaged over all the images of the test set.

Some recent datasets, like DOTA [421], propose two tasks, named detection on horizontal bounding boxes and detection on oriented bounding boxes, corresponding to two different kinds of ground truths (with or without target orientations), no matter how the methods were trained. In some other datasets, the scale of the detection is not important and a detection is counted as a True Positive if its coordinates are close enough to the center of the object. This is the case for the VeDAI dataset [302]. In the particular case of object detection in 3D point clouds, such as in the KITTI object detection benchmark [98], the criterion is similar to Pascal VOC, except that the boxes are in 3D and the overlap is measured in terms of volume intersection.

Object detection in videos. Regarding the detection of objects in videos, the most common practice is to evaluate the performance by considering each frame of the video as an independent image and averaging the performance over all the frames, as done in the ImageNet VID challenge [319]. It is also possible to move away from the 2D case and to evaluate video-mAP based on tubelet IoU, where a tubelet is detected if and only if the mean per-frame IoU for every frame in the video is greater than a threshold σ and the tube label is correctly predicted. We take this definition directly from [107], where it is used to compute mAP and ROC curves at the video level.

1.1.3 Other detection tasks

This survey only covers the methodologies for performance evaluation found in the recent literature. But, besides these common evaluation measures, there are many more specific ones, as object detection can be combined with other complex tasks, e.g., 3D orientation and layout inference in [423]. The reader can refer to the review by Mariano et al. [232] to explore this topic. It is also worth mentioning the very recent work of Oksuz et al. [260], which proposes a novel metric providing richer and more discriminative information than AP, especially with respect to the localization error.

We have decided to orient this survey mainly towards bounding-box tasks, even if there is a tendency to move away from this task considering the performances of modern deep learning methods, which already approach human accuracy on some datasets. The reasons for this choice are numerous. First of all, historically speaking, bounding boxes were one of the first object detection tasks and the body of literature on this topic is already immense. Secondly, not all datasets provide annotations down to the level of pixels. In aerial imagery, for instance, most of the datasets provide only bounding boxes. It is also the case for some pedestrian detection datasets. Instance-segmentation-level annotations are still costly for the moment, even with the recent development of annotator-friendly algorithms (e.g., [32, 230]) that offer pixel-level annotations at the expense of a few user clicks. Maybe in the future all datasets will contain annotations down to the level of pixels, but it is not yet the case. Even when one has pixel-level annotations for tasks like instance segmentation, which is becoming the standard, bounding boxes are needed from the detector to distinguish between two instances of the same class, which explains why most modern instance segmentation pipelines like [118] have a bounding-box branch. Therefore, metrics evaluating the bounding boxes produced by the models are still relevant in that case. One could also make the argument that bounding boxes are more robust annotations because they are less sensitive to annotator noise, but it is debatable. For all of these reasons the rest of this survey will tackle mainly bounding boxes and their associated tasks.

1.2 From Hand-crafted to Data Driven Detectors

While the first object detectors initially relied on mechanisms to align a 2D/3D model of the object on the image using simple features, such as edges [217], key-points [224] or templates [278], the arrival of Machine Learning (ML) was the first revolution that shook up the area. Among the most popular ML algorithms used for object detection were boosting, e.g., [326], and Support Vector Machines, e.g., [64]. This first wave of ML-based detectors was all based on hand-crafted (engineered) visual features processed by classifiers or regressors. These hand-crafted features were as diverse as Haar wavelets [398], edgelets [418], shapelets [320], histograms of oriented gradients [64], bags-of-visual-words [187], integral histograms [287], color histograms [399], covariance descriptors [388], linear binary patterns (Wang et al. [408]), or their combinations [85]. One of the most popular detectors before the DCNN revolution was the Deformable Part Model of Felzenszwalb et al. [90] and its variants, e.g., [322].

This very rich literature on visual descriptors has been wiped out in less than five years by Deep Convolutional Neural Networks, a class of deep, feed-forward artificial neural networks. DCNNs are inspired by the connectivity patterns between neurons of the human visual cortex and use no pre-processing, as the network itself learns the filters previously hand-engineered by traditional algorithms, making it independent from prior knowledge and human effort. They are said to be end-to-end trainable and rely solely on the training data. This leads to their major disadvantage of requiring copious amounts of data. The first uses of ConvNets for detection and localization go back to the early 1990s for faces [392], hands [258] and multi-character strings [237]. Then, in the 2000s, they were used for text [65], face [96, 263] and pedestrian [328] detection.

However, the merits of DCNNs for object detection were recognized by the community only after the seminal works of Krizhevsky et al. [181] and Sermanet et al. [327] on the challenging ImageNet dataset. Krizhevsky et al. [181] were the first to demonstrate localization through DCNNs in the ILSVRC 2012 localization and detection tasks. Just one year later, Sermanet et al. [327] were able to describe how a DCNN can be used to locate and detect object instances. They won the ILSVRC 2013 localization and detection competition and also showed that combining the classification, localization and detection tasks can simultaneously boost the performance of all three.

The first DCNN-based object detectors applied a fine-tuned classifier on each possible location of the image in a sliding-window manner [262], or on some specific regions of interest [105] obtained through a region proposal mechanism. Girshick et al. [105] treated each region proposal as a separate classification and localization task. Therefore, given an arbitrary region proposal, they deformed it to a warped region of fixed dimensions. A DCNN is used to extract a fixed-length feature vector from each proposal, and category-specific linear SVMs are then used to classify them. Since it was a region-based CNN, they called it R-CNN. Another important contribution was to show the usability of transfer learning in DCNNs: since data is scarce, supervised pre-training on an auxiliary task can lead to a significant boost to the performance of domain-specific fine-tuning. Sermanet et al. [327], Girshick et al. [105] and Oquab et al. [262] were among the first authors to show that DCNNs can lead to dramatically higher object detection performance on the ImageNet detection challenge [66] and on PASCAL VOC [88], respectively, compared to previous state-of-the-art systems based on HOG [64] or SIFT [225].
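To make the R-CNN recipe just described more concrete, here is a schematic sketch written for this survey (not the authors' implementation): propose_regions, warp, cnn_features and the per-class scoring functions are hypothetical placeholders standing in for Selective Search, the fixed-size warping, the pre-trained DCNN and the linear SVMs of Girshick et al. [105].

```python
def rcnn_detect(image, propose_regions, warp, cnn_features, class_svms,
                warp_size=(224, 224), score_threshold=0.0):
    """Schematic R-CNN: proposals -> warped crops -> DCNN features -> per-class SVM scores."""
    detections = []
    for box in propose_regions(image):           # e.g. ~2000 class-agnostic region proposals
        crop = warp(image, box, warp_size)       # deform the proposal to a fixed size
        feature = cnn_features(crop)             # fixed-length vector from the DCNN
        for label, svm_score in class_svms.items():
            score = svm_score(feature)           # one linear classifier per category
            if score > score_threshold:
                detections.append((label, box, score))
    return detections  # typically followed by per-class NMS (see Section 2.3)
```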
Since most prevalent DCNNs had to use a fixed-size input, because of the fully connected layers at the end of the network, they had to either warp or crop the image to make it fit into that size. He et al. [116] came up with the idea of aggregating the feature maps of the final convolutional layer, so that the fully connected layer at the end of the network gets a fixed-size input even if the input images in the dataset are of varying sizes and aspect ratios. This helped reduce overfitting, increased robustness and improved the generalizability of the existing models. Compared to R-CNN, which used one forward pass per proposal to generate the feature map, the methodology proposed by [116] allowed sharing the computation among all the proposals: a single forward pass is done for the whole image and the regions are then selected from the final feature map according to the proposals. This naturally increased the speed of the network by over one hundred times.

All the previous approaches train the network in multistage pipelines that are complex, slow and inelegant. They include extracting features through CNNs, classifying through SVMs and finally fitting bounding-box regressors. Since each task is handled separately, the convolutional layers cannot take advantage of end-to-end learning and bounding-box regression. Girshick [104] helped alleviate this problem by streamlining all the tasks in a single model using a multitask loss. As we will explain later, this not only improved the accuracy but also made the network run faster at test time.

1.3 Overview of Recent Detectors

The foundations of DCNN-based object detection having been laid out, the field could mature and move further away from classical methods. The fully-convolutional paradigm glimpsed in [327] gained more traction every day in the community.

When Ren et al. [309] successfully replaced the only component of Fast R-CNN that still relied on non-learned heuristics by inventing the RPN (Region Proposal Network), it put the last nail in the coffin of traditional object detection and started the age of completely end-to-end architectures. Specifically, the anchor mechanism, developed for the RPN, was here to stay. This grid of fixed a-priori boxes (or anchors), not necessarily corresponding to the receptive field of the feature-map pixel they lie on, created a framework for fully-convolutional classification and regression and is used nowadays by most pipelines like [221] or [216], to cite a few.

These conceptual changes make the detection pipelines far more elegant and efficient than their counterparts when dealing with big training sets. However, this comes at a cost: the resulting detectors become complete black boxes and, because they are more prone to overfitting, they require more data than ever.

[309] and its other double-stage variants are now the go-to methods for object detection and will be thoroughly explored in Sec. 2.1.3. Although this line of work is now prominent, other choices were explored, all based on fully-convolutional architectures.

Single-stage algorithms, which had been completely abandoned since Viola et al. [398], have now become reasonable alternatives thanks to the discriminative power of CNN features. Redmon et al. [308] first showed that the simplest architectural design could bring unfathomable speed with acceptable performance. Liu et al. [221] sophisticated the pipeline by using anchors at different layers while making it faster and more accurate than Redmon et al. [308]. These two seminal works gave birth to a considerable amount of literature on single-stage methods that we will cover in Sec. 2.1.2. Boosting and Deformable Part-based Models, which were once the norm, have yet to make their comeback into the mainstream. However, some recent popular works used close ideas, like Dai et al. [63], and thus these approaches will also be discussed in the survey Sections 2.1.4 and 2.1.5.

The fully-convolutional nature of these new dominant architectures allows all kinds of implementation tricks during training and at inference time that will be discussed at the end of the next section. However, it makes the subtle design choices of the different architectures something of a dark art to newcomers.
The goal of the rest of the survey is to provide a complete view of this new landscape while giving the keys to understand the underlying principles that guide interesting new architectural ideas. Before diving into the subject, the survey started by reminding the readers about the object detection task and the metrics associated with it. After introducing the topic and touching upon some general information, the next section gets right into the heart of object detection by presenting the designs of recent deep learning based object detectors.

2 On the Design of Modern Deep Detectors

Here we analyze, investigate and dissect the current state-of-the-art models and the intuition behind their approaches. We can divide the whole detection pipeline into three major parts. The first part focuses on the arrangement of convolutional layers to get proposals (if required) and box predictions. The second part is about setting the various training hyper-parameters, deciding upon the losses, etc., to make the model converge faster. The third part's center of attention is the various approaches used to refine the predictions from the converged model(s) at test time and therefore get better detection performance. The first part has received the attention of most researchers; the second and third parts, not so much. To give a clear overview of all the major components and the popular options available in them, we present a map of the object detection pipeline in Figure 3.

Most of the ideas from the following sub-sections have achieved top accuracy on the challenging MS COCO [214] object detection challenge and the PASCAL VOC [88] detection challenge, or on some other very challenging datasets.

2.1 Architecture of the Networks

The architecture of DCNN object detectors follows a Lego-like construction pattern based on chaining different building blocks. The first part of this Section will focus on what researchers call the backbone of the DCNN, meaning the feature extractor from which the detector draws its discriminative power. We will then tackle diverse arrangements of increasing complexity found in DCNN detectors: from single-stage to multiple-stage methods. Finally, we will talk about Deformable Part Models and their place in the deep learning landscape.

2.1.1 Backbone Networks

A lot of deep neural networks originally designed for classification tasks have been adopted for the detection task as well, and many modifications have been made to them to adapt to the additional difficulties encountered. The following discussion is about these networks and the modifications in question.

Backbones: Backbone networks play a major role in object detection models. Huang et al. [140] partially confirmed the common observation that, as the classification performance of the backbone increases on the ImageNet classification task [319], so does the performance of object detectors based on those backbones. This is the case at least for popular double-stage detectors like Faster-RCNN [309] and R-FCN [62], although for SSD [221] the object detection performance remains around the same (see the following Sections for details about these 3 architectures).

However, as the size of the network increases, inference and training become slower and require more data. The most popular architectures, in increasing order of inference time, are MobileNet [134], VGG [343], Inception [151, 364, 365], ResNet [117], Inception-ResNet [366], etc. All of the above architectures were first borrowed from the classification problem with little or no modification.
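As a toy illustration of what a backbone hands over to the detection-specific parts discussed next, the sketch below (a minimal PyTorch example written for this survey, not one of the architectures cited above) stacks a few convolutional stages and exposes feature maps at several strides, which is exactly what the multi-scale and layer-fusion strategies described in the following paragraphs consume.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Minimal convolutional backbone returning feature maps at strides 4, 8 and 16."""

    def __init__(self):
        super().__init__()

        def stage(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
            )

        self.c2 = stage(3, 64)      # cumulative stride 2
        self.c3 = stage(64, 128)    # cumulative stride 4
        self.c4 = stage(128, 256)   # cumulative stride 8
        self.c5 = stage(256, 512)   # cumulative stride 16

    def forward(self, x):
        x = self.c2(x)
        p3 = self.c3(x)    # finer map, better suited to small objects
        p4 = self.c4(p3)
        p5 = self.c5(p4)   # coarser map, larger receptive field
        return p3, p4, p5


if __name__ == "__main__":
    maps = TinyBackbone()(torch.randn(1, 3, 256, 256))
    print([m.shape for m in maps])  # spatial sizes 64x64, 32x32 and 16x16
```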
Figure 3: A map of the object detection pipeline along with the various options available in its different parts (backbone networks; single-stage and double-stage frameworks; classification, regression and supplementary losses; initialization, pre-training and data augmentation at training time; NMS variants, clustering and learning-based pruning at inference time; and evaluation metrics such as mAP, DET and ROC curves). Images are taken from a Dataset and then fed to a Network. They use one of the Single-Stage (see Figure 6) or Double-Stage Frameworks (see Figure 8) to make Predictions of the probabilities of each class and their bounding boxes at each spatial location of the feature map. During training these predictions are fed to Losses and during testing they are fed to Inference for pruning. Based on the final predictions, a Performance is evaluated against the ground-truths. All the ideas are referenced in the text. Best viewed in color.
Figure 4: An illustration of how the backbones can be modified to give predictions at multiple scales and through fusion of features. (a) An unmodified backbone. (b) Predictions obtained from different scales of the image. (c) Feature maps added to the backbone to get predictions at different scales. (d) A top-down network added in parallel to the backbone. (e) Top-down network along with predictions at different scales.

Some other backbones used in object detectors, which were not included in the analysis of [140] but have given state-of-the-art performances on the ImageNet [66] or COCO [214] detection tasks, are Xception [57], DarkNet [306], Hourglass [256], Wide-Residual Net [193, 445], ResNeXt [426], DenseNet [139], Dual Path Networks [50] and Squeeze-and-Excitation Net [136]. The recently proposed DetNet [208] is a backbone network designed specifically for high-performance detection: it avoids the large down-sampling factors present in classification networks. Dilated Residual Networks [439] also worked with similar motivations to extract features with fewer strides. SqueezeNet [148] and ShuffleNet [461] choose instead to focus on speed. More information on networks focusing on speed can be found in Section 4.2.4.

Adapting the mentioned backbones to the inherently multi-scale nature of object detection is a challenge; the following paragraphs give examples of commonly used strategies.

Multi-scale detections: Papers [28, 200, 431] made independent predictions on multiple feature maps to take into account objects of different scales. The lower layers with finer resolution have generally been found better for detecting small objects than the coarser top layers. Similarly, coarser layers are better for the bigger objects. Liu et al. [221] were the first to use multiple feature maps for detecting objects, and their method has been widely adopted by the community. Since the final feature maps of the networks may not be coarse enough to detect sizable objects in large images, additional layers with a wider receptive field are also usually added.

Fusion of layers: In object detection, it is also helpful to make use of the context pixels around the object [102, 446, 451]. One interesting argument in favor of fusing different layers is that it integrates information from feature maps with different receptive fields, so the surrounding local context can help to disambiguate some of the object instances.

Some papers [51, 92, 156, 192, 471] have experimented with fusing different feature layers of these backbones so that the finer layers can make use of the context learned in the coarser layers. Lin et al. [215, 216], Shrivastava et al. [338] and Woo et al. [415] took one step further and proposed a whole additional top-down network, in addition to the standard bottom-up network, connected to it through lateral connections. The bottom-up network used can be any one of those mentioned above. While Shrivastava et al. [338] used only the finest layer of the top-down architecture for detection, Feature Pyramid Network (FPN) [216] and RetinaNet [215] used all the layers of the top-down architecture for detection. FPN used the feature maps thus generated in a two-stage detector fashion, while RetinaNet used them in a single-stage detector fashion (see Section 2.1.3 and Section 2.1.2 for more details). FPN [215] has been a part of the top entries in the MS COCO 2017 challenge. An illustration of multiple scales and fusion of layers is shown in Figure 4.

Now that we have seen how to best use the feature maps of the object detectors' backbones, we can explore the architectural details of the different major players in DCNN object detection, starting with the most immediate methods: single-stage detectors.

2.1.2 Single Stage Detectors

The two most popular approaches in the single-stage detection category are YOLO [308] and SSD [221]. In this Section we will go through their basic functioning, some upsides and downsides of using these two approaches, and further improvements proposed on them.

YOLO: Redmon et al. [308] presented for the first time a single-stage method for object detection in which raw image pixels are converted to bounding-box coordinates and class probabilities, and which can be optimized end-to-end directly. It directly predicts boxes in a single feed-forward pass without reusing any component of the neural network or generating proposals of any kind, thus speeding up the detector. They started by dividing the image into an S × S grid and assuming B bounding boxes per grid cell. Each cell containing the center of an object instance is responsible for the detection of that object. Each bounding box predicts 4 coordinates, an objectness score and class probabilities. This reframed object detection as a regression problem. To have a receptive field that covers the whole image, they included a fully connected layer towards the end of the network.

SSD: Liu et al. [221], inspired by the Faster-RCNN architecture, used reference boxes of various sizes and aspect ratios to predict object instances (Figure 5), but they completely got rid of the region proposal stage (discussed in the following Section). They were able to do this by making the whole network work as a regressor as well as a classifier. During training, thousands of default boxes corresponding to different anchors on different feature maps learn to discriminate between objects and background, as well as to directly localize and predict class probabilities for the object instances. This is achieved with the help of a multitask loss. Since, at inference time, a lot of boxes try to localize the objects, a post-processing step like Greedy NMS is generally required to suppress duplicate detections. In order to accommodate objects of all sizes, they added additional convolutional layers to the backbone and used them, instead of a single feature map, to improve the performance. This method was later applied to approaches related to two-stage detectors too [215].

Pros and Cons: Oftentimes single-stage detectors do not give as good a performance as the double-stage ones, but they are a lot faster [140], although some double-stage detectors can be faster than single-stage ones due to architectural tricks, and modern single-stage detectors outperform the older multi-stage pipelines.
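Greedy NMS, mentioned above as the standard post-processing step for the dense predictions of SSD-like detectors (and discussed further in Section 2.3), can be sketched in a few lines. This is an illustration written for this survey, assuming boxes as (xmin, ymin, xmax, ymax) tuples and class-agnostic suppression; real pipelines usually run it per class.

```python
def iou(a, b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes overlapping it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep  # indices of the surviving detections
```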
Figure 5: The workings of anchors. k anchors are declared at each spatial location of the final feature map(s). A classification score for each class (including background) is predicted for each anchor ((C+1)k scores), together with 4k regression coordinates. Regression coordinates are predicted only for anchors having an overlap greater than a pre-decided threshold with the ground-truth. For the special case of predicting objectness, C is set to one. This idea was introduced in [309].

The various advantages of the YOLO strategy are that it is extremely fast, with 45 to 150 frames per second, that it sees the entire image, as opposed to region-proposal based strategies, which is helpful for encoding contextual information, and that it learns generalizable representations of objects. But it also has some obvious disadvantages. Since each grid cell has only two bounding boxes, it can predict at most two objects per grid cell, which is a particularly inefficient strategy for small objects. It struggles to precisely localize some objects compared to two-stage methods. Another drawback of YOLO is that it uses a coarse feature map at a single scale only.

To address these issues, SSD used a dense set of boxes and considered predictions from several feature maps instead of one. It improved upon the performance of YOLO, but since it has to sample from this dense set of detections at test time, it gives lower performance on the MS COCO dataset compared to two-stage detectors. The two-stage object detectors get a sparse set of proposals on which they have to perform predictions.

Further improvements: Redmon and Farhadi [306] and Redmon and Farhadi [307] suggested a lot of small changes in versions 2 and 3 of the YOLO method. Changes like applying batch normalization, using higher-resolution input images, removing the fully connected layer to make the network fully convolutional, clustering box dimensions, location prediction and multi-scale training helped to improve performance, while a custom network (DarkNet) helped to improve speed.

Many further developments have been proposed by many researchers on the Single Shot MultiBox Detector. The major advancements over the years are illustrated in Figure 6. Deconvolutional Single Shot Detector (DSSD) [92], instead of the element-wise sum, used a deconvolutional module to increase the resolution of top layers and added each layer, through element-wise products, to the previous layer. Rainbow SSD [156] proposed to concatenate features of shallow layers to top layers by max-pooling, as well as features of top layers to shallow layers through a deconvolution operation; the final fused information increased from a few hundred to 2,816 channels per feature map. RUN [192] proposed a 3-way residual block to combine adjacent layers before the final prediction. Cao et al. [29] used concatenation modules and element-sum modules to add contextual information in a slightly different manner. Zheng et al. [471] slightly tweaked DSSD by fusing a smaller number of layers and adding extra ConvNets to improve speed as well as performance. They all improved upon the performance of the conventional SSD, and they lie within a small range among themselves on the Pascal VOC 2012 test set [88], but they added a considerable amount of computational cost, making the detector a little slower. WeaveNet [51] aimed at reducing this computational cost by gradually sharing the information from adjacent scales in an iterative manner. They hypothesized that by weaving the information iteratively, sufficient multi-scale context information can be transferred and integrated into the current scale.

Figure 6: Evolution of single-stage detectors over the years. Major advancements in chronological order are YOLO [308], SSD [221], DSSD [92], RetinaNet [216], RefineDet [460] and CornerNet [189]. Cuboid boxes, solid rectangular boxes, dashed rectangular boxes and arrows represent convolutional layers, fully connected layers, predictions and flow of features respectively. obj, cls and reg stand for objectness, classification and regression losses. Best viewed in color.

Recently, three strong candidates have emerged for replacing the undying YOLO and SSD variants:

• RetinaNet [216] borrowed the FPN structure but in a single-stage setting. It is similar in spirit to SSD, but it deserves its own paragraph given its growing popularity based on its speed and performance. The main new advance of this pipeline is the focal loss, which we will discuss in Section 2.2.1.

• RefineDet [460] tried to combine the advantages of double-stage and single-stage methods by incorporating two new modules in the classic single-stage architecture. The first one, the ARM (Anchor Refinement Module), is used in the fashion of multiple-stage detectors to reduce the search space and also to iteratively refine the localization of the detections. The ODM (Object Detection Module) takes the output of the ARM to output fine-grained classifications and further improve the localization.

• CornerNet [189] offered a new approach for object detection by predicting bounding boxes as paired top-left and bottom-right keypoints. They also demonstrated that one can get rid of the prominent anchor step while gaining accuracy and precision. They used fully convolutional networks to produce independent score heat maps for both corners for each class, in addition to learning an embedding for each corner. The embedding similarities were then used to group corners into bounding boxes. It beat its two (less original) competing rivals on COCO.

However, most methods used in competitions until now are predominantly double-staged methods, because their structure is better suited for fine-grained classification. This is what we are going to see in the next Section.

2.1.3 Double Stage Detectors

The process of detecting objects can be split into two parts: proposing regions, and classifying and regressing bounding boxes. The purpose of the proposal generator is to present the classifier with class-agnostic rectangular boxes that try to locate the ground-truth instances. The classifier then tries to assign a class to each of the proposals and to further fine-tune the coordinates of the boxes.

Region proposal: Hosang et al. [131] presented an in-depth review of ten "non-data driven" object proposal methods, including Objectness [2, 3], CPMC [30, 31], Endres and Hoiem [81, 82], Selective Search [391, 393], Rahtu et al. [293], Randomized Prim [229], Bing [56], MCG [286], Rantalankila et al. [298], Humayun et al. [145] and EdgeBoxes [491], and evaluated their effect on the detector's performance. Also, Xiao et al. [425] developed a novel distance metric for grouping two super-pixels in high-complexity scenarios. Out of all these approaches, Selective Search and EdgeBoxes gave the best recall and speed. The former is an order of magnitude slower than Fast R-CNN, while the latter, which is not as efficient, took as much time as a detector. The bottleneck lay in the region proposal part of the pipeline.

Deep learning based approaches [86, 363] had also been used to propose regions, but they were not end-to-end trainable for detection and required input images of fixed size. In order to address a strong localization bias, [46] proposed a box-refinement method based on the super-pixel tightness distribution. DeepMask [282] and SharpMask [284] proposed segmentation-based object proposals with very deep networks. [162] estimated the objectness of image patches by comparing them with exemplar regions from prior data and finding the ones most similar to them.

The next obvious question became apparent: how can deep learning methods be streamlined into existing approaches to give an elegant, simple, end-to-end trainable and fully convolutional model? In the discussion that follows we will present two widely adopted approaches in two-stage detectors, the pros and cons of using such approaches, and further improvements made on them.
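Before detailing these two approaches, it may help to make the anchor grid of Figure 5 concrete, since it is shared by the single-stage detectors above and by the RPN of Faster-RCNN described below. The sketch is a generic illustration written for this survey; the scales, ratios and stride are hypothetical and do not reproduce the exact parameterization of any particular paper.

```python
def make_anchors(feature_h, feature_w, stride,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return one (xmin, ymin, xmax, ymax) anchor per (cell, scale, ratio) combination."""
    anchors = []
    for y in range(feature_h):
        for x in range(feature_w):
            # Center of the feature-map cell, mapped back to image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s*s while varying the aspect ratio w/h = r.
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors  # feature_h * feature_w * len(scales) * len(ratios) boxes


# Example: a 16x16 feature map at stride 16 yields 16 * 16 * 9 = 2304 anchors.
print(len(make_anchors(16, 16, 16)))
```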
R-CNN and Fast R-CNN: The first modern, ubiquitous double-staged deep learning detection method is certainly [105]. Although it has now been abandoned due to faster alternatives, it is worth mentioning to better understand the next paragraphs. Closer to the traditional non deep-learning methods, the first stage of the method is the detection of objects in pictures to reduce the number of false positives of the subsequent stage. It is done using a hierarchical pixel grouping algorithm widely popular in the 2000s called selective search [393]. Once the search space has been properly narrowed, all regions above a certain score are warped to a fixed size so that a classifier can be applied on top of it. Further fine-tuning of the last layers of the classifier is necessary on the classes of the dataset used (they replace the last layer so that it has the right number of classes) and an SVM is used on top of the fixed fine-tuned features to further refine the localization of the detection. This method was the precursor of all the modern deep learning double-staged methods, in spite of the fact that this first iteration of the method was far from the elegant paradigm used nowadays. Fast R-CNN (Girshick [104]), from the same author, is built on top of this previous work. The author started to refine R-CNN by being one of the first researchers, together with He et al. [116], to come up with his own deep-learning detection building block. This differentiable mechanism, called RoI-pooling (Region of Interest Pooling), was used for resizing fixed regions (also extracted with selective search) coming not from the image directly but from the feature computed on the full image, which kept the spatial layout of the original image. Not only did that bring a speed-up to the slow R-CNN method (×200 at inference) but it also came with a net gain in performances (around 6 points in mAP).

Faster-RCNN: The seminal Faster-RCNN paper [309] showed that the same backbone architecture used in Fast R-CNN for classification can be used to generate proposals as well. They proposed an efficient, fully convolutional, data driven approach for proposing regions called the Region Proposal Network (RPN). The RPN learned the "objectness" of all instances and accumulated the proposals to be used by the detector part of the backbone. The detector further classified and refined bounding boxes around those proposals. RPN and detector can be trained separately as well as in a combined manner. When sharing convolutional layers with the detector they result in very little extra cost for region proposals. Since it has two parts for generating proposals and detection, it comes under the category of two-stage detectors.

Faster-RCNN used thousands of reference boxes, commonly known as anchors. Anchors formed a grid of boxes that act as starting points for regressing bounding boxes. These anchors were then trained end-to-end to regress to the ground truth and an objectness score was calculated per anchor. The density, size and aspect ratio of anchors are decided according to the general range of sizes of object instances expected in the dataset and the receptive field of the associated neuron in the feature map.

RoI Pooling, introduced in [104], warped the proposals generated by the RPN to fixed size vectors for feeding to the detection sub-network as its inputs. The quantization and rounding operation defining the pooling cells introduced misalignments and actually hurt localization.

R-FCN: To avoid running the costly RoI-wise subnetwork in Faster-RCNN hundreds of times, i.e. once per proposal, Dai et al. [62] got rid of it and shared the convolutional network end to end. To achieve this they proposed the idea of position sensitive feature maps. In this approach each feature map was responsible for outputting scores for a specific part, like top-left, center, bottom-right, etc., of the target class. The parts were identified with RoI-Pooling cells which were distributed alongside each part-specific feature map. Final scores were obtained by average voting every part of the RoI from the respective filter. This implementation trick introduced some more translational variance to structures that were essentially translation-invariant by construction. Translational variance


Figure 7: Graphical explanation of RoIPooling, RoIWarping and RoIAlign (actual dimensions of pooled
feature map differ). The red box is the predicted output of Region Proposal Network (RPN) and the dashed
blue grid is the feature map from which proposals are extracted. (a) RoI Pooling first aligns the proposal to
the feature map (first quantization) and then max-pools or average-pools the features (second quantization).
Note that some information is lost because the quantized proposal is not an integral multiple of the final
map’s dimensions. (b) RoI Warping retains the first quantization but deals with second one through bilinear
interpolation, calculated from four nearest features, through sampling N points (black dots) for each cell
of final map. (c) RoI Align removed both the quantizations by directly sampling N points on the original
proposal. N is set to four generally. Best viewed in color.
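To make the distinction in the caption concrete, the sampling step that separates RoI Align from quantized RoI Pooling can be sketched in a few lines. This is a simplified, single-channel version with one sample at the centre of each output cell, written for illustration; real implementations sample N points per cell on batched tensors and it is not the reference implementation of [118].

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate feature map `feat` (H x W) at continuous (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ay, ax = y - y0, x - x0
    return ((1 - ay) * (1 - ax) * feat[y0, x0] + (1 - ay) * ax * feat[y0, x1]
            + ay * (1 - ax) * feat[y1, x0] + ay * ax * feat[y1, x1])

def roi_align(feat, roi, out_size=7):
    """Pool `roi` = (y1, x1, y2, x2), given in feature-map coordinates,
    to an out_size x out_size map without quantizing the RoI at any step."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bin_h   # continuous sample location,
            cx = x1 + (j + 0.5) * bin_w   # no rounding to integer cells
            out[i, j] = bilinear(feat, cy, cx)
    return out

feature_map = np.random.rand(32, 32)
print(roi_align(feature_map, (3.7, 5.2, 18.9, 21.4)).shape)  # (7, 7)
```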

in object detection can be beneficial for learning localization representations. Although this pipeline seems to be more precise, it is not always better performance-wise than its Faster R-CNN counterpart.

From an engineering point of view, this method of Position sensitive RoI-Pooling (PS Pooling) also prevented the loss of information at the RoI Pooling stage in Faster-RCNN. It improved the overall inference speed of two-stage detectors but performed slightly worse.

Pros and Cons: RPNs are generally configured to generate nearly 300 proposals to get state-of-the-art performances. Since each of the proposals passed through a head of convolutional layers and fully connected layers to classify the objects and fine tune the bounding boxes, it decreased the overall speed. Although they are slow and not suited to real-time applications, the ideas based on these approaches give one the best performances in the challenging COCO detection task. Another drawback is that Ren et al. [309] and Dai et al. [62] used coarse feature maps at a single scale only. This is not sufficient when objects of diverse sizes are present in the dataset.

Further improvements: Many improvements have been suggested on the above methodologies concerning speed, performance and computational efficiency.

DeepBox [184] proposed a light weight generic objectness system by capturing semantic properties. It helped in reducing the burden of localization on the detector as the number of classes increased. Light-head R-CNN [206] proposed a smaller detection head and thin feature maps to speed up two-stage detectors. Singh et al. [346] brought R-FCN to 30 fps by sharing position sensitive feature maps across classes. Using slight architectural changes, they were also able to bring the number of classes predicted by R-FCN to 3000

Figure 8: Evolution of double stage detectors over the years. Major advancements in chronological order are
R-CNN [105], SPPNet [116], Fast-RCNN [104], Faster RCNN [309], RFCN [62], FPN [215], Mask RCNN
[118], Deformable CNN [63] (only the modification is shown and not the entire network). The main idea is
marked in dashed blue rectangle wherever possible. Other information is same as in Figure 6.

without losing too much speed.

Several improvements have been made to RoI-Pooling. The spatial transformer of [154] used a differentiable re-sampling grid using bilinear interpolation and can be used in any detection pipeline. Chen et al. [37] used this for face detection, where faces were warped to fit canonical poses. Dai et al. [61] proposed another type of pooling called RoI Warping based on bilinear interpolation. Ma et al. [228] were the first to introduce a rotated RoI-Pooling working with oriented regions (more on oriented RoI-Pooling can be found in Section 3.1.2). Mask R-CNN [118] proposed RoI Align to address the problem of misalignment in RoI Pooling, which used bilinear interpolation to calculate the value of four regularly sampled locations on each cell. It allowed fixing, to some extent, the alignment between the computed features and the regions they were extracted from. It brought consistent improvements to all Faster R-CNN baselines on COCO. A comparison is shown in Figure 7. Recently, Jiang et al. [158] introduced a Precise RoI Pooling based on interpolating not just 4 spatial locations but a dense region, which allowed full differentiability with no misalignments.

Li et al. [196], Yu et al. [442] also used contextual information and aspect ratios while StuffNet [24] trained for segmenting amorphous categories such as ground and water for the same purpose. Chen and Gupta [48] made use of memory to take advantage of context in detecting objects. Li et al. [207] incorporated a Global Context Module (GCM) to utilize contextual information and Row-Column Max Pooling (RCM Pooling) to better extract scores from the final feature map as compared to the R-FCN method.

Deformable R-FCN [63] brought flexibility to the fixed geometric transformations at the Position sensitive RoI-Pooling stage of R-FCN by learning additional offsets for each spatial sampling location using a different network branch, in addition to other tricks discussed in Section 2.1.5. Lin et al. [215] proposed to use a network with multiple final feature maps with different coarseness to adapt to objects of various sizes. Zagoruyko et al. [446] used skip connections with the same motivation. Mask-RCNN [118], in addition to RoI-Align, added a branch in parallel to the classification and bounding box regression for optimizing the segmentation loss. Additional training for segmentation led to an improvement in the performance of the object detection task as well. The advancement of two stage detectors over the years is illustrated in Figure 8.

The double-staged methods have by now attained supremacy among the best performing object detection DCNNs. However, for certain applications two-stage methods are not enough to get rid of all the false positives.

2.1.4 Cascades

Traditional one-class object detection pipelines resorted to boosting like approaches for improving the performance, where uncorrelated weak classifiers (better than random chance but not too correlated with the true predictions) are combined to form a strong classifier. With modern CNNs, as the classifiers are quite strong, the attractiveness of those methods has plummeted. However, for some specific problems where there are still too many false positives, researchers still find it useful. Furthermore, if the weak CNNs used are very shallow it can also sometimes increase the overall speed of the method.

One of the first ideas that were developed was to cascade multiple CNNs. Li et al. [198] and Yang and Nevatia [434] both used a three-staged approach by chaining three CNNs for face detection. The former approach scanned the image using a 12 × 12 patch CNN to reject 90% of the non-face regions in a coarse manner. The remaining detections were offset by a second CNN and given as input to a 24 × 24 CNN that continued rejecting false positives and refining regressions. The final candidates were then passed on to a 48 × 48 classification network which output the final score. The latter approach created separate score maps for different resolutions using the same FCN on different scales of the test image (image pyramid). These score maps were then up-sampled to the same resolution and added to create a final score map, which was then used to select proposals. Proposals were then passed to the second stage where two different verification CNNs, trained on hard examples, eradicated the remaining false positives, the first one being a four-layer FCN trained from scratch and the second one an AlexNet [181] pre-trained on ImageNet.
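The control flow shared by these chained-CNN detectors is simple: each stage scores the surviving candidate windows and discards the ones falling below its threshold, so that later, more expensive stages only see a small fraction of the image. The following is a minimal, framework-agnostic sketch; the stage classifiers and thresholds are placeholders, not the exact models of [198] or [434].

```python
def cascade_detect(windows, stages):
    """`windows`: list of candidate regions; `stages`: list of (classifier, threshold)
    pairs ordered from cheapest/shallowest to most expensive."""
    surviving = list(windows)
    for classifier, threshold in stages:
        scored = [(w, classifier(w)) for w in surviving]
        # keep only windows the current stage considers promising
        surviving = [w for w, score in scored if score >= threshold]
        if not surviving:
            break
    return surviving

# Toy stages: the score is a fake "faceness" value attached to each window.
windows = [{"box": (0, 0, 12, 12), "s": 0.9}, {"box": (5, 5, 17, 17), "s": 0.2}]
stages = [(lambda w: w["s"], 0.3),          # cheap, coarse rejection stage
          (lambda w: w["s"] * 0.95, 0.5)]   # stricter verification stage
print(cascade_detect(windows, stages))      # only the first window survives
```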

All the approaches mentioned above are ad hoc: the CNNs are independent of each other and there is no overall design, therefore they could benefit from integrating the elegant zooming module that is the RoI-Pooling. The RoI-Pooling can act like a glue to pass the detections from one network to the other, while doing the down-sampling operation locally. Dai et al. [61] used a Mask R-CNN like structure that first proposed bounding boxes, then predicted a mask and used a third stage to perform fine grained discrimination on masked regions that are RoI-Pooled a second time.

Ouyang et al. [268], Wang et al. [404] optimized in an end-to-end manner a Faster R-CNN with multiple stages of RoI-Pooling. Each stage accepted only the highest scored proposals from the previous stage and added more context and/or localized the detection better. Then additional information about context was used to do fine grained discrimination between hard negatives and true positives in [268], for example. On the contrary, Zhang et al. [455] showed that for pedestrian detection RoI-Pooling too coarse a feature map actually hurts the result. This problem has been alleviated by the use of feature pyramid networks with higher resolution feature maps. Therefore, they used the RPN proposals of a Faster R-CNN in a boosting pipeline involving a forest (Tang et al. [373] acted similarly for small vehicle detection).

Yang et al. [431], aware of the problem raised by Zhang et al. [455], used RoI-Pooling on multiple scaled feature maps of all the layers of the network. The classification function on each layer was learned using the weak classifiers of AdaBoost and then approximated using a fully connected neural network. While all the mentioned pipelines are hard cascades where the different classifiers are independent, it is sometimes possible to use a soft cascade where the final score is a linear weighted combination of the scores given by the different weak classifiers, like in Angelova et al. [6]. They used 200 stages (instead of 2000 stages in their baseline with AdaBoost [16]) to keep recall high enough while improving precision. To save computations that would be otherwise unmanageable, they terminated the computation of the weighted sum whenever the score for a certain number of classifiers fell under a specified threshold (there are, therefore, as many thresholds to learn as there are classifiers). These thresholds are then really important because they control the trade-off between speed, recall and precision.

All the previous works in this Section involved a small fixed number of localization refinement steps, which might cause proposals to not be perfectly aligned with the ground truth, which in turn might impact the accuracy. That is why a lot of work proposed iterative bounding box regression (a while loop on localization refinement until a condition is reached). Najibi et al. [254], Rajaram et al. [295] started with a regularly spaced grid of sparse pyramid boxes (only 200 non-overlapping in Najibi et al. [254] whereas Rajaram et al. [295] used all Faster R-CNN anchors on the grid) that were iteratively pushed towards the ground truth according to the feature representation obtained from RoI-Pooling the current region. An interesting finding was that, even if the goal was to use as many refinement steps as necessary, if the seed boxes or anchors span the space appropriately, regressing the boxes only twice can in fact be sufficient [254]. Approaches proposed by Gidaris and Komodakis [101] and Li et al. [199] can also be viewed, internally, as iterative regression based methods proposing regions for detectors such as Fast R-CNN.

Recently, Cai and Vasconcelos [27] noticed that when increasing the IoU threshold for a window to be considered positive in the training (to get better quality hypotheses for the next stages), one loses a lot of positive windows. Thus one has to keep using the low 0.5 threshold to prevent over-fitting and thus one gets bad quality hypotheses in the next stages. This is true for all the works mentioned in this section that are based on Faster R-CNN (e.g. [268, 404]). To combat this effect, they slowly increase the IoU threshold over the stages to get different sets of detectors, using the latest stage proposals as input distribution for the next one. With only 3 to 4 stages they consistently improve the quality of a wide range of detectors with an average of 3 points gained w.r.t. the non-cascaded version. This algorithmic advance is used in most of the winning entries of the 2018 COCO challenge (used at least by the first three teams).
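The core of this re-sampling scheme fits in a few lines: each stage refines the boxes produced by the previous one and re-labels them against the ground truth with a stricter IoU threshold before training the next head. The sketch below is schematic only; the 0.5/0.6/0.7 thresholds follow the spirit of [27], and `nudge` is a hypothetical stand-in for a learned per-stage regression.

```python
def box_iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def train_cascade(boxes, gts, thresholds=(0.5, 0.6, 0.7)):
    """For each stage, relabel the current boxes with a stricter IoU threshold."""
    def nudge(box, gt, step=0.5):
        # stand-in for the learned regression pulling the box towards the ground truth
        return tuple(b + step * (g - b) for b, g in zip(box, gt))
    stage_sets, current = [], list(boxes)
    for t in thresholds:
        labelled = [(b, max(box_iou(b, g) for g in gts) >= t) for b in current]
        stage_sets.append(labelled)
        # the regressed boxes of this stage feed the next, better-aligned stage
        current = [nudge(b, max(gts, key=lambda g: box_iou(b, g))) for b in current]
    return stage_sets

gts = [(10, 10, 50, 50)]
proposals = [(14, 6, 44, 46), (30, 30, 90, 90)]
for i, s in enumerate(train_cascade(proposals, gts)):
    print("stage", i, s)
```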

Orthogonal to this approach, Jiang et al. [158] framed the regression of the multi-stage cascade as an optimization problem, thus introducing a proxy for a smooth measure of confidence of the bounding box localization. This article, among others, will be discussed in more detail in Section 2.2.1.

Boosting and multistage (> 2) methods, as we have seen previously, exhibit very different possible combinations of DCNNs. But we thought it would be interesting to still have a Section for a special kind of method that was hinted at in the previous Sections, namely the part-based models, if not for their performances at least for their historical importance.

2.1.5 Parts-Based Models

Before the reign of CNN methods, the algorithms based on the Deformable Parts-based Model (DPM) and HoG features used to win all the object detection competitions. In this algorithm latent (not supervised) object parts were discovered for each class and optimized by minimizing the deformations of the full objects (connections were modeled by spring forces). The whole thing was built on a HoG image pyramid.

When region based DCNNs started to beat the former champion, researchers began to wonder if it was only a matter of using better features. If this was the case then the region based approach would not necessarily be a more powerful algorithm. The DPM was flexible enough to integrate the newer, more discriminative CNN features. Therefore, some research works focused on this research direction.

In 2014, Savalle and Tsogkas [325] tried to get the best of both worlds: they replaced the HoG feature pyramids used in the DPM with the CNN layers. Surprisingly, the performance they obtained, even if far superior to the DPM+HoG baseline, was considerably worse than the R-CNN method. The authors suspected the reason for it was the fixed size aspect ratios used in the DPM together with the training strategy. Girshick et al. [106] put more thought on how to mix CNN and DPM by coming up with the distance transform pooling, thus bringing the new DPM (DeepPyramidDPM) to the level of R-CNN (even slightly better). Ranjan et al. [297] built on it and introduced a normalization layer that forced each scale-specific feature map to have the same activation intensities. They also implemented a new procedure of sampling optimal targets by using the closest root filter in the pyramid in terms of dimensions. This allowed them to further mimic the HOG-DPM strengths. Simultaneously, Wan et al. [401] also improved the DeepPyramidDPM but fell short compared to the newest version of R-CNN, fine-tuned (R-CNN FT). Therefore, in 2015 it seemed that the DPM based approaches had hit a dead end and that the community should focus on R-CNN type methods.

However, the flexibility of the RoI-Pooling of Fast R-CNN was going to help make the two approaches come together. Ouyang et al. [267] combined Fast R-CNN, to get rid of most backgrounds, and a DeepID-Net, which introduced a max-pooling penalized by the deformation of the parts called def-pooling. The combination improved over the state-of-the-art. As we mentioned in Section 2.1.3, Dai et al. [63] built on R-FCN and added deformations in the Position Sensitive RoI-Pooling: an offset is learned from the classical Position Sensitive pooled tensor with a fully connected network for each cell of the RoI-Pooling, thus creating "parts" like features. This trick of moving RoI cells around is also present in [247], although slightly different because it is closer to the original DPM. Dai et al. [63] even added offsets to convolutional filter cells on Conv-5, which became doable thanks to bilinear interpolation. It thus became a truly deformable fully convolutional

network. However, Mordan et al. [247] got better performances on VOC without it. Several works used deformable R-FCN, like [429] for aerial imagery, which used a different training strategy. However, even if it is still present in famous competitions like COCO, it is less used than its counterparts with fixed RoI-Pooling. It might come back though, thanks to recent best performing models like [345] that used [63] as their baseline and selectively back-propagated gradients according to the object size.

2.2 Model Training

The next important aspect of the detection model's design is the losses being used to converge the huge number of weights and the hyper-parameters that must be conducive to this convergence. Optimizing for a wrongfully crafted loss may actually lead the model to diverge instead. Choosing incorrect hyper-parameters can, on the one hand, stagnate the model or trap it in a local optimum or, on the other hand, over-fit the training data (causing poor generalization). Since DCNNs are mostly trained with mini-batch SGD (see for instance [190]), we focus the following discussion on losses and on the optimization tricks necessary to attain convergence. We also review the contribution of pre-training on some other dataset and data augmentation techniques, which bring about an excellent initialization point and good generalization respectively.

2.2.1 Losses

Multi-variate cross entropy loss, or log loss, is generally used throughout the literature to classify images or regions in the context of detectors. However, detecting objects in large images comes with its own set of specific challenges: regressing bounding boxes to get precise localization, a hard problem that is not present at all in classification, and an imbalance between target object regions and background regions.

A binary cross entropy loss is formulated as shown in Eq. 1. It is used for learning the combined objectness. All instances, y, are marked as positive labels with a value one. This equation constrains the network to output the predicted confidence score, p, to be 1 if it thinks there is an object and 0 otherwise.

\[
CE(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases} \tag{1}
\]

A multi-variate version of the log loss is used for classification (Eq. 2). p_{o,c} predicts the probability of observation o being class c where c ∈ {1, .., C}. y_{o,c} is 1 if observation o belongs to class c and 0 otherwise. c = 0 accounts for the special case of the background class.

\[
MCE(p, y) = -\sum_{c=0}^{C} y_{o,c}\,\log(p_{o,c}) \tag{2}
\]

Fast-RCNN [104] used a multitask loss (Eq. 3) which is the de-facto equation used for classifying as well as regressing. The losses are summed over all the region proposals or default reference boxes, i. The ground-truth label, p*_i, is 1 if the proposal box is positive, otherwise 0. Regression is learned only for positive proposal boxes.

\[
L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*) \tag{3}
\]

where t_i is a vector representing the 4 coordinates of the predicted bounding box and similarly t*_i represents the 4 coordinates of the ground truth. Eq. 4 presents the equation for the exact parameterized coordinates. {x_a, y_a, w_a, h_a} are the center x and y coordinates, width and height of the default anchor box respectively. Similarly {x*, y*, w*, h*} are the ground truths and {x, y, w, h} are the coordinates to be predicted. The two terms are normalized by the mini-batch size, N_{cls}, and the number of proposals/default reference boxes, N_{reg}, and weighted by a balancing parameter λ.

\[
t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a}
\]
\[
t_x^* = \frac{x^* - x_a}{w_a}, \quad t_y^* = \frac{y^* - y_a}{h_a}, \quad t_w^* = \log\frac{w^*}{w_a}, \quad t_h^* = \log\frac{h^*}{h_a} \tag{4}
\]

L_{reg} is a smooth L1 loss defined by Eq. 5. In its place some papers also use L2 losses.

\[
l_{reg}(t, t^*) = \begin{cases} 0.5\,(t - t^*)^2 & \text{if } |t - t^*| < 1 \\ |t - t^*| - 0.5 & \text{otherwise} \end{cases} \tag{5}
\]
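To show how these pieces combine in practice, the following NumPy sketch puts together the binary objectness of Eq. 1, the multitask combination of Eq. 3 and the smooth L1 of Eq. 5 on the parameterized offsets of Eq. 4. It is illustrative only: λ = 1, the shared normalizer and the toy values are our own simplifications, not the exact settings of [104] or [309].

```python
import numpy as np

def binary_ce(p, y):                       # Eq. 1
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

def smooth_l1(t, t_star):                  # Eq. 5, summed over the 4 coordinates
    d = np.abs(t - t_star)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def multitask_loss(probs, labels, t, t_star, lam=1.0):   # Eq. 3 (simplified norms)
    n = len(probs)
    cls = sum(binary_ce(p, y) for p, y in zip(probs, labels)) / n
    reg = sum(y * smooth_l1(ti, ti_star)                  # regression on positives only
              for y, ti, ti_star in zip(labels, t, t_star)) / n
    return cls + lam * reg

probs  = [0.9, 0.2]                                        # predicted objectness per anchor
labels = [1, 0]                                            # p*_i
t      = [np.array([0.1, 0.0, 0.2, -0.1]), np.zeros(4)]    # predicted offsets (Eq. 4)
t_star = [np.array([0.0, 0.0, 0.0,  0.0]), np.zeros(4)]    # ground-truth offsets
print(multitask_loss(probs, labels, t, t_star))
```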

Losses for regressing bounding boxes: Since accurate localization is a major issue, papers have suggested more sophisticated localization losses. [103] came up with a binary logistic type regression loss. After dividing the image patch into M columns and M rows, they computed the probability of each row and column being inside or outside the predicted observation box (in-out loss) (Eq. 6).

\[
L_{in\text{-}out} = -\sum_{a \in \{x,y\}}\sum_{i=1}^{M} T_{a,i}\,\log(p_{a,i}) + (1 - T_{a,i})\,\log(1 - p_{a,i})
\]
\[
\forall i \in \{1,\dots,M\},\; T_{x,i} = \begin{cases} 1 & \text{if } B_l^{gt} \le i \le B_r^{gt} \\ 0 & \text{otherwise} \end{cases}, \qquad T_{y,i} = \begin{cases} 1 & \text{if } B_t^{gt} \le i \le B_b^{gt} \\ 0 & \text{otherwise} \end{cases} \tag{6}
\]

where {B_l^{gt}, B_r^{gt}, B_t^{gt}, B_b^{gt}} are the left, right, top and bottom edges of the bounding box respectively. T_x and T_y are the binary positive or negative values for rows and columns respectively, and p is the predicted probability associated with each of them. In addition, they also compute the confidence of each column and row being the exact boundary of the predicted observation or not (Eq. 7).

\[
L_{border} = -\sum_{s \in \{l,r,t,b\}}\sum_{i=1}^{M} \lambda^{+}\, T_{s,i}\,\log(p_{s,i}) + \lambda^{-}\,(1 - T_{s,i})\,\log(1 - p_{s,i})
\]
\[
\forall i \in \{1,\dots,M\},\; T_{s,i} = \begin{cases} 1 & \text{if } i = B_s^{gt} \\ 0 & \text{otherwise} \end{cases} \tag{7}
\]

where \(\lambda^{-} = 0.5\,\frac{M}{M-1}\) and \(\lambda^{+} = (M-1)\,\lambda^{-}\). The notations can be inferred from Eq. 6. A second paper related to the same topic [101] applied the regression losses iteratively at the region proposal stage in a class agnostic manner. They used the final convolutional features and the predictions from the last iteration to further refine the proposals.

It was also found to be beneficial to optimize the loss directly over the Intersection over Union (IoU), which is the standard practice to evaluate a bounding box or segmentation algorithm. Yu et al. [441] presented Eq. 8 for the regression loss.

\[
L_{unit\text{-}box} = -\ln(IoU(gt, pred)) \tag{8}
\]

The terms are self-explanatory. Jiang et al. [158] also learned to predict the IoU between the predicted box and the ground truth. They made a case to use localization confidence instead of classification confidence to suppress boxes at the NMS stage. It gave higher recall on the MS COCO dataset. This loss is however very unstable and has a number of regions where the IoU has zero gradient and thus it is undefined. Tychsen-Smith and Petersson [390] adapt this loss to make it more stable by adding hard bounds, which prevent the function from diverging. They also factorize the score function by adding a fitness term representing the IoU of the box w.r.t. the ground truth.
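A sketch of this IoU-based objective follows (the IoU helper is repeated from the earlier sketches so the snippet stands alone). The epsilon clamp on non-overlapping boxes is our own guard for illustration, not part of [441]; it corresponds exactly to the zero-gradient region mentioned above.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def unit_box_loss(pred, gt, eps=1e-6):     # Eq. 8: -ln(IoU)
    return -np.log(max(iou(pred, gt), eps))

print(unit_box_loss([10, 10, 50, 50], [12, 8, 48, 52]))   # small loss, good overlap
print(unit_box_loss([0, 0, 5, 5],     [60, 60, 90, 90]))  # capped large loss, no overlap
```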

Losses for class imbalance: Since in recent detectors there are a lot of anchors which most of the time cover background, there is a class imbalance between positive and negative anchors. One alternative is Online Hard Example Mining (OHEM): Shrivastava et al. [337] proposed to select only the worst performing examples (so-called hard examples) for calculating gradients. Fixing the ratio between positive and negative instances, generally 1:3, can also partly solve this imbalance. Lin et al. [216] proposed a tweak to the cross entropy loss, called focal loss, which took into account all the anchors but penalized easy examples less and hard examples more. Focal loss (Eq. 9) was found to increase the performance by 3.2 mAP points on MS COCO, in comparison to OHEM, on a ResNet-50-FPN backbone and 600 pixel image scale.

\[
FL(p, y) = \begin{cases} -\alpha_t\,(1 - p)^{\gamma}\,\log(p) & \text{if } y = 1 \\ -\alpha_t\, p^{\gamma}\,\log(1 - p) & \text{otherwise} \end{cases} \tag{9}
\]

One can also adopt simpler strategies like rebalancing the cross-entropy by putting more weight on the minority class [259].

Supplementary losses: In addition to classification and regression losses, some papers also optimized extra losses in parallel. Dai et al. [61] proposed a three-stage cascade for differentiating instances, estimating masks and categorizing objects. Because of this they achieved competitive performance on the object detection task too. They further experimented with a five-stage cascade also. UberNet [173] trained on as many as six other tasks in parallel with object detection. He et al. [118] have shown that using an additional segmentation loss by adding an extra branch to the Faster R-CNN detection sub-network can also improve detection performance. Li et al. [203] introduced position-sensitive inside/outside score maps to train for detection and segmentation simultaneously. Wang et al. [409] proposed an additional repulsion loss between predicted bounding boxes in order to have one final prediction per ground truth. Generally, it can be observed that instance segmentation, in particular, aids the object detection task.

2.2.2 Hyper-Parameters

The detection problem is a highly non-convex problem in hundreds of thousands of dimensions. Even for classification, where the structure is much simpler, no general strategy has emerged yet on how to use mini-batch gradient descent correctly. Different popular versions of mini-batch Stochastic Gradient Descent (SGD) [318] have been proposed based on a combination of momentum, to accelerate convergence, and using the history of the past gradients, to dampen the oscillations when reaching a minimum: AdaDelta [447], RMSProp [378] and the unavoidable ADAM [171, 304] are only the most well-known. However, in the object detection literature authors use either plain SGD or ADAM, without putting too much thought into it. The most important hyper-parameters remain the learning rate and the batch size.

Learning rate: There is no concrete way to decide the learning rate policy over the period of the training. It depends on a myriad of factors like the optimizer, the number of training examples, the model, the batch size, etc. We cannot quantify the effect of each factor; therefore, the current way to determine the policy is by trial and error. What works for one setting may or may not work for other settings. If the policy is incorrect then the model might fail to converge at all. Nevertheless, some papers have studied it and have established general guidelines that have been found to work better than others. A large learning rate might never converge while a small learning rate gives sub-optimal results. Since in the initial stage of training the change in weights is dramatic, Goyal et al. [111] have proposed a Linear Gradual Warmup strategy in which the learning rate is increased every iteration during this period. Then, starting from a point (e.g. 10⁻³), the policy was to decrease the learning rate over many epochs. Krizhevsky [180] and Goyal et al. [111] also used a Linear Scaling Rule which linearly scaled the learning rate according to the mini-batch size.

Batch size: The object detection literature doesn't generally focus on the effects of using a bigger or smaller batch size during training. Training modern detectors requires working on full images and therefore on large tensors, which can be troublesome to store on the GPU RAM.
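The warmup and linear-scaling policies mentioned above fit in a few lines. The base values below (0.02 for a batch of 16, 500 warmup iterations, step milestones) are illustrative defaults, not prescriptions taken from [111].

```python
def learning_rate(iteration, batch_size, base_lr=0.02, base_batch=16,
                  warmup_iters=500, milestones=(60_000, 80_000), gamma=0.1):
    # Linear Scaling Rule: lr grows proportionally to the mini-batch size
    lr = base_lr * batch_size / base_batch
    # Linear Gradual Warmup: ramp up from a small value during the first iterations
    if iteration < warmup_iters:
        lr *= (iteration + 1) / warmup_iters
    # afterwards, the usual step decay over many epochs
    for m in milestones:
        if iteration >= m:
            lr *= gamma
    return lr

for it in (0, 250, 499, 10_000, 70_000):
    print(it, round(learning_rate(it, batch_size=32), 5))
```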

This constraint has forced the community to use small batches, of 1 to 16 images, for training (16 in RetinaNet [216] and Mask R-CNN [118] with the latest GPUs).

One obvious advantage of increasing the batch size is that it reduces the training time, but since the memory constraint restricts the number of images, more GPUs have to be employed. However, using extra large batches has been shown to potentially lead to big improvements in performance or speed. For instance, batch normalization [152] needs many images to provide meaningful statistics. Originally, batch size effects were studied by [111] on the ImageNet dataset. They were able to show that by increasing the batch size from 256 to 8192, training time can be reduced from 29 hours to just 1 hour while maintaining the same accuracy. Further, You et al. [437] and Akiba et al. [1] brought down the training time to below 15 minutes by increasing the batch size to 32k.

Very recently, MegDet [274], inspired by [111], has shown that averaging gradients on many GPUs to get an equivalent batch size of 256 and adjusting the learning rates could lead to some performance gains. It is hard to say now which strategy will eventually win in the long term but they have shown that it is worth exploring.

2.2.3 Pre-Training

Transfer learning was first shown to be useful in a supervised learning approach by Girshick et al. [105]. The idea is to fine-tune from a model already trained on a dataset that is similar to the target dataset. This is usually a better starting point for training instead of randomly initializing weights, e.g. a model pre-trained on ImageNet being used for training on MS COCO. And since the COCO dataset's classes are a superset of PASCAL VOC's classes, most of the state-of-the-art approaches pre-train on COCO before training on PASCAL VOC. If the dataset at hand is completely unrelated to the dataset used for pre-training, it might not be useful, e.g. a model pre-trained on ImageNet being used for detecting cars in aerial images.

Singh and Davis [345] made a compelling case for minimizing the difference in scales of object instances between the classification dataset used for pre-training and the detection dataset, to minimize domain shift while fine-tuning. They asked: should we pre-train CNNs on a low resolution classification dataset, or restrict the scale of object instances to a range by training on an image pyramid? By minimizing scale variance they were able to get better results than the methods that employed a scale invariant detector. The problem with the second approach is that some instances are so small that, in order to bring them into the scale range, the images have to be up-scaled so much that they might not fit in memory or they will not be used for training at all. Using a pyramid of images and using each for inference is also slower than methods that use the input image exactly once.

Section 3.1.3 covers pre-training and other aspects of it, like fine-tuning and beyond, in great detail. There are only, to the best of our knowledge, two articles that tried to match the performances of ImageNet pre-training by training detectors from scratch: the first one being [331], which used deep supervision (dense access to gradients for all layers), and very recently [332], which adaptively recalibrated supervision intensities based on input object sizes.

2.2.4 Data Augmentation

The aim of augmenting the training set images is to create diversity, avoid overfitting, increase the amount of data, improve generalizability and overcome different kinds of variances. This can easily be achieved without any extra annotation effort by manually designing many augmentation strategies. The general practices that are followed include, and are not limited to, scale, resize, translation, rotation, vertical and horizontal flipping, elastic distortions, random cropping and contrast, color, hue, brightness, saturation and sharpness adjustments, etc. Two recent and promising but not widely adopted techniques are Cutout [70] and Sample Pairing [149]. Taylor and Nitschke [376] benchmarked various popular data augmentation schemes to know the ones that are most appropriate, and found out that cropping was the most influential in their case.

Although there are many techniques available and each one of them is easy to implement, it is difficult to know in advance, without expert knowledge, which techniques assist the performance for a target dataset. For example, vertical flipping in the case of a traffic signs dataset is not helpful because one is not likely to encounter inverted signs in the test set. It is not trivial to select the approaches for each target dataset and test all of them before deploying a model. Therefore, Cubuk et al. [60] proposed a search algorithm based on reinforcement learning to find the best augmentation policy. Their approach tried to find the best suitable augmentation operations along with their magnitude and probability of happening. Smart Augmentation [194] worked by creating a network that learned how to automatically generate augmented data during the training process of a target network in a way that reduced the loss. Tran et al. [382] proposed a Bayesian approach, where new annotated training points are treated as missing variables and generated based on the distribution learned from the training set. Devries and Taylor [69] applied simple transformations such as adding noise, interpolating, or extrapolating between data points. They performed the transformation not in input space but in a learned feature space. All the above approaches are implemented in the domain of classification only, but they might be beneficial for the detection task as well and it would be interesting to test them.

Generative adversarial networks (GANs) have also been used to generate the augmented data directly for classification, without searching for the best policies explicitly [7, 251, 280, 348]. Ratner et al. [300] used GANs to describe data augmentation strategies. GAN approaches may not be as effective for detection scenarios yet, because generating an image with many object instances placed in a relevant background is much more difficult than generating an image with just one dominant object. This is also an interesting problem which might be addressed in the near future and is explored in Section 3.2.2.

Figure 9: Different kinds of data augmentation techniques used to improve the generalization of the network (original, resize, scale, vertical flip, horizontal flip, shear, rotation, elastic distortions, lighting, greyscale, noise, sample pairing, cut-outs). Best viewed in color.
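For detection, the catch is that the geometric transformations listed above must also be applied to the box annotations. Below is a minimal sketch of a horizontal flip and a random crop that keep the boxes consistent, in plain NumPy and for illustration only; real pipelines also filter boxes that fall entirely outside the crop.

```python
import numpy as np

def hflip(image, boxes):
    """Flip image (H x W x C) and boxes [x1, y1, x2, y2] horizontally."""
    h, w = image.shape[:2]
    flipped = image[:, ::-1]
    new_boxes = [[w - x2, y1, w - x1, y2] for x1, y1, x2, y2 in boxes]
    return flipped, new_boxes

def random_crop(image, boxes, crop_frac=0.8, rng=np.random.default_rng(0)):
    """Crop a random window and shift the boxes into its coordinate frame."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    cropped = image[top:top + ch, left:left + cw]
    new_boxes = [[max(x1 - left, 0), max(y1 - top, 0),
                  min(x2 - left, cw), min(y2 - top, ch)]
                 for x1, y1, x2, y2 in boxes]
    return cropped, new_boxes

img = np.zeros((100, 200, 3))
print(hflip(img, [[10, 20, 60, 80]])[1])        # [[140, 20, 190, 80]]
print(random_crop(img, [[10, 20, 60, 80]])[1])
```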
Predictions of the detector (a) After NMS (b) After Soft-NMS

Figure 10: An illustration of the inference stage. In this example, bounding boxes around horses (blue) and
persons (pink) are obtained from the detector (along with the confidence scores mentioned on top of each
box). (a) NMS chooses the most confident box and suppresses all other boxes having an IoU greater than a
threshold. Note, it sometimes leads to suppression of boxes around other occluded objects. (b) Soft-NMS
deals with this situation by reducing the confidence scores of boxes instead of completely suppressing them.
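A compact sketch of both procedures from the caption, greedy NMS with a hard IoU threshold and the linear variant of Soft-NMS that merely decays the scores, follows. It is written for illustration (the same IoU helper as before is repeated so the snippet stands alone) and is not taken from the authors' released code.

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, nt=0.5, soft=False, min_score=0.05):
    """Greedy NMS (soft=False) or linear Soft-NMS (soft=True)."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        best = int(np.argmax(scores))
        keep.append((boxes[best], scores[best]))
        ref = boxes.pop(best); scores.pop(best)
        next_boxes, next_scores = [], []
        for b, s in zip(boxes, scores):
            o = iou(ref, b)
            if soft:
                s = s * (1 - o) if o > nt else s   # decay instead of suppressing
            elif o > nt:
                continue                           # hard suppression
            if s > min_score:
                next_boxes.append(b); next_scores.append(s)
        boxes, scores = next_boxes, next_scores
    return keep

dets = [[10, 10, 50, 50], [12, 12, 52, 52], [60, 60, 100, 100]]
print(nms(dets, [0.9, 0.85, 0.8], soft=False))
print(nms(dets, [0.9, 0.85, 0.8], soft=True))
```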

2.3 Inference

The behavior of the modern detectors is to pick up pixels of target objects, propose as many windows as possible surrounding those pixels and estimate confidence scores for each of the windows. It does not aim to suggest exactly one box per object. Since all the reference boxes act independently during test time and similar input pixels are picked up by many neighboring anchors, each positive prediction in the prediction set highly overlaps with other boxes. If the best ones out of these are not selected, it will lead to many double detections and thus false positives. The ideal result would be to predict exactly one prediction box per ground-truth object that has a high overlap with it. To get near this ideal state, some sort of post-processing needs to be done.

Greedy Non-maximum suppression (NMS) [64] is the most frequent technique used by inference modules to suppress double detections through hard thresholding. In this approach, the prediction box with the highest confidence was chosen and all the boxes having an Intersection over Union (IoU) higher than a threshold, Nt, were suppressed or rescored to zero. This step was applied iteratively till all the boxes were covered. Because of its nature there is no single threshold that works best for all datasets. Datasets with just one object per image will trivially apply NMS by choosing only the highest-ranking box. Generally, datasets with sparse and fewer numbers of objects per image (2 to 3 objects) require a lower threshold, while datasets with cramped and higher numbers of objects per image (7 and above) give better results with a higher threshold. The problems that arose with this naive and hand-crafted approach were that it may completely suppress nearby or occluded true positive detections, that it chooses the top scoring box which might not be the best localized one, and its inability to suppress false positives with insufficient overlap.

To improve upon it, many approaches have been proposed, but most of them work for a very special case such as pedestrians in highly occluded scenarios. We discuss the various directions they take and the approaches that work better than greedy NMS in the general scenario. Most of the following discussion is based on [132] and [21], who, in their papers, provided us with an in-depth view of the alternate strategies being used.

Many clustering approaches for predicted boxes have been proposed. A few of them are mean shift clustering [64, 413], agglomerative clustering [22], affinity propagation clustering [250], heuristic variants [327], etc. Rothe et al. [315] presented a learn-

ing based method which "passes messages between windows" or clustered the final detections to finally select exemplars from each cluster. Mrowca et al. [250] deployed a multi-class version of this paper. Clustering formulations with globally optimal solutions have been proposed in [371]. All of them worked for special cases but are generally less consistent than greedy NMS.

Some papers learn NMS in a convolutional network. Henderson and Ferrari [121] and Wan et al. [401] tried to incorporate the NMS procedure at training time. Stewart et al. [357] generated a sparse set of detections by training an LSTM. Hosang et al. [130] trained the network to find different optimal cutoff thresholds (Nt) locally. Hosang et al. [132] took one step further and got rid of the NMS step completely by taking into account double detections in the loss and jointly processing neighboring detections. The former inclined the network to predict one detection per object and the latter provided the information if an object was detected multiple times. Their approach worked better than greedy NMS and they obtained a performance gain of 0.8 mAP on the COCO dataset.

Most recently, greedy NMS was improved upon by Bodla et al. [21]. Instead of setting the score of neighboring detections to zero, they decreased the detection confidence as an increasing function of overlap. It improved the performance by 1.1 mAP on the COCO dataset. There was no extra training required and, since it is hand-crafted, it can be easily integrated in an object detection pipeline. It is used in current top entries of the MS COCO object detection challenge.

Jiang et al. [158] performed NMS based on a separately predicted localization confidence instead of the usually accepted classification confidence. Other papers rescored detections locally [38, 386] or globally [397]. Some others detected objects in pairs in order to handle occlusions [266, 321, 370]. Rodriguez et al. [312] made use of the crowd density. Quadratic unconstrained binary optimization (QUBO) [317] used detection scores as a unary potential and overlap between detections as a pairwise potential to obtain the optimal subset of detection boxes. Niepert et al. [257] saw overlapping windows as edges in a graph.

As a bonus, in the end, we also throw some light on the inference "tricks" that are generally known to the experts participating in the competitions. The tricks that are used to further improve the evaluation metrics are: doing multi-scale inference on an image pyramid (see Section 3.1.1 for training); doing inference on the original image and on its horizontal flip (or on different rotated versions of the image if the application domain does not have a fixed direction) and aggregating results with NMS; doing bounding box voting as in [102] using the score of each box as its weight; using heavy backbones, as observed in the backbone section; and finally, averaging the predictions of different models in ensembles. For the last trick, it is often better not to necessarily use the top-N best performing models but to prefer instead uncorrelated models, so that they can correct each other's weaknesses. Ensembles of models outperform single models, often by a large margin, and one can average as many as a dozen models to outrank one's competitors. Furthermore, with DCNNs one generally does not need to put too much thought into normalizing the models as each one gives bounded probabilities (because of the softmax operator in the last layer).

2.4 Concluding Remarks

This concludes a general overview of the landscape of mainstream object detection halfway through 2018. Although the methods presented are all different, it has been shown that in fact most papers have converged towards the same crucial design choices. All pipelines are now fully convolutional, which brings structure (regularization), simplicity, speed and elegance to the detectors. The anchors mechanism of Ren et al. [309] has now also been widely adopted and has not really been challenged yet, although iteratively regressing a set of seed boxes shows some promise [101, 254]. The need to use multi-scale information from different layers of the CNN is now apparent [174, 215, 216]. The RoI-Pooling module and its cousins can also be cited as one of the main architectural advances of recent years but might not ultimately be used

Figure 11: An illustration of challenges in object detection. To detect all instances of the class ”fork”
(yellow bounding boxes) from the COCO dataset [214], a detector should be able to handle small objects
(lower middle picture) as well as big objects (third column photograph). It needs to be scale invariant
as well as being rotation invariant (all forks have different orientation in the pictures). It should also
manage occlusions as in the left-hand side photograph. After being trained on the pictures in the first
three columns, detection algorithms are expected to generalize to the ”cartoon” image on the right (domain
adaptation).

by future works.

With that said, most of the research being done now in mainstream object recognition consists of inventing new ways of passing the information through the different layers or coming up with different kinds of losses or parametrizations [103, 441]. There is a small paradox now in the fact that, even if man-made features are now absent from most modern detectors, more and more research is being done on how to better hand-craft the CNN architectures and modules.

3 Going Forward in Object Detection

While we demonstrated that object detection has already been turned upside-down by CNN architectures and that nowadays most methods revolve around the same architectural ideas, the field has not yet reached a status quo, far from it. Completely new ideas and paradigms are being developed and explored as we write this survey, shaping the future of object detection. This section lists the major challenges that remain mostly unsolved and the attempts to get around them using such ideas. To have an idea of the number of papers being published targeting each challenge, we ran a corresponding query on the advanced search of Google Scholar. The exact query is mentioned below each figure respectively (see Figures 12, 13, 14, 15 and 16). We report these numbers from year 2011 to 2018. We note that this method doesn't give the exact number of papers targeting each challenge but still gives us a rough idea of the interest of the community in each challenge. We couldn't use this for the localization challenge because almost all object detection papers mention localization even if they are not targeting to solve it.

3.1 Major Challenges

There are some walls that the current models cannot overcome without heavy structural changes; we list these challenges in Figure 11.

Often, when we hear that object recognition is solved, we argue that the existence of these walls is solid proof that it is not. Although we have advanced the field, we cannot rely indefinitely on the current DCNNs. This section shows how the recent literature addressed these topics.
and the attempts to get around them using such recent literature addressed these topics.

Figure 12: Number of papers published each year for the challenge of scale variance. Query used in Google Scholar: ("scale variance" OR "scale invariance" OR "scale invariant") AND "object detection".

3.1.1 Scale Variance

In the past three years a lot of approaches have been proposed to deal with the challenge of scale variance. On the one hand, object instances in the image may fill only 0.01% to 0.25% of the pixels, and, on the other hand, an instance may fill 80% to 90% of the whole image. It is tough to make a single feature map predict all the objects, with this huge variance, because of the limited receptive field that its neurons have. Particularly small objects (discussed in Section 3.1.6) are difficult to classify and localize. In this section we will discuss three main approaches that are used to tackle the challenge of scale variance.

The first is to make image pyramids [90, 104, 116, 327]. This helps enlarge small objects and shrink the large objects. Although the variance is reduced to an extent, each image has to be forwarded multiple times, making it computationally expensive and slower than the approaches discussed in the following. This approach is different from data augmentation techniques [60] where an image is randomly cropped, zoomed in or out, rotated, etc. and used exactly once for inference. Ren et al. [310] extracted feature maps from a frozen network at different image scales and merged them using maxout [110]. Singh and Davis [345] selectively back-propagated the gradients of object instances if they fall in a predetermined size range. This way, small objects must be scaled up to be considered for training. They named their technique Scale Normalization for Image Pyramids (SNIP). Singh et al. [347] optimized this approach by processing only context regions around ground-truth instances, referred to as chips.

Second, a set of default reference boxes, with varied sizes and aspect ratios that cover the whole image uniformly, were used. Ren et al. [309] proposed a set of reference boxes at each sliding window location which are trained to regress and classify. If an anchor box has a significant overlap with the ground truth it is treated as positive; otherwise, it is ignored or treated as negative. Due to the huge density of anchors most of them are negative. This leads to an imbalance in the positive and negative examples. To overcome it, OHEM [337] or Focal Loss [216] are generally applied at training time. One more downside of anchors is that their design has to be adapted according to the object sizes in the dataset. If large anchors are used with too many small objects, or vice versa, then they won't be able to train as efficiently. Default reference boxes are an important design feature in double stage [62] as well as single-stage methods [221, 306]. Most of the top winning entries [63, 118, 215, 216] use them in their models. Bodla et al. [21] helped by improving the suppression technique of double detections, generated from the dense set of reference boxes, at inference time.

Third, multiple convolutional layers were used for bounding box predictions. Since a single feature map was not enough to predict objects of varied sizes, SSD [221] added more feature maps to the original classification backbones. Cai et al. [28] proposed regions as well as performed detections on multiple scales in a two-stage detector. Najibi et al. [255] used this method to achieve state-of-the-art on a face dataset [433] and Li et al. [200] on a pedestrian dataset [87]. Yang et al. [431] used all the layers to reject easy negatives and then performed scale-dependent pooling on the remaining proposals. Shallower or finer layers are deemed to be better for detecting small objects while top or coarser layers are better at detecting bigger objects.
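The image pyramid approach described first is straightforward to express: run the same detector on several rescaled versions of the image, map the boxes back to the original resolution, and merge the duplicates (for instance with the NMS sketched in Section 2.3). The sketch below is illustrative; `detector` is a placeholder callable, not any specific model from the papers above.

```python
import numpy as np

def simple_resize(img, new_h, new_w):
    """Nearest-neighbour resize, enough for the sketch."""
    ys = (np.arange(new_h) * img.shape[0] / new_h).astype(int)
    xs = (np.arange(new_w) * img.shape[1] / new_w).astype(int)
    return img[ys][:, xs]

def pyramid_inference(image, detector, scales=(0.5, 1.0, 2.0)):
    """`detector(img)` is assumed to return (boxes, scores) in `img` coordinates."""
    all_boxes, all_scores = [], []
    h, w = image.shape[:2]
    for s in scales:
        resized = simple_resize(image, int(h * s), int(w * s))
        boxes, scores = detector(resized)
        # map boxes back to the original image resolution
        all_boxes += [[c / s for c in b] for b in boxes]
        all_scores += list(scores)
    return all_boxes, all_scores     # to be merged with NMS afterwards

# Dummy detector that "finds" one fixed box whatever the input size.
dummy = lambda img: ([[10, 10, 40, 40]], [0.9])
boxes, scores = pyramid_inference(np.zeros((100, 100, 3)), dummy)
print(boxes, scores)
```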

Figure 13: Number of papers published each year for the challenge of rotational variance. Query used in Google Scholar: ("rotational variance" OR "rotational invariance" OR "rotational invariant") AND "object detection".

In the original design, all the layers predict the boxes independently and no information from other layers is combined or merged. Many papers then tried to fuse different layers [51, 192] or added an additional top-down network [338, 415]. They have already been discussed in Section 2.1.1.

Fourth, Dilated Convolutions (a.k.a. atrous convolutions) [438] were deployed to increase the filter's stride. This helped increase the receptive field size and, thus, incorporate larger context without additional computations. Obviously, smaller receptive fields are also needed if the objects are small, and thus only a clever combination of a larger receptive field with atrous convolutions and smaller ones, like in ASPP [42] (Atrous Spatial Pyramid Pooling), can lead to a successful scale invariance in detection. It has been successfully applied in the context of object detection [62] and semantic segmentation [42]. Dai et al. [63] presented a generalized version of it by additionally learning the deformation offsets.

3.1.2 Rotational Variance

In the real world, object instances are not necessarily present in an upright manner but can be found at an angle or even inverted. While it is hard to define rotation for flexible objects like a cat (a pose definition would be more appropriate), it is much easier to define it for texts or objects in aerial images, which have an expected rigid shape. It is well known that CNNs, as they are now, do not have the ability to deal with the rotational variance of the data. More often than not, this problem is circumvented by using data augmentation: showing the network slightly rotated versions of each patch. When training on full images with multiple annotations it becomes less practical. Furthermore, like for occlusions, this might work but it is disappointing, as one could imagine incorporating rotational invariance into the structure of the network.

Building rotational invariance can be simply done by using oriented bounding boxes in the region proposal step of modern detectors. Jiang et al. [159] used Faster R-CNN features to predict oriented bounding boxes; their straightened versions were then passed on to the classifier. More elegantly, a few works like [26, 119, 228] proposed to construct different kinds of RoI-pooling modules for oriented bounding boxes. Ma et al. [228] transformed the RoI-Pooling layer of Faster R-CNN by rotating the region inside the detector to make it fit the usual horizontal grid, which brought an astonishing increase of performance from the 38.7% of regular Faster R-CNN to 71.8% with additional tricks on MSRA. Similarly, He et al. [119] used a rotated version of the recently introduced RoI-Align to pool oriented proposals to get more discriminative features (better aligned with the text direction) that will be used in the text recognition parts. Busta et al. [26] also used rotated pooling by bilinear interpolation to extract oriented features to recognize text, after having rendered YOLO able to predict rotated bounding boxes. Shi et al. [333] detected, in the same way, oriented bounding boxes (called segments) with a similar architecture but differ from [26, 119, 228] because it also learned to merge the oriented segments appropriately, if they cover the same word or sentence, which allowed greater flexibility.

Liu and Jin [222] needed slightly more complicated anchors: quadrangle anchors, and regressed

compact text zones in a single-stage architecture similar to Faster R-CNN's RPN. This system, being more flexible than the previous ones, necessitated more parameters. They used Monte-Carlo simulations to compute overlaps between quadrangles. Liao et al. [212] directly rotated convolution filters inside the SSD framework, which effectively rendered the network rotation-invariant for a finite set of rotations (which is generalized in the recent [410] for segmentation). However, in the case of text detection, even oriented bounding boxes can be insufficient to cover text with a layout with too much curvature, and one often sees the same failure cases in different articles (circle-shaped texts for instance).

A different kind of approach for rotation invariance was taken by the two following works of Cheng et al. [52] and Laptev et al. [188], which made use of metric learning. The former proposed an original approach of using metric learning to force features of an image and its rotated versions to be close to each other, hence somehow invariant to rotations. In a somewhat related approach, the latter found a canonical pose for different rotated versions of an image and used a differentiable transformation to make every example canonical and to pool the same features.

The difficulty of predicting oriented bounding boxes is alleviated if one resorts to semantic segmentation, like in [466]. They learned to output semantic segmentation, then oriented bounding boxes were found based on the output score map. However, it shares the same downsides as other approaches [26, 119, 212, 222, 228] for text detection because, in the end, one still has to fit oriented rectangles to evaluate the performances.

Applications other than text detection also require rotation invariance. In the domain of aerial imagery, the recently released DOTA [421] is one of the first datasets of its kind expecting oriented bounding boxes for predictions. One can anticipate an avalanche of papers trying to use text detection techniques like [372], where the SSD framework is used to regress bounding box angles, or the former metric learning technique from Cheng et al. [52] and Cheng et al. [54]. For face detection, papers like [335] relied on oriented proposals too. The diversity of the methods shows that no real standard has emerged yet. Even the most sophisticated detection pipelines are only rotation invariant to a certain extent.

The detectors presented in this section do not yet have the same popularity as the vertical ones because all the main datasets, like COCO, do not present rotated images. One could define a rotated-COCO or rotated-VOC to evaluate the benefit these pipelines could bring over their vertical versions, but it is obviously difficult and would not be accepted as is by the community without a strong, well-thought-out evaluation protocol.

Figure 14: Number of papers published each year for the challenge of domain adaptation. Query used in Google Scholar: ("domain adaptation" OR "adapting domains") AND "object detection".

3.1.3 Domain Adaptation

It is often needed to repurpose a detector trained on domain A to function on domain B. In most cases this is because the dataset in domain A has lots of training examples and the categories in it are generic, whereas the dataset in domain B has fewer training examples and objects that are very specific or distinct from A. There are surprisingly very few recent articles that tackled explicit domain adaptation in the context of object detection – [361, 368, 428] did it for HOG based features – even though the literature for domain adaptation

34
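The Monte-Carlo overlap estimate mentioned above for quadrangle anchors (Liu and Jin [222]) is easy to reproduce: sample points uniformly inside the common bounding rectangle of the two quadrangles and count how many fall inside both. The snippet below is only an illustrative sketch under these assumptions (function names are ours).

```python
import numpy as np

def point_in_quad(pts, quad):
    """Vectorized test: are points inside a convex quadrilateral given as
    4 vertices in consistent (clockwise or counter-clockwise) order?"""
    signs = []
    for i in range(4):
        a, b = quad[i], quad[(i + 1) % 4]
        cross = (b[0] - a[0]) * (pts[:, 1] - a[1]) - (b[1] - a[1]) * (pts[:, 0] - a[0])
        signs.append(cross)
    signs = np.stack(signs, axis=1)
    return np.all(signs >= 0, axis=1) | np.all(signs <= 0, axis=1)

def quad_iou_monte_carlo(quad1, quad2, n_samples=10000, rng=None):
    rng = rng or np.random.default_rng(0)
    both = np.concatenate([quad1, quad2], axis=0)
    lo, hi = both.min(axis=0), both.max(axis=0)
    pts = rng.uniform(lo, hi, size=(n_samples, 2))
    in1, in2 = point_in_quad(pts, quad1), point_in_quad(pts, quad2)
    inter, union = np.sum(in1 & in2), np.sum(in1 | in2)
    return inter / union if union > 0 else 0.0

sq = np.array([[0, 0], [4, 0], [4, 4], [0, 4]], dtype=float)
rot = np.array([[2, -1], [5, 2], [2, 5], [-1, 2]], dtype=float)  # rotated square
print(round(quad_iou_monte_carlo(sq, rot), 2))
```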
Figure 14: Number of papers published each year for the challenge of domain adaptation. Query used in Google Scholar: ("domain adaptation" OR "adapting domains") AND "object detection".

3.1.3 Domain Adaptation

It is often needed to repurpose a detector trained on domain A to function on domain B. In most cases this is because the dataset in domain A has lots of training examples and the categories in it are generic, whereas the dataset in domain B has fewer training examples and objects that are very specific or distinct from A. There are surprisingly few recent articles that tackled explicit domain adaptation in the context of object detection – [361, 368, 428] did it for HOG based features – even though the literature on domain adaptation for classification is dense, as shown by the recent survey of Csurka [59]. For instance, when one trains a Faster R-CNN on COCO and wants to test it off-the-shelf on the car images of KITTI, Geiger et al. [98] ('car' is one of the 80 classes of COCO), one gets only 56.1% AP w.r.t. 83.7% using more similar images, because of the differences between the domains (see [383]).

Most works adapt the features learned in another domain (mostly classification) by simply fine-tuning the weights on the task at hand. Since [105], virtually every state-of-the-art detector is pre-trained on ImageNet or on an even bigger dataset. This is the case even for relatively large object detection datasets like COCO. There is no fundamental reason for it to be a requirement. The objects of the target domains have to be similar and of the same scales as the objects on which the network was pre-trained, as pointed out by Singh and Davis [345], who detected small cars in aerial imagery by first pre-training on ImageNet. The seminal work of Hoffman et al. [127], already evoked in the weakly supervised Section 4.2.1, showed how to transfer a good classifier trained on large scale image datasets to a good detector trained on few images, by fine-tuning the first layers of a convnet trained on classification and adapting the final layer using nearest neighbor classes. Hinterstoisser et al. [125] demonstrated another example of transfer learning where they froze the first layers of detectors trained on synthetic data and fine-tuned only the last layers on the target task.

We discuss below all the articles we found that go farther than simple transfer learning for domain adaptation in object detection. Raj et al. [294] aligned feature subspaces from different domains for each class using Principal Component Analysis (PCA). Chen et al. [49] used H-divergence theory and adversarial training to bridge the distribution mismatches. All the mentioned articles worked on adapting the features. Thanks to GANs, some works try to adapt the images directly, e.g., [150], which used CycleGAN from [480] to convert images directly from one domain to the other. The object detection community needs to evolve if we want to move beyond transfer learning.

One of the end goals of domain adaptation would be to be able to learn a model on synthetic data, which is available (almost) for free, and to have it perform well on real images. Pepik et al. [279] was, to the best of our knowledge, the first to point out that, even though CNNs are texture sensitive, wire-framed and CAD models used in addition to real data can improve the performances of detectors. Peng et al. [277] augmented PASCAL-VOC data with 3D CAD models of the objects found in PASCAL-VOC (planes, horses, potted plants, etc.), rendered them in backgrounds where they are likely to be found, and improved overall detection performances. Following this line, several authors introduced synthetic data for various tasks, such as: i) persons: Varol et al. [395]; ii) furniture: Massa et al. [236] created rendered CAD furniture on real backgrounds by using grayscale images to avoid color artifacts and improved the detection performances on the IKEA dataset; iii) text: Gupta et al. [112] created an oriented text detection benchmark by superimposing synthetic text on existing scenes while respecting geometric and uniformity constraints and showed better results on ICDAR; iv) logos: Su et al. [359] did the same without any constraints by superimposing transparent logos on existing images.
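The freeze-and-fine-tune transfer recipe mentioned above (e.g., Hinterstoisser et al. [125]) takes only a few lines in any modern framework. A minimal PyTorch sketch, assuming a torchvision Faster R-CNN and a small target-domain dataset (the class count is a made-up example):

```python
import torch
import torchvision

# Start from a detector pre-trained on a large source domain (here COCO).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Freeze the early feature extractor so only the later, task-specific
# layers adapt to the (small) target domain.
for name, param in model.backbone.named_parameters():
    param.requires_grad = False

# Replace the classification head for the target classes (hypothetical count).
num_target_classes = 4  # 3 object classes + background
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(
    in_feats, num_target_classes)

# Only the unfrozen parameters are passed to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.005, momentum=0.9)
```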
Georgakis et al. [99] synthesized new instances of 3D CAD models by copy-pasting rendered objects on surface normals, very close to [296], which used Blender to put instances of objects inside a refrigerator. Later, Dwibedi et al. [79], with the same approach but without respecting any global consistency, showed promise; for them only local consistency is important for modern object detectors. Similar to [99], they used different kinds of blending to make the detector robust to the pasting artifacts (more details can be found in [78]). More recently, Dvornik et al. [77] extended [79] by first finding locations in images with a high likelihood of object presence before pasting objects. Another recent approach [383] found that domain randomization when creating synthetic data is vital to train detectors: training on Virtual KITTI, Gaidon et al. [94], a dataset that was built to be close to KITTI (in terms of aspects, textures, vehicles and bounding box statistics), is not sufficient to be state-of-the-art on KITTI. One can gain almost one point of AP when building one's own version of Virtual KITTI by introducing more randomness than was present in the original, in the form of random textures and backgrounds, random camera angles and random flying distractor objects. Randomness was apparently absent from KITTI but is beneficial for the detector to gain generalization capabilities.

Several authors have shown interest in proposing tools for generating artificial images at a large scale. Qiu and Yuille [291] created the open-source plug-in UnrealCV for a popular game engine, Unreal Engine 4, and showed applications to deep network algorithms. Tian et al. [377] used the graphical model CityEngine to generate a synthetic city according to the layout of existing cities and added cars, trucks and buses to it using a game engine (Unity3D). The detectors trained on KITTI and this dataset are again better than with KITTI alone. Alhaija et al. [4] pushed Blender to its limits to generate almost real-looking 3D CAD cars with environment maps and pasted them inside different 2D/3D environments including KITTI, VirtualKITTI (and even Flickr). It is worth noting that some datasets included real images to better simulate the scene viewed by a robot in active vision settings, as in [5].

Another strategy is to render simple artificial images and increase the realism of the images in a second iteration, using Generative Adversarial Networks [339]. RenderGAN was used to directly generate realistic training images [348]. We refer the reader to the section on GANs (Section 3.2.2) for more information on the use of GANs for style transfer.

We have seen that, for the time being, synthetic datasets can augment existing ones but not totally replace them for object detection; the domain shift between synthetic data and the target distribution is still too large to rely on synthetic data only.

3.1.4 Object Localization

Accurate localization remains one of the two biggest sources of error [129] in fully supervised object detection. It mainly originates from small objects and the more stringent evaluation protocols applied in the latest datasets, where the predicted boxes are required to have an IoU of up to 0.95 with the ground-truth boxes. Generally, localization is dealt with by using smooth L1 or L2 losses along with the classification loss. Some papers proposed more detailed methodologies to overcome this issue. Also, annotating bounding boxes for each and every object is expensive; we will therefore also look into some methods that localize objects using only weakly annotated images.

Kong et al. [174] overcame the poor localization caused by the coarseness of the feature maps by aggregating hierarchical feature maps and then compressing them into a uniform space. It provided an efficient combination framework for deep but semantic, intermediate but complementary, and shallow but high-resolution CNN features. Chen et al. [46] proposed multi-thresholding straddling expansion (MTSE) to reduce localization bias and refine boxes at proposal time, based on super-pixel tightness as opposed to objectness based models. Zhang et al. [465] addressed the localization problem by using a search algorithm based on Bayesian optimization that sequentially proposed candidate regions for an object bounding box. Hosang et al. [132] tried to integrate NMS in the convolutional network, which in the end improved localization.

Many papers [101, 158] also try to adapt the loss function to address the localization problem. Gidaris and Komodakis [103] proposed to assign conditional probabilities to each row and column of a sample region, using a convolutional neural network adapted for this task. These probabilities allow more accurate inference of the object bounding box under a simple probabilistic framework. Since Intersection over Union (IoU) is used in the evaluation strategies of many detection challenges, Yu et al. [441] and Jiang et al. [158] optimized over IoU directly.
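Optimizing IoU directly, as in Yu et al. [441] and Jiang et al. [158], only requires the overlap computation to be written with differentiable operations. A minimal sketch of such a loss for axis-aligned (x1, y1, x2, y2) boxes, not taken from either paper:

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """1 - IoU between predicted and ground-truth boxes, both (N, 4) tensors
    in (x1, y1, x2, y2) format. Every operation is differentiable."""
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    return (1.0 - inter / (union + eps)).mean()

pred = torch.tensor([[10., 10., 50., 50.]], requires_grad=True)
gt = torch.tensor([[12., 8., 48., 52.]])
loss = iou_loss(pred, gt)
loss.backward()  # gradients flow back to the box coordinates
```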
The loss-based papers have been discussed in Section 2.2.1 in detail.

There is also an interesting question raised by some papers: do we really need to optimize for localization? Oquab et al. [262] used weakly annotated images to predict approximate locations of the objects. Their approach performed comparably to the fully supervised counterparts. Zhou et al. [474] were able to get localizable deep representations that exposed the implicit attention of CNNs on an image with the help of global average pooling layers. In comparison to the earlier approach, their localization is not limited to localizing a point lying inside an object but determines the full extent of the object. [12, 448, 472] have also tried to predict localizations by masking different patches of the image during test time. More weakly supervised methods are discussed in Section 4.2.1.

Figure 15: Number of papers published each year for the challenge of occlusion. Query used in Google Scholar: (occlusions OR occlusion OR occluded) AND "object detection".

3.1.5 Occlusions

Occlusions lead to partially missing information from object instances, which may be occluded by the background or by other object instances. Less information naturally leads to harder examples and inaccurate localizations. Occlusions happen all the time in real-life images. However, since deep learning is based on convolving filters and occlusions by definition introduce parasitic patterns, most modern methods are not robust to them by construction.

Training with occluded objects helps for sure [244], but it is often not doable because of a lack of data and, furthermore, it cannot be bulletproof. Wu et al. [419] managed to learn an And-Or model for cars by dynamic programming, where the And stood for the decomposition of the objects into parts and the Or for all the different configurations of parts (including occluded configurations). The learning was only possible thanks to the heavy use of synthetic data to model every possible type of occlusion. Another way to generate examples of occlusions is to directly learn to mask the proposals of Fast R-CNN [407].

For dense pedestrian crowds, deformable models and parts can help improve detection accuracy (see Section 2.1.5): if some parts are masked some others will not be, therefore the average score is diminished but not made zero, like in [106, 265, 325]. Parts are also useful for occlusion handling in face detection, where different CNNs can be trained on different facial parts [432]. The survey already tackled Deformable RoI-Pooling (RoI-Pooling with parts) [247]. Another way of re-introducing parts in modern pipelines is the deformable kernels of [63]. They presented a way to alleviate the occlusion problems by giving more flexibility to the usually fixed geometric structures.

Building special kinds of regression losses for bounding boxes acknowledging the proximity of each detection (which is reminiscent of the springs in the old part-based models) was done in [409]. In addition to the attraction term of the traditional regression loss, which pushes predictions towards their assigned ground truth, they added a repulsion term that pushes predictions away from each other.

Traditional non-maximum suppression causes a lot of problems with occlusions because overlapping boxes are suppressed: if one object is in front of another, only one is detected. To address this, Hosang et al. [132] offered to learn non-maximum suppression, making it continuous (and differentiable), and Bodla et al. [21] used a soft version that only degrades the score of the overlapping objects (more details about the various other types of NMS can be found in Section 2.3).
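The soft suppression of Bodla et al. [21] simply replaces the hard deletion step of classical NMS with a score decay that grows with the overlap. A small NumPy sketch of the linear variant (our own simplified rewrite, not the authors' code):

```python
import numpy as np

def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear Soft-NMS: instead of removing boxes that overlap a kept box,
    decay their scores proportionally to the overlap."""
    boxes, scores = boxes.copy(), scores.copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        best = max(idxs, key=lambda i: scores[i])
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            iou = box_iou(boxes[best], boxes[i])
            if iou > iou_thresh:
                scores[i] *= (1.0 - iou)  # soft decay instead of deletion
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(soft_nms(boxes, scores))  # the heavily overlapping box is kept with a reduced score
```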
Other approaches used cues and context to help infer the presence of occluded objects. Zhang et al. [457] used super-pixel labeling to help the detection of occluded objects: they hypothesized that if some pixels are visible then the object is there. This is also the approach of the recent [118], but it needs pixel-level annotations. In videos, temporal coherence can be used [436]: heavily occluded objects are not occluded in every frame and can be tracked to help detection.

But for now all the solutions seem far off from the mental inpainting ability humans use to infer missing parts. Using GANs for this purpose might be an interesting research direction.

Figure 16: Number of papers published each year for the challenge of small objects. Query used in Google Scholar: ("small objects" OR "small object") AND "object detection".

3.1.6 Detecting Small Objects

Detecting small objects is harder than detecting medium and large sized objects because of the lesser information associated with them, the easier confusion with the background, the higher precision required for localization, the large image sizes, etc. In the COCO metrics evaluation, objects occupying areas less than or equal to 32 × 32 pixels come under this category, and this size threshold is generally accepted within the community for datasets related to common objects. Datasets related to aerial images [421], traffic signs [490], faces [253], pedestrians [84] or logos [358] are generally abundant with small object instances.

In the case of objects like logos or traffic signs, the expected shape, size and aspect ratio of the objects to be detected are known, and this information can be embedded to bias the deep learning model. This strategy is much harder and not feasible for common objects as they are a lot more diverse. As an illustration, the winner of the COCO challenge 2017 [274], which used many of the latest techniques and an ensemble of four detectors, reported a performance of 34.5% mAP on small objects and 64.9% mAP on large objects. The following entries reported an even greater dip for smaller objects compared to the larger ones. Pham et al. [281] presented an evaluation, focusing on real-time small object detection, of three state-of-the-art models, YOLO, SSD and Faster R-CNN, with the related trade-offs between accuracy, execution time and resource constraints.

There are different ways to tackle this problem, such as: i) up-scaling the images, ii) shallow networks, iii) contextual information, iv) super-resolution. These four directions are discussed in the following.

The first – and most trivial – direction consists in up-scaling the image before detection. But naive upscaling is not efficient, as the large images become too large to fit into a GPU for training. Gao et al. [95] first down-sampled the image and then used reinforcement learning to train attention-based models to dynamically search for the interesting regions in the image. The selected regions are then studied at higher resolution and can be used to predict smaller objects. This avoids the need to analyze each pixel of the image with equal attention and saves some computational costs. Some papers [62, 63, 345] used image pyramids during training in the context of object detection, while [310] used them at inference time.
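Image-pyramid inference, the most common form of the up-scaling direction, is simple to express: run the same detector at several scales, map the boxes back to the original resolution, and merge the detections. A sketch, where `detector` and `nms_fn` are stand-ins for any single-scale model and any NMS routine (both assumed, not part of a specific paper):

```python
import numpy as np

def rescale(image, s):
    # Placeholder nearest-neighbour rescaling for an HxWxC numpy image.
    h, w = image.shape[:2]
    ys = (np.arange(int(h * s)) / s).astype(int).clip(0, h - 1)
    xs = (np.arange(int(w * s)) / s).astype(int).clip(0, w - 1)
    return image[ys][:, xs]

def pyramid_inference(image, detector, scales=(1.0, 1.5, 2.0), nms_fn=None):
    """Run `detector` on rescaled copies of `image` and merge the results.
    `detector(img)` is assumed to return (boxes[N, 4], scores[N]) in pixels."""
    all_boxes, all_scores = [], []
    for s in scales:
        boxes, scores = detector(rescale(image, s))
        all_boxes.append(boxes / s)        # map boxes back to original scale
        all_scores.append(scores)
    boxes = np.concatenate(all_boxes, axis=0)
    scores = np.concatenate(all_scores, axis=0)
    if nms_fn is not None:                 # cross-scale duplicates must be merged
        keep = nms_fn(boxes, scores)
        boxes, scores = boxes[keep], scores[keep]
    return boxes, scores
```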
The second direction is to use shallow networks. Small objects are easier to predict by detectors which have a smaller receptive field. The deeper networks, with their large receptive fields, tend to lose some information about the small objects in their coarser layers. Sommer et al. [259, 351] proposed very shallow networks, with fewer than 5 convolutional layers and three fully connected layers, for the purpose of detecting objects in aerial imagery. Such detectors are useful when the expected instances are only of the small type. But if the expected instances are of diverse sizes, it is more beneficial to use the finer feature maps of very deep networks for small objects and the coarser feature maps for larger objects. We have already discussed this approach in Section 3.1.1. Please refer to Section 4.2.4 for more low power and shallow detectors.

The third direction is to make use of the context surrounding the small object instances. Gidaris and Komodakis [102] and Zhu et al. [489] used context to improve the performance, but Chen et al. [36] used context specifically for improving the performance on small objects. They augmented the R-CNN with a context patch in parallel to the proposal patch generated from the region proposal network. Zagoruyko et al. [446] combined their approach of making the information flow through multiple paths with DeepMask object proposals [282, 284] to gain a massive improvement in the performance for small objects. Context can also be used by fusing coarser layers of the network with finer layers [215, 216, 338]. Context related literature is covered in Section 3.2.3 in detail.

Finally, the last direction is to use Generative Adversarial Networks to selectively increase the resolution of small objects, as proposed by Li et al. [201]. Their generator learned to enhance the poor representations of the small objects into super-resolved ones that are similar enough to real large objects to fool a competing discriminator. Table 1 summarizes the past subsection by grouping the articles by main idea and target challenge. We find it very useful to see which ideas have been thoroughly investigated by the literature and which are underexplored.

Figure 17: Good detections of persons are marked in green and bad detections in red. Helpful context in blue (the presence of mirror frames) can help lower the score of a box. The relationships (in fuchsia) between bounding boxes can also help: a person cannot be present twice in a picture. Enhancing parts of the picture using SR (Figure 18) is yet another way to better make a decision. All those "reasoning" modules are not included in the mainstream detectors.
Article references | Main idea | Challenge(s) addressed
[90, 104, 116, 327] | Image Pyramids | Scale Variance, Small Objects
[28, 72, 87, 174, 200, 215, 221, 255, 310, 332, 338, 345, 347, 415, 433, 446] | Features Fusion | Scale Variance, Small Objects
[345, 347] | Selective Backpropagation (SN) | Scale Variance, Small Objects, Object Localization
[21, 390] | Better NMS | Small Objects, Occlusions, Object Localization
[216, 337] | Hard Examples Mining (Explicit and Implicit) | Small Objects, Occlusions
[431] | Scale Dependent Pooling | Scale Variance, Small Objects
[26, 119, 159, 228] | Oriented Bounding Boxes | Rotational Variance
[26, 119, 228] | Oriented Pooling | Rotational Variance
[222, 333] | Flexible Anchors (segments, quadrangles) | Rotational Variance
[212] | Rotating Filters | Rotational Variance
[52, 188] | Rotation Invariant Features | Rotational Variance
[118, 466] | Auxiliary Task (semantic segmentation) | Rotational Variance, Occlusions
[49, 294] | Aligning Feature Distributions | Domain Adaptation
[150, 339] | Image Transformations (GANs) | Domain Adaptation
[79, 99, 277, 279, 296] | Data Augmentation using Synthetic Datasets | Domain Adaptation
[383] | Domain Randomization | Domain Adaptation
[46, 457] | Super-Pixels | Object Localization, Occlusions
[465] | Sequential Search | Object Localization
[101, 103, 158, 216, 441] | Loss Function Modifications | Small Objects, Object Localization
[63, 106, 247, 265, 325, 432] | Part Based Models | Occlusions
[63, 247] | Deformable CNN Modules | Occlusions
[436] | Tracking (in videos) | Occlusions
[95] | Dynamic Zooming | Small Objects
[259, 351] | Shallow Networks | Small Objects
[36, 102, 489] | Use of Contextual Information | Small Objects
[201] | Features Super Resolution | Small Objects

Table 1: Summary of the main ideas found in the literature to account for the limitations of the current deep learning architectures. For each idea we list the papers that implement it and the challenges they (sometimes only partially) address.
3.2 Complementary New Ideas in Object Detection

In this subsection we review ideas which haven't quite matured yet but which we feel could bring major breakthroughs in the near future. If we want the field to advance, we should embrace new grand ideas like these, even if that means completely rethinking all the architectural ideas evoked in Section 2.

3.2.1 Graph Networks

The dramatic failings of state-of-the-art detectors on perturbed versions of the COCO validation sets, spotted by Rosenfeld et al. [314], are raising questions for a better understanding of compositionality, context and relationships in detectors.

Battaglia et al. [11] recently wrote a position article arguing about the need to introduce more representational power into Deep Learning using graph networks. It means finding new ways to enforce the learning of graph structures of connected entities instead of outputting independent predictions. Convolutions are too local and translation equivariant to reflect the intricate structure of objects in their context.

One embodiment of this idea in the realm of detection can be found in the work of Wang et al. [406], where long-distance dependencies were introduced in deep-learning architectures. These combined local and non-local interactions are reminiscent of the CRF [185], which sparked a renewed interest for graphical models in 2001. Dot products between features determine their influences on each other: the closer they are in the feature space, the stronger their interactions will be (using a Gaussian kernel for instance). This seems to go against the very principles of DCNNs, which are, by nature, local. However, this kind of layer can be integrated seamlessly in any DCNN to its benefit; it is very similar to self-attention [55]. It is not clear yet if these new networks will replace their local counterparts in the long term, but they are definitely suitable candidates.

Graph structures also emerge when one needs to incorporate a priori knowledge (or inductive biases) on the spatial relationships of the objects to detect (relational reasoning) [135]. The relation module uses attention to learn object dependencies, also using dot products of features. Similarly, Wang et al. [406] incorporated geometrical features to further disambiguate relationships between objects. One of the advantages of this pipeline is the last relation module, which is used to remove duplicates similarly to the usual NMS step, but adaptively. We mention this article in particular because, although relationships between detected objects had been used in the literature before, it was the first attempt to have them as a differentiable module inside a CNN architecture.
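The non-local interactions described above can be written in a few lines: every spatial position attends to every other one, with weights given by dot products of embedded features. The following PyTorch module is a minimal sketch in the spirit of the non-local block of Wang et al. [406], not their exact formulation.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Minimal dot-product non-local (self-attention) layer over a feature map."""
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, 1)  # query embedding
        self.phi = nn.Conv2d(channels, reduced, 1)    # key embedding
        self.g = nn.Conv2d(channels, reduced, 1)      # value embedding
        self.out = nn.Conv2d(reduced, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C')
        k = self.phi(x).flatten(2)                    # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)           # pairwise influences
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                        # residual connection

x = torch.randn(2, 256, 14, 14)
print(NonLocalBlock(256)(x).shape)  # torch.Size([2, 256, 14, 14])
```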
3.2.2 Adversarial Trainings

No one in the computer vision community was spared by the amazing successes of Generative Adversarial Networks [109]. By pitting a con-artist (a CNN) against a judge (another CNN), one can learn to generate images from a target distribution up to an impressive degree of realism. This new tool keeps the flexibility of regular CNN architectures as it is implemented using the same bricks and, therefore, it can be added to any detection pipeline.

Even if [407] does not belong to the GAN family per se, the adversarial training it uses (dropping pixels in examples to make them harder to classify and hence render the network robust to occlusions) obviously drew its inspiration from GANs. Ouyang et al. [269] went a step further and used the GAN formalism to learn to generate pedestrians from white noise in large images and showed how those created examples were beneficial for the training of object detectors. There are numerous recent papers, e.g., [23, 276], proposing approaches for converting synthetic data towards more realistic images for classification. Inoue et al. [150] used the latest CycleGAN [480] to convert real images to cartoons and by doing so gained free annotations to train detectors on weakly labeled images, becoming the first work to use GANs to create full images for detectors. As stated in the introduction, GANs can also be used not in a standalone manner but directly embedded inside a detector: Li et al. [201] operated at the feature level by adapting the features of small objects to match features obtained with well resolved objects. Bai et al. [9] trained a generator directly for super-resolution of small object patches, using the traditional GAN loss in addition to classification losses and an MSE loss per pixel. Integrating the module in modern pipelines brought improvement over the original mAP on COCO; this very simple pipeline is summarized in Figure 18. These two articles addressed the detection of small objects, which is tackled in more detail in Section 3.1.6.

Figure 18: Small object patches from Regions of Interest are enhanced to better help the classifier make a decision in SOD-MTGAN [9].

Shen et al. [330] used GANs to completely replace the Multiple Instance Learning paradigm (see Section 4.2.1), using the GAN framework to generate candidate boxes following the real distribution of the training images' boxes, and built a state-of-the-art detector that is faster than all the others by two orders of magnitude.

Thus, this extraordinary breakthrough is starting to produce interesting results in object detection and its importance is growing. Considering the latest results in the generation of synthetic data using GANs, for instance the high resolution examples of [166] or the infinite image generators, BiCycleGAN from Zhu et al. [480] and MUNIT from Huang et al. [142], it seems the tsunami that started in 2014 will only get bigger in the years to come.

3.2.3 Use of Contextual Information

We will see in this section that the word context can mean a lot of different things, but taking it into account gives rise to many new methods in object detection. Most of them (like spatial relationships or using stuff to find things) are often overlooked in competitions, arguably for bad reasons (too complex to implement in the time frame of the challenge).

Methods have evolved a lot since Heitz and Koller [120] used clustering of stuff/backgrounds to help detect objects in aerial imagery. Now, thanks to CNN architectures, it is possible to do detection of things and stuff segmentation in parallel, both tasks helping the other [24].

Of course, this finding is not surprising. Certain objects are more likely to appear in certain stuff or environments (or contexts): thanks to our knowledge of the world, we find it weird to have a flying train. Katti et al. [167] showed that adding this human knowledge helps existing pipelines. The environments of visual objects also comprise the other objects that they appear with, which advocates for learning spatial relationships between objects. Mrowca et al. [250] and Gupta et al. [113] independently used spatial relationships between proposals and classes (using the WordNet hierarchy) to post-process detections. This is also the case in [492], where RNNs were used to model those relationships at different scales, and in [48], where an external memory module kept track of the likelihood of objects being together. Hu et al. [135], which we mentioned in Section 3.2.1, went even further with a trainable relation module inside the structure of the network. In a different but not unrelated manner, Gonzalez-Garcia et al. [108] improved the detection of parts of objects by associating parts with their root objects.

All multi-scale architectures use different sized contexts, as we saw in Section 2.1.1. Zeng et al. [450] used features from different sized regions (different contexts) in different layers of the CNN, with message-passing between features related to different contexts. Kong et al. [175] used skip connections and concatenation directly in the CNN architecture to extract multi-level and multi-scale information.

Sometimes, even the simplest local context surrounding a region of interest can help (see, for instance, the methods presented in Section 2.1.4, where the amount of context varies between the classifiers). Extracted proposals can include variable amounts of pixels (context here meaning the size of the proposal) to help the classifiers, such as in [268] or in [36, 101, 103]. Li et al. [196] included global image context in addition to regional context. Some approaches went as far as integrating all the image context: it was done for the first time in YOLO [308] with the addition of a fully connected layer on the last feature map. Wang et al. [406] modified the convolutional operator to put weights on every part of the image, helping the network use context outside the object to infer its existence. This use of global context is also found in the Global Context Module of the recent detection pipeline from Megvii [275]. Li et al. [207] proposed a fully connected layer on all the feature maps (similar to Redmon et al. [308]) with dilated kernels.
Article references | Ideas | Type of Context
[24, 167] | Segmenting Stuff / Using background cues | Background Context (Environment)
[48, 113, 135, 250, 442, 492] | Likelihood of Objects being together / Inferring Relationships / Memory Modules | Other Objects Context
[108] | Finding Root to find Parts | Parts-Objects Context
[175, 450] | Using different feature scales | Multi-scale Context
[36, 101, 103, 196, 268] | Adding variable sized context / Adding Borders of RoI | Surrounding Pixels
[207, 308, 406] | Adding connections to all pixels | Full Image Context

Table 2: Summary of the approaches taken to exploit different types of context.

Other kinds of context can also be put to work. Yu et al. [442] used latent variables to decide which context cues to use to predict the bounding boxes. It is not clear yet which method is the best to take context into account. Another question is: do we want to? Even if the presence of an object in a given context is unlikely, do we actually want to blind our detectors to unlikely situations? All the types of context that can be leveraged have been summarized in Table 2.

3.3 Concluding Remarks

This section finished the tour of all the principal CNN based approaches, past, present and future, that treat general object detection in the traditional settings. It has allowed us to peer through the armor of the CNN detectors and see them for what they are: impressive machines with amazing generalization capabilities but still powerless in a variety of cases in which a trained human would have no problem (domain adaptation, occlusions, rotations, small objects); for an example of a difficult test case even for so-called robust detectors see Figure 17. Potential ideas to go past these obstacles have also been mentioned; among them the use of adversarial training and of context are the most prominent. The following section will go into more specific set-ups, less traditional problems or environments, that will frame the detector abilities even further.

4 Extending Object Detection

Object detection may still feel like a narrow problem: one has a big training set of 2D images, huge resources (GPUs, TPUs, etc.) and wants to output 2D bounding boxes on a similar set of 2D images. However, these basic assumptions are often not met in practical scenarios. Firstly, there exist many other modalities in which one can perform object detection; these require conceptual changes in the architectures to perform equally well. Secondly, sometimes one might be constrained to learn from exceedingly few fully annotated images, so training a regular detector is either irrelevant or not an optimal choice because of overfitting. Also, detectors are not built to be run in research labs alone but to be integrated into industrial products, which often come with an upper bound on energy consumption and speed requirements to satisfy the customer. The aim of the following discussion is to present the research work done to extend deep learning based object detection to new modalities and under tough constraints. It ends with reflections on what other interesting functionalities a strong detector of the future might possess.

4.1 Detecting Objects in Other Modalities

There are several modalities other than 2D images that can be interesting: videos, 3D point clouds, medical imaging, hyper-spectral imagery, etc. We will discuss the former two in this survey.
We did not treat, for instance, the volumetric images from the medical domain (MRI, etc.) or hyper-spectral imagery, which are outside of the scope of this article and would deserve their own survey.

Figure 19: Detecting objects in other modalities: Left, videos. Right, 3D point-clouds.

4.1.1 Object Detection in Videos

The upside of detecting objects in videos is that they provide additional temporal information, but they also come with unique challenges: motion blur, appearance changes, video defocus, pose variations, computational efficiency, etc. It is a recent research domain due to the lack of large scale public datasets. One of the first video datasets is ImageNet VID [319], proposed in 2015. This dataset, as well as the recent datasets for object detection in video, are mentioned in Section A.4.

One of the simplest ways to use temporal information for detecting objects is the detection-by-tracking paradigm. As an example, Ray et al. [301] proposed a spatio-temporal detector of motion blobs, associated into tracks by a tracking algorithm. Each track is then interpreted as a moving object. Despite its simplicity, this type of algorithm is marginal in the literature as it is only interesting when the appearances of the objects are not available.

The most widely used approaches in the literature are those relying on tubelets. Tubelets were introduced in the T-CNN approach of Kang et al. [163, 164]. T-CNN relied on 4 steps. First, still-image object detection (with Faster R-CNN like detectors) was performed. Second, multi-context suppression removed detection hypotheses having the lowest scores: highly ranked detection scores were treated as high-confidence classes and the rest were suppressed. Third, motion-guided propagation transferred detection results to adjacent frames to reduce false negatives. Fourth, temporal tubelet rescoring used a tracking algorithm to obtain sequences of bounding boxes, classified into positive and negative samples; positive samples were mapped to a higher range, thus increasing the score margins. T-CNN has several follow-ups. The first was Seq-NMS [115], which constructed sequences along nearby high-confidence bounding boxes from consecutive frames, rescoring them to the average confidence; other boxes close to this sequence were suppressed. Another one was MCMOT [191], in which a post-processing stage, in the form of a multi-object tracker, was introduced, relying on hand-crafted rules (e.g., detector confidences, color/motion clues, changing point detection and forward-backward validation) to determine whether bounding boxes belonged to the tracked objects, and to further refine the tracking results. Tripathi et al. [385] exploited temporal information by training a recurrent neural network that took sequences with predicted bounding boxes as input and optimized an objective enforcing consistency across frames.

The most advanced pipeline for object detection in videos is certainly the approach of Feichtenhofer et al. [89], borrowing ideas from tubelets as well as from feature aggregation. The approach relies on a multitask objective loss, for frame-based object detection and across-frame track regression, correlating features that represent object co-occurrences across time and linking the frame level detections based on across-frame tracklets to produce the detections.
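The rescoring step at the heart of Seq-NMS [115] style methods can be illustrated very simply: once per-frame detections have been linked into a sequence (here assumed to be given by any association method), the whole sequence is assigned its average (or maximum) confidence. A toy sketch:

```python
def rescore_sequence(linked_detections, mode="avg"):
    """linked_detections: list of dicts, one per frame, each with keys
    'box' (x1, y1, x2, y2) and 'score'. Returns the rescored list."""
    scores = [d["score"] for d in linked_detections]
    new_score = sum(scores) / len(scores) if mode == "avg" else max(scores)
    return [{**d, "score": new_score} for d in linked_detections]

# A weak detection in frame 2 is rescued by the confident neighbouring frames.
track = [
    {"frame": 1, "box": (10, 10, 50, 60), "score": 0.92},
    {"frame": 2, "box": (12, 11, 52, 61), "score": 0.35},  # blurred frame
    {"frame": 3, "box": (14, 12, 54, 62), "score": 0.88},
]
print([round(d["score"], 2) for d in rescore_sequence(track)])  # [0.72, 0.72, 0.72]
```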
The literature on object detection in videos also addressed the question of computing time, since applying a detector on each frame can be time consuming. In general, it is non-trivial to transfer the state-of-the-art object detection networks to videos, as per-frame evaluation is slow. Deep feature flow [485, 486] ran the convolutional sub-network only on sparse key frames and propagated the deep feature maps to other frames via a flow field. It led to significant speedups, as flow computation is relatively fast. The impression network [123] proposed to iteratively absorb sparsely extracted frame features, with impression features being propagated all the way down the video, which helped enhance the features of low-quality frames. In the same way, the light flow of [487] is a very small network designed to aggregate features on key frames; for non-key frames, sparse feature propagation was performed, reaching a speed of 25.6 fps. Fast YOLO [329] came up with an optimized architecture that has 2.8X fewer parameters with just a 2% IoU drop, by applying a motion-adaptive inference method. Finally, [41] proposed to reallocate computational resources over a scale-time space: expensive detection is done sparsely and propagated across both scales and time, while cheaper networks do the temporal propagation over a scale-time lattice.

An interesting question is "What can we expect from using temporal information?" The improvement of the mAP due to the direct use of temporal information can vary from +2.9% [484] to +5.6% [89]. Table 3 gives a recap of this sub-subsection.

Article references | Highlight | Type of Detections
[191, 301] | Context cues / Motion blobs | Basic Tracking
[89, 115, 163, 164] | Motion Propagation / Tracking / Seq-NMS / Feature aggregation | Tubelets
[385] | Enforcing Consistency | RNNs
[486, 487] | Sparse Key frame aggregation / Fast Computation of flow | Flow Field
[41, 329] | Motion-based Inference | Adaptive Computation

Table 3: Summary of the video object detection methods.

4.1.2 Object Detection in 3D Point Clouds

This section addresses the literature on object detection in 3D data, whether it is true 3D point clouds or 2D images augmented with depth data (RGB-D images). These problems raise novel challenges, especially in the case of 3D point clouds, for which the nature of the data is totally different (both in terms of structure and contained information). We can distinguish 4 main types of approaches depending on: i) the use of 2D images and geometry, ii) detections made in raw 3D point clouds, iii) detections made in a 3D voxel grid, iv) detections made in 2D after projecting the point cloud on a 2D plane. Most of the presented methods are evaluated on the KITTI benchmark [98]. Section A.3 introduces the datasets used for 3D object detection and quantitatively compares the best methods on these datasets.

The methods belonging to the first category, monocular, start by processing RGB images and then add shape and geometric priors or occlusion patterns to infer 3D bounding boxes, as proposed by Chen et al. [44], Mousavian et al. [249] and Xiang et al. [424]. Deng and Latecki [68] revisited amodal 3D detection by directly relating 2.5D visual appearance to 3D objects and proposed a 3D object detection system that simultaneously predicts 3D locations and orientations of objects in indoor scenes. Li et al. [197] represented the data in a 2D point map and used a single end-to-end fully convolutional network to detect objects and predict full 3D bounding boxes, even while using a 2D convolutional network. Deep MANTA [34] is a robust convolutional network introduced for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation from 2D images.

Among the methods using 3D point clouds directly, we can mention the series of papers relying on the PointNet [288] and PointNet++ [290] networks, which are capable of dealing with the irregular format of point clouds without having to transform them into 3D voxel grids. F-PointNet [289] is a 3D detector operating on raw point clouds (RGB-D scans). It leveraged a mature 2D object detector to propose 2D object regions in RGB images and then collected all points within the frustum to form a frustum point cloud.
Article references | Implementation | Operates on
[34, 44, 68, 197, 249, 424] | 3D priors | 2D images
[288–290, 356] | PointNetworks / Graph Convolutions / SuperPixels | Point Clouds
[83, 195, 478] | 3/4D convolutions | Voxels
[15, 47, 240, 342] | Plane choices / Discretization / Counting | Projections (Bird's eye)
[182] | Feature Fusion | Multi-modal

Table 4: Summary of the 3D object detection approaches.

Voxel based methods such as VoxelNet [478] represented the irregular format of point clouds by fixed size 3D voxel grids on which standard 3D convolutions can be applied. Li [195] discretized the point cloud on square grids and represented the discretized data by a 4D array of fixed dimensions. Vote3Deep [83] examined the trade-off between accuracy and speed for different architectures applied on a voxelized representation of the input data.

Regarding approaches based on the bird's eye view, MV3D [47] projected the LiDAR point cloud to a bird's eye view on which a 2D region proposal network is applied, allowing the generation of 3D bounding box proposals. In a similar way, LMNet [240] addressed the question of real-time object detection using 3D LiDAR by projecting the point cloud onto 5 different frontal planes. More recently, BirdNet [15] proposed an original cell encoding mechanism for the bird's eye view, which is invariant to distance and to differences in LiDAR device resolution, as well as a detector taking this representation as input. One of the fastest methods (50 fps) is ComplexYOLO [342], which expanded YOLOv2 with a specific complex regression strategy to estimate multi-class 3D boxes in Cartesian space, after building a bird's eye view of the data.

Some recent methods, such as [182], combined different sources of information (e.g., bird's eye view, RGB images, 3D voxels, etc.) and proposed an architecture performing multimodal feature fusion on high resolution feature maps. Ku et al. [182] is one of the top performing methods on the KITTI benchmark [98]. Finally, it is worth mentioning that the super-pixel based method by Srivastava et al. [356] allowed the discovery of novel objects in 3D point clouds. In the same fashion as in the previous section, we display a recap in Table 4.
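The bird's-eye-view projection used by several of the methods above boils down to discretizing the ground plane and accumulating simple per-cell statistics. A small NumPy sketch of such an encoding; the particular cell features chosen here (max height, mean intensity, point count) are illustrative and not those of any specific paper:

```python
import numpy as np

def bev_encode(points, x_range=(0, 70), y_range=(-40, 40), cell=0.1):
    """points: (N, 4) array of LiDAR returns (x, y, z, intensity).
    Returns a (H, W, 3) bird's-eye-view map: max height, mean intensity, count."""
    w = int((x_range[1] - x_range[0]) / cell)
    h = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((h, w, 3), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.int32)

    # Keep only points inside the chosen area and compute their cell indices.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    xi = ((pts[:, 0] - x_range[0]) / cell).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / cell).astype(int)

    for (cx, cy, z, inten) in zip(xi, yi, pts[:, 2], pts[:, 3]):
        bev[cy, cx, 0] = max(bev[cy, cx, 0], z)   # max height in the cell
        bev[cy, cx, 1] += inten                   # summed intensity (averaged below)
        counts[cy, cx] += 1
    nonzero = counts > 0
    bev[nonzero, 1] /= counts[nonzero]            # mean intensity
    bev[..., 2] = np.minimum(counts, 64) / 64.0   # normalized point count
    return bev

cloud = np.random.rand(1000, 4) * np.array([70, 80, 3, 1]) - np.array([0, 40, 1, 0])
print(bev_encode(cloud).shape)  # (800, 700, 3)
```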
4.2 Detecting Objects Under Constraints

In object detection, challenges arise not only because of the naturally expected problems (scale, rotation, localization, occlusions, etc.) but also due to ones that are created artificially. The first motivation for the following discussion is to know and understand the research works that deal with the inadequacy of annotations in certain datasets. This inadequacy could be due to weak (image-level) labels, scarce bounding box annotations or no annotations at all for certain classes. The second motivation is to discuss the approaches dealing with the hardware and application constraints real-world detectors might encounter.

4.2.1 Weakly Supervised Detection

Research teams want to include as many images as possible in their proposed datasets. Due to budget constraints, to save costs or for some other reasons, they sometimes choose not to annotate precise bounding boxes around objects and include only image level annotations or captions. The object detection community has proven that it is still possible, with enough weakly annotated data, to train good object detectors.

The most obvious way to address Weakly Supervised Object Detection (WSOD) is to use the Multiple Instance Learning (MIL) framework [233]. The image is considered as being a bag of regions extracted by conventional object proposals: at least one of these candidate regions is positive if the image has the appropriate weak label, and if not, no region is positive. The classical formulation of the problem at hand (before CNNs) then becomes a latent-SVM on the region features, where the latent part is the assignment of each proposal (which is weakly constrained by the image label). This problem, being highly non-convex, is heavily dependent on the quality of the initialization.

Song et al. [353, 354] thus focused on the initialization of the boxes by starting from selective-search proposals. For each proposal, they used its K-nearest neighbors in other images to construct a bipartite graph. The boxes were then pruned by taking only the patches that occur in most positive images (covering) while not belonging to the set of neighbors of regions found in negative images. They also applied Nesterov smoothing on the SVM objective to make the optimization easier. Of course, if proposals do not span enough of the image some objects will not be detected and thus the performance will be bad, as there is no re-localization. The work of Sun et al. [362] also belongs to this category. Bilen et al. [19] added regularization to the smoothened optimization problem of Song et al. [353] using prior knowledge, but followed the same general directions. In another related research direction, Wang et al. [402] learned to cluster the regions extracted with selective search into K categories using unsupervised learning (pLSA) and then learned category selection using bag-of-words to determine the most discriminative clusters per class.

However, it is not always a requirement to explicitly solve the latent-SVM problem. Thanks to the fully convolutional structure of most CNNs, it is sometimes possible to get a rough idea of where an object might be while training for classification. For example, the arg-max of the produced spatial heat maps before global max-pooling is often located inside a bounding box, as shown in [261, 262]. It is also possible to learn to detect objects without using any ground truth bounding boxes for training by masking regions of the image and observing how the global classification score is impacted, as proposed by Bazzani et al. [12].

This free localization information can be improved through the use of different pooling strategies, for instance producing a spatial heat map and using global average pooling instead of global max pooling when training in classification. This strategy was used in [474], where the heat maps per class were thresholded to obtain bounding boxes. In this line of work, Pinheiro and Collobert [283] went a step further by producing pixel-level label segmentation maps, using Log-Sum-Exp pooling in conjunction with some image and smoothing priors. Other pooling strategies involved aggregating minimum and maximum evidences to get a more precise idea of where the object is and isn't, e.g., as in the line developed by Durand et al. [74, 75, 76]. Bilen and Vedaldi [18] used the spatial pyramid pooling module to take MIL to the modern age by incorporating it into a Fast R-CNN like architecture with a two-stream proposal classification part: one stream with classification scores and the other with relative rankings of proposals, merged together using Hadamard products, thus producing region level label predictions like in classic detection settings. They then aggregated all labels per image by taking the sum. They trained it end-to-end using image level labels thanks to their aggregation module, while adding a spatial-regularization constraint on the features obtained by the SPP module.

Another idea, which can be combined with MIL, is to draw the supervision from elsewhere. Tracked object proposals were used by Kumar Singh et al. [183] to extract pseudo-groundtruth to train detectors. This idea was further explored by Chen et al. [40], where the keywords extracted from the subtitles of documentaries allowed to further ground and cluster the generated annotations. In a similar way, Yuan et al. [443] used action description supervision via LSTMs. Cheap supervision can also be gained by involving user feedback [270], where the users iteratively improved the pseudo-ground truth by saying if the objects were missed or only partly included in the detections. Click supervision by users, far less demanding than full annotations, also improved the performance of detectors [271].
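The global-average-pooling trick of [474] can be summarized in a few lines: a classification network trained with GAP yields, for each class, a weighted sum of the last feature maps (a class activation map) that can be thresholded into a box. The sketch below assumes a generic `features` tensor and classifier weight matrix (both placeholders) and is only meant to illustrate the mechanism:

```python
import numpy as np

def class_activation_map(features, fc_weights, class_idx):
    """features: (C, H, W) last conv feature maps of a GAP-trained classifier.
    fc_weights: (num_classes, C) weights of the final linear layer.
    Returns the (H, W) activation map for class `class_idx`."""
    cam = np.tensordot(fc_weights[class_idx], features, axes=([0], [0]))
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)

def cam_to_box(cam, threshold=0.5):
    """Threshold the normalized map and return the tightest enclosing box."""
    ys, xs = np.where(cam >= threshold)
    if len(xs) == 0:
        return None
    return (xs.min(), ys.min(), xs.max(), ys.max())  # (x1, y1, x2, y2) in map cells

features = np.random.rand(512, 14, 14)     # placeholder conv features
fc_weights = np.random.rand(20, 512)       # placeholder classifier weights
cam = class_activation_map(features, fc_weights, class_idx=3)
print(cam_to_box(cam))                      # box in feature-map coordinates
```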
[316] used active learning to select the right images to annotate and thus get the same performance using far fewer images. One can also leverage strong annotations for other classes to improve the performance on weakly supervised classes. This was done in [374] by using the powerful LSDA framework [127]. This was also the case in [128, 146, 311].

This year, a lot of interesting new works continued to develop the MIL+CNN framework using diverse approaches [97, 369, 400, 462–464]. These articles will not be treated in detail because the focus of this survey is object detection in general and not WSOD.

As of this writing, the state-of-the-art mAP on VOC2007 in WSOD is 47.6% [463]. The gap is being reduced at an exhilarating pace, but we are still far from the 83.1% state-of-the-art with full annotations [247] (without COCO pre-training). We present a recap in Table 5.

Article references | Implementation | Paradigm
[19, 353, 354, 362, 402] | Optimization Tricks (smoothing, EM, etc.) | Full MIL
[12] | Monitor score change | Masking
[18, 261, 262, 283, 474] | Global Pooling / GAPooling / LogSumExp Pooling | Refining Pooling
[74–76] | top-k max/min | Contradictory Evidence Pooling
[40, 128, 146, 183, 270, 311, 374, 443] | Subtitles / Motion Cues / User clicks / Strong Annotations | Auxiliary Supervision

Table 5: Summary of the weakly supervised approaches.

4.2.2 Few-shot Detection

The cost of annotating thousands of boxes over hundreds of classes is too high. Although some large scale datasets have been created, it is not practical to do it for every single target domain. Collecting and annotating training examples in the case of video is even costlier than for still images, making few-shot detection even more interesting. For this purpose, researchers have come up with ways to train detectors with as few as three to five bounding boxes per target class and get lower but competitive performance compared to the fully supervised approach on a large scale dataset. Few-shot learning usually relies on semi-supervised learning mechanisms.

Dong et al. [73] took up an iterative approach to simultaneously train the model and generate new samples, which are used in the following iterations for training. They observed that as the model becomes more discriminative it is able to sample harder as well as more numerous instances. Iterating between multiple kinds of detectors was found to outperform the single detector approach. One interesting aspect of the paper is that their approach, with only three to four annotations per class, gives results comparable to weakly annotated approaches with image level annotations on the whole PASCAL VOC dataset. A similar approach was used by Keren et al. [168], who proposed a model which can be trained with as few as one single exemplar of an unseen class and a larger target example that may or may not contain an instance of the same class as the exemplar (weakly supervised learning). This model was able to simultaneously identify and localize instances of classes unseen at training time.

Another way to deal with few-shot detection is to fine-tune a detector trained on a source domain to a target domain for which only few samples are available. This is what Chen et al. [39] did, by introducing a novel regularization method involving depressing the background and transferring the knowledge from the source domain to the target domain to enhance the fine-tuned detector.

For videos, Misra et al. [243] proposed a semi-supervised framework in which some initial labeled boxes allowed to iteratively learn and label hundreds of thousands of object instances automatically. Criteria for reliable object detection and tracking constrained the semi-supervised learning process and minimized semantic drift.
4.2.3 Zero-shot Detection

Zero-shot detection is useful for systems in which a large number of classes have to be detected. It is hard to annotate a large number of classes, as the cost of annotation grows with the number of classes. This is a unique type of problem in the object detection domain: the aim is to classify and localize new categories, without any training examples, at test time, with the constraint that the new categories are semantically related to the objects in the training classes. Therefore, in practice, semantic attributes are available for the unseen classes. The challenges that come with this problem are the following. First, zero-shot learning techniques are restricted to recognizing a single dominant object and not all the object instances present in the image. Second, the background class during fully supervised training may contain objects from unseen classes; the detector will then be trained to discriminatively treat these classes as background.

While there is a comparably large literature on zero-shot classification, well covered in the survey [93], zero-shot detection has, to the best of our knowledge, only a few papers. Zhu et al. [482] proposed a method where semantic features are utilized during training but which is agnostic to semantic information at test time. This means they incorporated semantic attribute information in addition to seen classes during training and generated proposals only, but no identification label, for seen and unseen objects at test time. Rahman et al. [292] proposed a multitask loss that combines max-margin, useful for separating individual classes, and semantic clustering, useful for reducing noise in semantic vectors by positioning similar classes together and dissimilar classes far apart. They used ILSVRC [66], which contains an average of only three objects per image. They also proposed another method for a more general case where unseen classes are not predefined during training. Bansal et al. [10] proposed two background-aware approaches for this task: statically assigning the background image regions to a single background class embedding, and a latent-assignment-based alternating algorithm which associates the background with different classes belonging to a large open vocabulary. They used MSCOCO [214] and VisualGenome [179], which contain an average of 7.7 and 35 objects per image respectively. They also set the number of unseen classes to be higher, making their task more complex than the previous two papers. Since it is quite a new problem, there is no well-defined experimental protocol for this approach: works vary in the number and nature of unseen classes, the use of semantic attribute information of unseen classes during training, the complexity of the visual scene, etc.
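The common ingredient of these works is to score class-agnostic region proposals against semantic class embeddings rather than against a fixed classifier. The sketch below is a simplification of this projection-and-match recipe (none of [482, 292, 10] is implemented exactly this way, and the feature and embedding dimensions are assumptions): ROI features are projected into a word-embedding space and seen or unseen classes are ranked by cosine similarity.

```python
# Minimal sketch of embedding-based zero-shot scoring of region proposals,
# assuming ROI features of size 2048 and 300-d semantic class embeddings
# (e.g. word vectors). Unseen classes only need their embeddings at test time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroShotHead(nn.Module):
    def __init__(self, roi_dim=2048, emb_dim=300):
        super().__init__()
        self.project = nn.Linear(roi_dim, emb_dim)   # visual -> semantic space

    def forward(self, roi_feats, class_embeddings):
        # roi_feats: (num_rois, roi_dim); class_embeddings: (num_classes, emb_dim)
        v = F.normalize(self.project(roi_feats), dim=1)
        e = F.normalize(class_embeddings, dim=1)
        return v @ e.t()                             # cosine-similarity scores

head = ZeroShotHead()
# Training: cross-entropy (or a max-margin loss as in Rahman et al. [292])
# against the *seen* class embeddings only.
# Testing: simply swap in the embedding matrix of the unseen classes.
scores = head(torch.randn(10, 2048), torch.randn(20, 300))  # (10 rois, 20 classes)
```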
4.2.4 Fast and Low Power Detection

There is generally a trade-off between performance and speed (we refer to the comprehensive study of [140] for instance). When one needs real-time detectors, like for video object detection, one loses some precision. However, researchers have been constantly working on improving the precision of fast methods and on making precise methods faster. Furthermore, not every setup can have powerful GPUs, so for most industrial applications the detectors have to run on CPUs or on different low-power embedded devices like the Raspberry Pi.

Most real-time methods are single stage because they need to perform inference in a quasi fully convolutional manner. The most iconic methods have already been discussed in detail in the rest of the paper [216, 221, 306–308]. Zhou et al. [475] designed a scale-transfer module to replace the feature pyramid and thus obtained a detection network both more accurate and faster than YOLOv2. Iandola et al. [147] provided a framework to efficiently compute multi-scale features. Redmon and Angelova [305] used a YOLO-like architecture to provide oriented bounding boxes symbolizing grasps in real time. Shafiee et al. [329] built a faster version of YOLOv2 that runs on embedded devices other than GPUs. Li and Zhou [210] managed to speed up the SSD detector, bringing it to almost 70 fps, using a more lightweight architecture.

In single stage methods most of the computations are found in the backbone network, so researchers started to design new backbones for detection in order to have fewer operations, like PVANet [170], which builds deep and thin networks with fewer channels than its classification counterparts, or SqueezeDet [416], which is similar to YOLO but with more anchors and fewer parameters. Iandola et al. [148] built an AlexNet backbone with 50 times fewer parameters. Howard et al. [134] used depth-wise separable convolutions and point-wise convolutions to build an efficient backbone called MobileNets for image classification and detection. Sandler et al. [324] improved upon it by adding residual connections and removing non-linearities. Very recently, Tan et al. [367] used architecture search to come up with an even more efficient network (1.5 times faster than Sandler et al. [324] and with lower latency). ShuffleNet [461] attained impressive performance on ARM devices, which can only sustain limited computation (40 MFlops); its backbone is 13 times faster than AlexNet.

Finally, Wang et al. [405] proposed PeleeNet, a light network that is 66% of the model size of MobileNet, achieving 76.4% mAP on PASCAL VOC2007 and 22.4% mAP on MS COCO at a speed of 17.1 fps on iPhone 6s and 23.6 fps on iPhone 8. [205] is also very efficient, achieving 72.1% mAP on PASCAL VOC2007 with 0.95M parameters and 1.06B FLOPs.
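The depth-wise separable convolution behind MobileNets [134] is easy to reproduce; the block below is a minimal sketch (channel sizes are arbitrary) that reduces the cost of a k x k convolution by a factor of roughly 1/C_out + 1/k^2 compared to its standard counterpart.

```python
# Minimal sketch of the depth-wise separable block used in MobileNets-style
# backbones: a per-channel k x k convolution followed by a 1x1 point-wise
# convolution that mixes channels. Channel sizes here are illustrative.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, stride=stride,
                                   padding=k // 2, groups=c_in, bias=False)
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(c_in), nn.BatchNorm2d(c_out)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn1(self.depthwise(x)))
        return self.act(self.bn2(self.pointwise(x)))

# Parameter count: k*k*c_in + c_in*c_out instead of k*k*c_in*c_out for a
# standard convolution, e.g. 3*3*256 + 256*256 = 67,840 vs 3*3*256*256 = 589,824.
block = DepthwiseSeparableConv(256, 256)
y = block(torch.randn(1, 256, 32, 32))   # -> (1, 256, 32, 32)
```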
Fast double-staged methods exist, although the NMS part generally becomes the bottleneck. Among them one can mention again Singh et al. [346], one of the double-staged methods that researchers have brought to 30 fps: they sped up R-FCN by making the detection heads superclass-specific (sets of similar classes), thus decoupling detection from classification. Using a mask obtained by a fast and coarse face detection method, the authors of [37] greatly reduced the computational complexity of their double-stage detector at test time by computing convolutions only on non-masked regions. SNIPER [347] can train on 512x512 images using an adaptive sampling of the regions of interest; its training can therefore use larger batch sizes and be much faster, but it needs 30% more pixels than the original images at inference time, which makes it slower.

There has also been a lot of work on pruning and/or quantizing the weights of CNNs for image classification [114, 138, 141, 143, 144, 218, 273, 299, 454, 476], but much less in detection so far. One can nevertheless find some detection articles that used pruning. Girshick [104] used SVD on the weights of the fully connected layers in Fast R-CNN. Masana et al. [234] pruned near-zero weights in detection networks and extended the compression to be domain-adaptive in Masana et al. [235].
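The SVD trick used in Fast R-CNN [104] is straightforward: a fully connected layer with an m x n weight matrix is replaced by two smaller layers of rank t, cutting the cost from mn to t(m + n). Below is a minimal sketch (the layer sizes are illustrative, not taken from the original work).

```python
# Minimal sketch of compressing a fully connected layer with a truncated SVD,
# in the spirit of the head compression of Fast R-CNN [104].
# Sizes are illustrative: a 4096 x 4096 layer compressed to rank 256.
import torch
import torch.nn as nn

def truncated_svd_fc(fc: nn.Linear, t: int) -> nn.Sequential:
    W = fc.weight.data                       # (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    first = nn.Linear(fc.in_features, t, bias=False)
    second = nn.Linear(t, fc.out_features, bias=True)
    first.weight.data = (torch.diag(S[:t]) @ Vh[:t]).contiguous()   # (t, in)
    second.weight.data = U[:, :t].contiguous()                      # (out, t)
    if fc.bias is not None:
        second.bias.data = fc.bias.data.clone()
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 4096)
compressed = truncated_svd_fc(fc, t=256)
# Parameters: 2 * 256 * 4096 ~ 2.1M instead of 4096 * 4096 ~ 16.8M,
# at the price of a small approximation error on the layer's output.
```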
To help the reader better grasp the different accuracy vs. speed trade-offs present in modern methods, we display some of the leading methods on PASCAL-VOC 2007 [88], with their inference speed on one image (batch size of 1), in Figure 20.

Figure 20: Performance on VOC07 with respect to inference speed on a TitanX GPU. The vertical line represents the limit of real-time speed (indistinguishable from continuous motion for the human eye). We also added in light gray some relevant work measured on similar devices (K40, TITAN Xp, Jetson TX2). Only RefineDet [460], DES [467] and STDN [475] are simultaneously real-time and above 80% in mAP, although for some of them (DES, STDN) better hardware (TITAN Xp) must have helped.

It is not only necessary to respect the available material constraints (data and machines): detectors also have to be reliable. They must be robust to perturbations, and while they can make mistakes, the mistakes need to be interpretable, which is a challenge in itself given the millions of weights and the architectural complexity of modern pipelines. It is a good sign to outperform all other methods on a benchmark; it is something else to perform accurately in the wild. That is why we dedicate the following sections to the exploration of such challenges.

4.3 Towards Versatile Object Detectors

So far in this survey, detectors were tested on limited, well-defined benchmarks. This is mandatory to assess their performances. However, in the end we are really interested in their behavior in the wild, where no annotations are present. Detectors have to be robust to unusual situations, and one would wish for detectors to be able to evolve by themselves. This section reviews the state of deep learning methods w.r.t. these expectations.
Figure 21: On the left side we display an example of guided backpropagation to visualize the patterns that make neurons fire, from [355]; on the right side we show the gradient-mask approach of [91] to find important zones for a classifier on an image, which can lead to bad surprises (the network uses the spoon as a proxy for the presence of coffee).

4.3.1 Interpretability and Robustness

With the recent craze about self-driving cars, it has become a top priority to build detectors that can be trusted with our lives. Hence, detectors should be robust to physical adversarial attacks [43, 226] and to weather conditions, which was the reason for building KITTI [98] and DETRAC [411] back then and has now led to the creation of two amazingly huge car detection datasets: ApolloScape from Baidu [440] and BDD100K from Berkeley [440]. The driving conditions of the real world are extremely complex: changing environments, reflections, different traffic signs and rules for different countries. So far, this open problem is largely unsolved, even if some industry players seem confident enough to leave self-driving cars without safety nets in specific cities. It will surely involve at some point the heavy use of synthetic data, otherwise it would take a lifetime to gather the data necessary to be confident enough. To finish on a positive note, detectors in self-driving cars can benefit from multi-sensory inputs such as LiDAR point clouds [124], other lasers and multiple cameras, which can help disambiguate certain difficult situations (reflections on the cars in front, for instance).

But most of all, detectors should incorporate a certain level of interpretability, so that if a dramatic failure happens it can be understood and fixed. It is also a need for legal matters. Very few works have done so, because it requires delving into the feature maps of the backbone network. A few works proposed different approaches for classification only, but no consensus has been reached yet. Among the popular methods one can cite the gradient map in the image space of Simonyan et al. [344], the occlusion analysis of Zeiler and Fergus [449], the guided backpropagation of Springenberg et al. [355] and, recently, the perturbation approach of Fong and Vedaldi [91]. Figure 21 shows the insights gained by using two of the mentioned methods on a classifier.

No such method exists yet for object detectors, to the best of our knowledge. It would be a very interesting research direction for future works.
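As an illustration of the simplest of these classification-oriented tools, the gradient saliency map of Simonyan et al. [344] can be obtained in a few lines. The sketch below assumes any differentiable image classifier (here a torchvision ResNet, chosen only for convenience) and is not specific to detectors.

```python
# Minimal sketch of a vanilla gradient saliency map (Simonyan et al. [344]):
# back-propagate the score of the predicted class to the input pixels and
# visualize the magnitude of the gradient. Works for any differentiable model.
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()

def saliency(image):                 # image: (3, H, W), already normalized
    x = image.unsqueeze(0).requires_grad_(True)
    scores = model(x)                # (1, num_classes)
    scores[0, scores.argmax()].backward()
    # max over channels of |d score / d pixel| gives one heat value per pixel
    return x.grad[0].abs().max(dim=0).values    # (H, W)

heatmap = saliency(torch.randn(3, 224, 224))    # random input, just for shapes
```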
4.3.2 Universal Detector, Lifelong Learning

Having object detectors able to iteratively, and without any supervision, learn to detect novel object classes and improve their performance would be one of the Holy Grails of computer vision. This can take the form of lifelong learning, where the goal is to sequentially retain learned knowledge and to selectively transfer that knowledge when learning a new task, as defined in [341], or of never-ending learning [245], where the system has sufficient self-reflection to avoid plateaus in performance and can decide how to progress by itself. However, one of the biggest issues with current detectors is that they suffer from catastrophic forgetting, as noted by Castro et al. [33]: their performance decreases when new classes are added incrementally. Some authors have tried to face this challenge. For example, the knowledge distillation loss introduced by Li and Hoiem [209] allows forgetting old data while using previous models to constrain the updated ones during learning. In the domain of object detection, the only recent contribution we are aware of is the incremental learning approach of Shmelkov et al. [336], which relies on a distillation mechanism. Lifelong learning and never-ending learning are domains where a lot still has to be discovered or developed.
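A minimal sketch of the distillation idea behind [209, 336] is given below: a frozen copy of the previous detector provides soft targets for the old classes, and the incremental loss adds a penalty that keeps the new model's predictions for those classes close to the old ones. Here we use a simple L2 term on classification logits; the cited works also distill box regressions and use their own weightings, and the `model.classifier` head is a hypothetical placeholder.

```python
# Minimal sketch of an incremental-learning step with knowledge distillation,
# in the spirit of Li and Hoiem [209] and Shmelkov et al. [336]: the detector
# head is trained on new classes while an L2 term ties its logits for the old
# classes to those of a frozen copy of the previous model.
import copy
import torch
import torch.nn.functional as F

old_model = copy.deepcopy(model).eval()          # frozen snapshot of the detector
for p in old_model.parameters():
    p.requires_grad = False

NUM_OLD_CLASSES = 20                             # hypothetical
LAMBDA_DISTILL = 1.0

def incremental_loss(roi_features, labels_new):
    logits_new = model.classifier(roi_features)            # hypothetical head
    with torch.no_grad():
        logits_old = old_model.classifier(roi_features)
    # standard classification loss on the newly added classes
    loss_new = F.cross_entropy(logits_new, labels_new)
    # distillation: do not drift on the classes the old model already knew
    loss_distill = F.mse_loss(logits_new[:, :NUM_OLD_CLASSES],
                              logits_old[:, :NUM_OLD_CLASSES])
    return loss_new + LAMBDA_DISTILL * loss_distill
```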
4.4 Concluding Remarks

It seems that deep learning in its current form is not yet fully ready to be applied to modalities other than 2D images. In videos, temporal consistency is hard to take into account with DCNNs because 3D convolutions are expensive, and tubelets and tracklets are interesting ideas but lack the elegance of DCNNs on still images. For point clouds the picture is even worse: the voxelization of point clouds does not deal with their inherent sparsity and creates memory issues, and even the simplicity and originality of the PointNet articles of Qi et al. [288, 290], which leave the point clouds untouched, have not matured enough yet to be widely adopted by the community. Hopefully, dealing with other constraints like weak supervision or few training images is starting to produce worthy results without too much change to the original DCNN architectures [76, 97, 369, 400, 462–464]; it seems to be more a matter of refining cost functions and coming up with new building blocks than of reinventing DCNNs entirely. However, the Achilles heel of deep learning methods is their interpretability and trustworthiness. The object detection community seems focused on improving performance on static benchmarks instead of finding ways to better understand the behavior of DCNNs. This is understandable, but it shows that deep learning has not yet reached full maturity. Eventually, one can hope that the performance of new detectors will plateau and, when it does, researchers will be forced to come back to the basics and focus instead on interpretability and robustness, before the next paradigm washes deep learning away entirely.

5 Conclusions

Object detection in images, a key topic attracting a substantial part of the computer vision community, has been revolutionized by the recent arrival of convolutional neural networks, which swept away all the methods previously dominating the field. This article provides a comprehensive survey of what has happened in the domain since 2012. It shows that, even if top-performing methods concentrate around two main alternatives – single stage methods such as SSD or YOLO, or two stage methods in the footsteps of Faster R-CNN – the domain is still very active. Graph networks, GANs, context, small objects, domain adaptation, occlusions, etc. are directions that are actively studied in the context of object detection. The extension of object detection to other modalities, such as videos or 3D point clouds, as well as to other constraints, such as weak supervision, is also very active and has been addressed. The appendix of this survey also provides a very complete list of the public datasets available to the community and highlights top-performing methods on these datasets. We believe this article will be useful to better understand the recent progress and the bigger picture of this constantly moving field.
References [8] Seung-Hwan Bae, Youngwan Lee, Youngjoo
Jo, Yuseok Bae, and Joong-won Hwang.
[1] Takuya Akiba, Shuji Suzuki, and Keisuke Rank of experts: Detection network ensem-
Fukuda. Extremely large minibatch SGD: ble. CoRR, abs/1712.00185, 2017. URL
training resnet-50 on imagenet in 15 min- http://arxiv.org/abs/1712.00185.
utes. CoRR, abs/1711.04325, 2017. URL
http://arxiv.org/abs/1711.04325. [9] Yancheng Bai, Yongqiang Zhang, Mingli
Ding, and Bernard Ghanem. SOD-MTGAN:
[2] Bogdan Alexe, Thomas Deselaers, and Vit- Small Object Detection via Multi-Task Gen-
torio Ferrari. What is an object? In The erative Adversarial Network. In Computer
Twenty-Third IEEE Conference on Com- Vision - ECCV 2018 - 15th European Con-
puter Vision and Pattern Recognition, CVPR ference, Munich, Germany, September 8 -
2010, San Francisco, CA, USA, 13-18 June 14, 2018, page 16, 2018.
2010, pages 73–80, 2010.
[10] Ankan Bansal, Karan Sikka, Gaurav
[3] Bogdan Alexe, Thomas Deselaers, and Vit- Sharma, Rama Chellappa, and Ajay
torio Ferrari. Measuring the objectness of Divakaran. Zero-shot object detection.
image windows. IEEE Transactions on Pat- CoRR, abs/1804.04340, 2018. URL
tern Analysis and Machine Intelligence, 34 http://arxiv.org/abs/1804.04340.
(11):2189–2202, 2012.
[11] Peter W. Battaglia, Jessica B. Hamrick,
[4] Hassan Abu Alhaija, Siva Karthik Victor Bapst, Alvaro Sanchez-Gonzalez,
Mustikovela, Lars M. Mescheder, Andreas Vinı́cius Flores Zambaldi, Mateusz Mali-
Geiger, and Carsten Rother. Augmented nowski, Andrea Tacchetti, David Raposo,
reality meets computer vision: Efficient Adam Santoro, Ryan Faulkner, Çaglar
data generation for urban driving scenes. Gülçehre, Francis Song, Andrew J. Bal-
International Journal of Computer Vision lard, Justin Gilmer, George E. Dahl, Ashish
(IJCV), 126(9):961–972, 2018. Vaswani, Kelsey Allen, Charles Nash, Vic-
[5] Phil Ammirato, Patrick Poirson, Eunbyung toria Langston, Chris Dyer, Nicolas Heess,
Park, Jana Kosecka, and Alexander C. Berg. Daan Wierstra, Pushmeet Kohli, Matthew
A dataset for developing and benchmarking Botvinick, Oriol Vinyals, Yujia Li, and Raz-
active vision. IEEE International Conference van Pascanu. Relational inductive biases,
on Robotics and Automation (ICRA), cs.CV, deep learning, and graph networks. CoRR,
2017. abs/1806.01261, 2018. URL http://arxiv.
org/abs/1806.01261.
[6] Anelia Angelova, Alex Krizhevsky, Vincent
Vanhoucke, Abhijit S Ogale, and Dave Fer- [12] Loris Bazzani, Alessandro Bergamo,
guson. Real-time pedestrian detection with Dragomir Anguelov, and Lorenzo Tor-
deep network cascades. In Proceedings of resani. Self-taught object localization with
the British Machine Vision Conference 2015, deep networks. In 2016 IEEE Winter Con-
BMVC 2015, Swansea, UK, September 7-10, ference on Applications of Computer Vision,
2015, volume 2, page 4, 2015. WACV 2016, Lake Placid, NY, USA, March
7-10, 2016, pages 1–9, 2016. URL https:
[7] Antreas Antoniou, Amos J. Storkey, and //doi.org/10.1109/WACV.2016.7477688.
Harrison Edwards. Data augmentation
generative adversarial networks. CoRR, [13] Karsten Behrendt and Libor Novak. A Deep
abs/1711.04340, 2017. URL http://arxiv. Learning Approach to Traffic Lights: De-
org/abs/1711.04340. tection, Tracking, and Classification. In

54
Robotics and Automation (ICRA), 2017 [21] Navaneeth Bodla, Bharat Singh, Rama Chel-
IEEE International Conference On, 2017. lappa, and Larry S Davis. Soft-nms—
improving object detection with one line of
[14] Sean Bell, C. Lawrence Zitnick, Kavita Bala, code. In IEEE International Conference on
and Ross Girshick. Inside-Outside Net: De- Computer Vision, ICCV 2017, Venice, Italy,
tecting Objects in Context with Skip Pool- October 22-29, 2017, pages 5562–5570, 2017.
ing and Recurrent Neural Networks. In 2016
IEEE Conference on Computer Vision and [22] Lubomir Bourdev, Subhransu Maji, Thomas
Pattern Recognition, CVPR 2016, Las Ve- Brox, and Jitendra Malik. Detecting people
gas,NV, USA, June 27-30, 2016, 2016. using mutually consistent poselet activations.
In Computer Vision - ECCV 2010, 11th Eu-
[15] Jorge Beltrán, Carlos Guindel, Fran- ropean Conference on Computer Vision, Her-
cisco Miguel Moreno, Daniel Cruzado, aklion, Crete, Greece, September 5-11, 2010,
Fernando Garcı́a, and Arturo de la pages 168–181, 2010.
Escalera. Birdnet: a 3d object detec-
tion framework from lidar information. [23] Konstantinos Bousmalis, Nathan Silberman,
CoRR, abs/1805.01195, 2018. URL David Dohan, Dumitru Erhan, and Dilip Kr-
http://arxiv.org/abs/1805.01195. ishnan. Unsupervised Pixel-Level Domain
Adaptation with Generative Adversarial Net-
[16] Rodrigo Benenson, Markus Mathias, Radu works. In 2017 IEEE Conference on Com-
Timofte, and Luc Van Gool. Pedestrian de- puter Vision and Pattern Recognition, CVPR
tection at 100 frames per second. In 2012 2017, Honolulu, HI, USA, July 21-26, 2017,
IEEE Conference on Computer Vision and pages 95–104, 2017.
Pattern Recognition, Providence, RI, USA,
June 16-21, 2012, pages 2903–2910, 2012. [24] Samarth Brahmbhatt, Henrik I. Chris-
tensen, and James Hays. StuffNet - Using
[17] Simone Bianco, Marco Buzzelli, Davide &apos;Stuff&apos; to Improve Object Detec-
Mazzini, and Raimondo Schettini. Deep tion. In IEEE Winter Conf. on Applications
Learning for Logo Recognition. Neurocom- of Computer Vision (WACV), 2017.
puting, 245:23–30, July 2017.
[25] Markus Braun, Sebastian Krebs, Fabian
[18] Hakan Bilen and Andrea Vedaldi. Weakly Su- Flohr, and Dariu M. Gavrila. The eurocity
pervised Deep Detection Networks. In 2016 persons dataset: A novel benchmark for ob-
IEEE Conference on Computer Vision and ject detection. CoRR, abs/1805.07193, 2018.
Pattern Recognition, CVPR 2016, Las Ve- URL http://arxiv.org/abs/1805.07193.
gas,NV, USA, June 27-30, 2016, 2016.
[26] Michal Busta, Lukas Neumann, and Jiri
[19] Hakan Bilen, Marco Pedersoli, and Tinne Matas. Deep textspotter: An end-to-end
Tuytelaars. Weakly supervised object detec- trainable scene text localization and recogni-
tion with convex clustering. In IEEE Confer- tion framework. In IEEE International Con-
ence on Computer Vision and Pattern Recog- ference on Computer Vision, ICCV 2017,
nition, CVPR 2015, Boston, MA, USA, June Venice, Italy, October 22-29, 2017, pages
7-12, 2015, June 2015. 2223–2231. IEEE Computer Society, 2017.
[20] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z. [27] Zhaowei Cai and Nuno Vasconcelos. Cas-
Li. Fine-grained evaluation on face detection cade R-CNN: delving into high quality ob-
in the wild. In Automatic Face and Gesture ject detection. In Computer Vision and Pat-
Recognition (FG), pages 1–7, 2015. tern Recognition (CVPR), 2018 IEEE Con-

55
ference on, pages 6154–6162, 2018. doi: Thierry Chateau. Deep MANTA: A coarse-
10.1109/CVPR.2018.00644. to-fine many-task network for joint 2d and
3d vehicle analysis from monocular image.
[28] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, In 2017 IEEE Conference on Computer
and Nuno Vasconcelos. A unified multi- Vision and Pattern Recognition, CVPR
scale deep convolutional neural network for 2017, Honolulu, HI, USA, July 21-26, 2017,
fast object detection. In Computer Vision pages 1827–1836, 2017.
- ECCV 2016 - 14th European Conference,
Amsterdam, The Netherlands, October 11- [35] Karanbir Singh Chahal and Kuntal Dey. A
14, 2016, pages 354–370, 2016. survey of modern object detection literature
using deep learning. CoRR, 2018.
[29] Guimei Cao, Xuemei Xie, Wenzhe Yang,
Quan Liao, Guangming Shi, and Jinjian Wu. [36] Chenyi Chen, Ming-Yu Liu 0001, Oncel
Feature-fused SSD: fast detection for small Tuzel, and Jianxiong Xiao. R-CNN for Small
objects. CoRR, abs/1709.05054, 2017. URL Object Detection. Computer Vision - ACCV
http://arxiv.org/abs/1709.05054. 2016 - 13th Asian Conference on Computer
Vision, Taipei, Taiwan, November 20-24,
[30] Joao Carreira and Cristian Sminchisescu.
2016, 10115:214–230, 2016.
Constrained parametric min-cuts for auto-
matic object segmentation. In The Twenty- [37] D. Chen, G. Hua, F. Wen, and J. Sun.
Third IEEE Conference on Computer Vision Supervised transformer network for efficient
and Pattern Recognition, CVPR 2010, San face detection. In Computer Vision - ECCV
Francisco, CA, USA, 13-18 June 2010, pages 2016 - 14th European Conference, Amster-
3241–3248, 2010. dam, The Netherlands, October 11-14, 2016,
[31] Joao Carreira and Cristian Sminchisescu. 2016.
Cpmc: Automatic object segmentation us-
[38] Guang Chen, Yuanyuan Ding, Jing Xiao, and
ing constrained parametric min-cuts. IEEE
Tony X Han. Detection evolution with multi-
Transactions on Pattern Analysis and Ma-
order contextual co-occurrence. In 2013
chine Intelligence, 34(7):1312–1328, 2011.
IEEE Conference on Computer Vision and
[32] Lluı́s Castrejón, Kaustav Kundu, Raquel Ur- Pattern Recognition, Portland, OR, USA,
tasun, and Sanja Fidler. Annotating ob- June 23-28, 2013, pages 1798–1805, 2013.
ject instances with a polygon-rnn. In 2017
IEEE Conference on Computer Vision and [39] Hao Chen, Yali Wang, Guoyou Wang,
Pattern Recognition, CVPR 2017, Honolulu, and Yu Qiao. LSTD: A low-shot transfer
HI, USA, July 21-26, 2017, pages 4485–4493, detector for object detection. In Sheila A.
2017. doi: 10.1109/CVPR.2017.477. McIlraith and Kilian Q. Weinberger, ed-
itors, Proceedings of the Thirty-Second
[33] Francisco M. Castro, Manuel J. Marı́n- AAAI Conference on Artificial Intelligence,
Jiménez, Nicolás Guil, Cordelia Schmid, and New Orleans, Louisiana, USA, Febru-
Karteek Alahari. End-to-End Incremental ary 2-7, 2018. AAAI Press, 2018. URL
Learning. In Computer Vision - ECCV 2018 https://www.aaai.org/ocs/index.php/
- 15th European Conference, Munich, Ger- AAAI/AAAI18/paper/view/16778.
many, September 8 - 14, 2018, 2018.
[40] Kai Chen, Hang Song, Chen Change Loy,
[34] Florian Chabot, Mohamed Chaouch, and Dahua Lin. Discover and Learn New
Jaonary Rabarisoa, Céline Teulière, and Objects from Documentaries. In 2017 IEEE

56
Conference on Computer Vision and Pat- cc/paper/5644-3d-object-proposals-
tern Recognition, CVPR 2017, Honolulu, HI, for-accurate-object-class-detection.
USA, July 21-26, 2017, pages 1111–1120,
July 2017. [46] Xiaozhi Chen, Huimin Ma, Xiang Wang, and
Zhichen Zhao. Improving object propos-
[41] Kai Chen, Jiaqi Wang, Shuo Yang, als with multi-thresholding straddling expan-
Xingcheng Zhang, Yuanjun Xiong, sion. In IEEE Conference on Computer Vi-
Chen Change Loy, and Dahua Lin. Optimiz- sion and Pattern Recognition, CVPR 2015,
ing video object detection via a scale-time Boston, MA, USA, June 7-12, 2015, 2015.
lattice. CoRR, abs/1804.05472, 2018. URL
http://arxiv.org/abs/1804.05472. [47] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li,
and Tian Xia. Multi-view 3d object detec-
[42] Liang-Chieh Chen, George Papandreou, Ia- tion network for autonomous driving. In 2017
sonas Kokkinos, Kevin Murphy, and Alan L. IEEE Conference on Computer Vision and
Yuille. Deeplab: Semantic image seg- Pattern Recognition, CVPR 2017, Honolulu,
mentation with deep convolutional nets, HI, USA, July 21-26, 2017, pages 6526–6534.
atrous convolution, and fully connected IEEE Computer Society, 2017.
crfs. IEEE Transactions on Pattern Analy-
[48] Xinlei Chen and Abhinav Gupta. Spatial
sis and Machine Intelligence, 40(4):834–848,
Memory for Context Reasoning in Object De-
2018. URL https://doi.org/10.1109/
tection. In 2017 IEEE Conference on Com-
TPAMI.2017.2699184.
puter Vision and Pattern Recognition, CVPR
[43] Shang-Tse Chen, Cory Cornelius, Jason Mar- 2017, Honolulu, HI, USA, July 21-26, 2017,
tin, and Duen Horng Chau. Robust physical 2017.
adversarial attack on faster R-CNN object
[49] Yuhua Chen, Wen Li, Christos Sakaridis,
detector. CoRR, abs/1804.05810, 2018. URL
Dengxin Dai, and Luc Van Gool. Domain
http://arxiv.org/abs/1804.05810.
adaptive faster R-CNN for object detection
[44] X. Chen, K. Kundu, Z. Zhang, H. Ma, and in the wild. CoRR, abs/1803.03243, 2018.
S. Fidler. Monocular 3d object detection for URL http://arxiv.org/abs/1803.03243.
autonomous driving. In 2016 IEEE Confer- [50] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xi-
ence on Computer Vision and Pattern Recog- aojie Jin, Shuicheng Yan, and Jiashi Feng.
nition, CVPR 2016, Las Vegas,NV, USA, Dual path networks. In Advances in Neural
June 27-30, 2016, 2016. Information Processing Systems 30: Annual
Conference on Neural Information Process-
[45] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu,
ing Systems 2017, 4-9 December 2017, Long
Andrew G. Berneshawi, Huimin Ma, Sanja
Beach, CA, USA, pages 4467–4475, 2017.
Fidler, and Raquel Urtasun. 3d object
proposals for accurate object class detec- [51] Yunpeng Chen, Jianshu Li, Bin Zhou, Jiashi
tion. In Corinna Cortes, Neil D. Lawrence, Feng, and Shuicheng Yan. Weaving multi-
Daniel D. Lee, Masashi Sugiyama, and scale context for single shot detector. CoRR,
Roman Garnett, editors, Advances in Neural abs/1712.03149, 2017. URL http://arxiv.
Information Processing Systems 28: An- org/abs/1712.03149.
nual Conference on Neural Information
Processing Systems 2015, December 7-12, [52] G. Cheng, P. Zhou, and J. Han. RIFD-CNN:
2015, Montreal, Quebec, Canada, pages Rotation-Invariant and Fisher Discriminative
424–432, 2015. URL http://papers.nips. Convolutional Neural Networks for Object

57
Detection. In 2016 IEEE Conference on [59] Gabriela Csurka. A comprehensive sur-
Computer Vision and Pattern Recognition, vey on domain adaptation for visual appli-
CVPR 2016, Las Vegas,NV, USA, June 27- cations. In Gabriela Csurka, editor, Do-
30, 2016, 2016. main Adaptation in Computer Vision Appli-
cations., Advances in Computer Vision and
[53] Gong Cheng and Junwei Han. A Survey on Pattern Recognition, pages 1–35. Springer,
Object Detection in Optical Remote Sensing 2017. URL https://doi.org/10.1007/
Images. ISPRS Journal of Photogrammetry 978-3-319-58347-1_1.
and Remote Sensing, 117:11–28, 2016.
[54] Gong Cheng, Peicheng Zhou, and Junwei [60] Ekin Dogus Cubuk, Barret Zoph, Dandelion
Han. Learning rotation-invariant convo- Mané, Vijay Vasudevan, and Quoc V. Le.
lutional neural networks for object detec- Autoaugment: Learning augmentation poli-
tion in vhr optical remote sensing images. cies from data. CoRR, abs/1805.09501, 2018.
IEEE Transactions on Geoscience and Re- URL http://arxiv.org/abs/1805.09501.
mote Sensing, 54(12):7405–7415, 2016.
[61] Jifeng Dai, Kaiming He, and Jian Sun.
[55] Jianpeng Cheng, Li Dong, and Mirella Lap- Instance-aware semantic segmentation via
ata. Long short-term memory-networks for multi-task network cascades. In 2016 IEEE
machine reading. In Proceedings of the 2016 Conference on Computer Vision and Pattern
Conference on Empirical Methods in Natural Recognition, CVPR 2016, Las Vegas,NV,
Language Processing, EMNLP 2016, Austin, USA, June 27-30, 2016, pages 3150–3158,
Texas, USA, November 1-4, 2016, pages 551– 2016.
561, 2016.
[62] Jifeng Dai, Yi Li, Kaiming He, and Jian
[56] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Sun. R-fcn: Object detection via region-
Lin, and Philip Torr. Bing: Binarized based fully convolutional networks. In Ad-
normed gradients for objectness estimation vances in Neural Information Processing Sys-
at 300fps. In 2014 IEEE Conference on tems 29: Annual Conference on Neural In-
Computer Vision and Pattern Recognition, formation Processing Systems 2016, Decem-
CVPR 2014, Columbus, OH, USA, June 23- ber 5-10, 2016, Barcelona, Spain, pages 379–
28, 2014, pages 3286–3293, 2014. 387, 2016.
[57] François Chollet. Xception: Deep learn-
ing with depthwise separable convolutions. [63] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li,
In 2017 IEEE Conference on Computer Vi- Guodong Zhang, Han Hu, and Yichen Wei.
sion and Pattern Recognition, CVPR 2017, Deformable convolutional networks. In IEEE
Honolulu, HI, USA, July 21-26, 2017, pages International Conference on Computer Vi-
1800–1807, 2017. sion, ICCV 2017, Venice, Italy, October 22-
29, 2017, pages 764–773. IEEE Computer So-
[58] Marius Cordts, Mohamed Omran, Sebastian ciety, 2017.
Ramos, Timo Rehfeld, Markus Enzweiler,
Rodrigo Benenson, Uwe Franke, Stefan Roth, [64] Navneet Dalal and Bill Triggs. Histograms
and Bernt Schiele. The cityscapes dataset of oriented gradients for human detection.
for semantic urban scene understanding. In In 2005 IEEE Computer Society Conference
Proc. of the IEEE Conference on Computer on Computer Vision and Pattern Recognition
Vision and Pattern Recognition (CVPR), (CVPR 2005), 20-26 June 2005, San Diego,
2016. CA, USA, volume 1, pages 886–893, 2005.

58
[65] Manolis Delakis and Christophe Garcia. text [72] Piotr Dollár, Ron Appel, Serge J. Belongie,
detection with convolutional neural net- and Pietro Perona. Fast feature pyramids for
works. In International Joint Conference object detection. IEEE Transactions on Pat-
on Computer Vision, Imaging and Computer tern Analysis and Machine Intelligence, 36
Graphics Theory and Applications (VISAP), (8):1532–1545, 2014.
pages 290–294, 2008.
[73] Xuanyi Dong, Liang Zheng, Fan Ma,
[66] Jia Deng, Wei Dong, Richard Socher, Li- Yi Yang, and Deyu Meng. Few-shot ob-
Jia Li, Kai Li, and Li Fei-Fei. Imagenet: ject detection. CoRR, abs/1706.08249, 2017.
A large-scale hierarchical image database. URL http://arxiv.org/abs/1706.08249.
In 2009 IEEE Computer Society Conference
on Computer Vision and Pattern Recogni- [74] Thibaut Durand, Nicolas Thome, and
tion (CVPR 2009), 20-25 June 2009, Miami, Matthieu Cord. MANTRA: Minimum Maxi-
Florida, USA, pages 248–255, 2009. mum Latent Structural SVM for Image Clas-
sification and Ranking. In IEEE Inter-
[67] Zhipeng Deng, Hao Sun, Shilin Zhou, Juan- national Conference on Computer Vision,
ping Zhao, and Huanxin Zou. Toward Fast ICCV 2015, Santiago, Chile, December 7-13,
and Accurate Vehicle Detection in Aerial Im- 2015, 2015.
ages Using Coupled Region-Based Convolu-
tional Neural Networks. IEEE Journal of Se- [75] Thibaut Durand, Nicolas Thome, and
lected Topics in Applied Earth Observations Matthieu Cord. Weldon: Weakly supervised
and Remote Sensing, 10:3652–3664, 2017. learning of deep convolutional neural net-
works. In 2016 IEEE Conference on Com-
[68] Zhuo Deng and Longin Jan Latecki. Amodal
puter Vision and Pattern Recognition, CVPR
Detection of 3D Objects: Inferring 3D
2016, Las Vegas,NV, USA, June 27-30, 2016,
Bounding Boxes from 2D Ones in RGB-
2016.
Depth Images. In 2017 IEEE Conference on
Computer Vision and Pattern Recognition,
[76] Thibaut Durand, Taylor Mordan, Nicolas
CVPR 2017, Honolulu, HI, USA, July 21-26,
Thome, and Matthieu Cord. WILDCAT:
2017, pages 398–406, 2017.
Weakly Supervised Learning of Deep Con-
[69] Terrance Devries and Graham W. Tay- vNets for Image Classification, Pointwise Lo-
lor. Dataset augmentation in feature space. calization and Segmentation. In 2017 IEEE
CoRR, abs/1702.05538, 2017. URL http: Conference on Computer Vision and Pat-
//arxiv.org/abs/1702.05538. tern Recognition, CVPR 2017, Honolulu, HI,
USA, July 21-26, 2017, 2017.
[70] Terrance Devries and Graham W. Tay-
lor. Improved regularization of convolu- [77] Nikita Dvornik, Julien Mairal, and Cordelia
tional neural networks with cutout. CoRR, Schmid. Modeling Visual Context is Key
abs/1708.04552, 2017. URL http://arxiv. to Augmenting Object Detection Datasets.
org/abs/1708.04552. In Computer Vision - ECCV 2018 - 15th
European Conference, Munich, Germany,
[71] Piotr Dollar, Christian Wojek, Bernt Schiele, September 8 - 14, 2018, page 18, 2018.
and Pietro Perona. Pedestrian detection: An
evaluation of the state of the art. IEEE [78] D. Dwibedi. Synthesizing scenes for instance
Transactions on Pattern Analysis and Ma- detection. Master’s thesis, Carnegie Mellon
chine Intelligence, 34(4):743–761, 2012. University, 2017.

59
[79] Debidatta Dwibedi, Ishan Misra, and Mar- [86] Dumitru Erhan, Christian Szegedy, Alexan-
tial Hebert. Cut, paste and learn: Surpris- der Toshev, and Dragomir Anguelov. Scal-
ingly easy synthesis for instance detection. In able Object Detection Using Deep Neural
IEEE International Conference on Computer Networks. In 2014 IEEE Conference on
Vision, ICCV 2017, Venice, Italy, October Computer Vision and Pattern Recognition,
22-29, 2017, pages 1310–1319. IEEE Com- CVPR 2014, Columbus, OH, USA, June 23-
puter Society, 2017. 28, 2014, 2014.

[80] Christian Eggert, Dan Zecha, Stephan [87] Andreas Ess, Bastian Leibe, and Luc
Brehm, and Rainer Lienhart. Improving Van Gool. Depth and appearance for mo-
small object proposals for company logo de- bile scene analysis. In IEEE 11th Inter-
tection. In Proceedings of the 2017 ACM on national Conference on Computer Vision,
International Conference on Multimedia Re- ICCV 2007, Rio de Janeiro, Brazil, October
trieval, pages 167–174, 2017. 14-20, 2007, pages 1–8, 2007.
[88] Mark Everingham, Luc Van Gool, Christo-
[81] Ian Endres and Derek Hoiem. Category inde- pher KI Williams, John Winn, and An-
pendent object proposals. In Computer Vi- drew Zisserman. The pascal visual object
sion - ECCV 2010, 11th European Confer- classes (voc) challenge. International Journal
ence on Computer Vision, Heraklion, Crete, of Computer Vision (IJCV), 88(2):303–338,
Greece, September 5-11, 2010, pages 575– 2010.
588, 2010.
[89] Christoph Feichtenhofer, Axel Pinz, and An-
[82] Ian Endres and Derek Hoiem. Category- drew Zisserman. Detect to track and track
independent object proposals with diverse to detect. In 2017 IEEE Conference on Com-
ranking. IEEE Transactions on Pattern puter Vision and Pattern Recognition, CVPR
Analysis and Machine Intelligence, 36(2): 2017, Honolulu, HI, USA, July 21-26, 2017,
222–234, 2014. pages 3038–3046, 2017.

[83] Martin Engelcke, Dushyant Rao, Do- [90] Pedro F. Felzenszwalb, Ross B. Gir-
minic Zeng Wang, Chi Hay Tong, and Ingmar shick, David A. McAllester, and Deva Ra-
Posner. Vote3Deep: Fast Object Detection manan. Object detection with discrimi-
in 3D Point Clouds Using Efficient Convolu- natively trained part-based models. IEEE
tional Neural Networks. In IEEE Interna- Transactions on Pattern Analysis and Ma-
tional Conference on Robotics and Automa- chine Intelligence, 32(9):1627–1645, 2010.
tion (ICRA), 2017. [91] Ruth C. Fong and Andrea Vedaldi. In-
terpretable explanations of black boxes by
[84] Markus Enzweiler and Dariu M Gavrila. meaningful perturbation. In IEEE Inter-
Monocular pedestrian detection: Survey and national Conference on Computer Vision,
experiments. IEEE Transactions on Pattern ICCV 2017, Venice, Italy, October 22-29,
Analysis and Machine Intelligence, 31(12): 2017, pages 3449–3457. IEEE Computer So-
2179–2195, 2008. ciety, 2017.
[85] Markus Enzweiler and Dariu M. Gavrila. [92] Cheng-Yang Fu, Wei Liu, Ananth Ranga,
A multilevel mixture-of-experts framework Ambrish Tyagi, and Alexander C. Berg.
for pedestrian classification. IEEE Trans- DSSD : Deconvolutional single shot detec-
actions on Image Processing, 20(10):2967– tor. CoRR, abs/1701.06659, 2017. URL
2979, 2011. http://arxiv.org/abs/1701.06659.

60
[93] Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xi- Institute of Technology, Cambridge, Mas-
angyang Xue, Leonid Sigal, and Shaogang sachusetts, USA, July 12-16, 2017, 2017.
Gong. Recent advances in zero-shot recogni- URL http://www.roboticsproceedings.
tion: Toward data-efficient understanding of org/rss13/p43.html.
visual content. IEEE Signal Processing Mag-
azine, 35(1):112–125, 2018. [100] David Gerónimo, Angel Domingo Sappa, An-
tonio López, and Daniel Ponsa. Adaptive
[94] A Gaidon, Q Wang, Y Cabon, and E Vig. image sampling and windows classification
Virtual worlds as proxy for multi-object for on-board pedestrian detection. In Pro-
tracking analysis. In 2016 IEEE Conference ceedings of the 5th International Conference
on Computer Vision and Pattern Recogni- on Computer Vision Systems (ICVS 2007),
tion, CVPR 2016, Las Vegas,NV, USA, June 2007.
27-30, 2016, 2016.
[101] Spyridon Gidaris and Nikos Komodakis. At-
[95] Mingfei Gao, Ruichi Yu, Ang Li, Vlad I. tend Refine Repeat - Active Box Proposal
Morariu, and Larry S. Davis. Dynamic zoom- Generation via In-Out Localization. In
in network for fast object detection in large Proceedings of the British Machine Vision
images. CoRR, abs/1711.05187, 2017. URL Conference 2016, BMVC 2016, York, UK,
http://arxiv.org/abs/1711.05187. September 19-22, 2016, 2016.
[96] Christophe Garcia and Manolis Delakis. A
[102] Spyros Gidaris and Nikos Komodakis. Ob-
neural architecture for fast and robust face
ject detection via a multi-region and seman-
detection. In Pattern Recognition, 2002. Pro-
tic segmentation-aware cnn model. In IEEE
ceedings. 16th International Conference on,
Conference on Computer Vision and Pat-
volume 2, pages 44–47, 2002.
tern Recognition, CVPR 2015, Boston, MA,
[97] Weifeng Ge, Sibei Yang, and Yizhou Yu. USA, June 7-12, 2015, pages 1134–1142,
Multi-evidence filtering and fusion for multi- 2015.
label classification, object detection and se-
mantic segmentation based on weakly super- [103] Spyros Gidaris and Nikos Komodakis. Loc-
vised learning. In Computer Vision and Pat- Net: Improving Localization Accuracy for
tern Recognition (CVPR), 2018 IEEE Con- Object Detection. In 2016 IEEE Conference
ference on, June 2018. on Computer Vision and Pattern Recogni-
tion, CVPR 2016, Las Vegas,NV, USA, June
[98] Andreas Geiger, Philip Lenz, and Raquel Ur- 27-30, 2016, 2016.
tasun. Are we ready for autonomous driving?
the kitti vision benchmark suite. In 2012 [104] Ross Girshick. Fast r-cnn. In IEEE In-
IEEE Conference on Computer Vision and ternational Conference on Computer Vision,
Pattern Recognition, Providence, RI, USA, ICCV 2015, Santiago, Chile, December 7-13,
June 16-21, 2012, pages 3354–3361, 2012. 2015, pages 1440–1448, 2015.

[99] Georgios Georgakis, Arsalan Mousavian, [105] Ross Girshick, Jeff Donahue, Trevor Darrell,
Alexander C. Berg, and Jana Kosecka. Syn- and Jitendra Malik. Rich feature hierarchies
thesizing training data for object detection for accurate object detection and semantic
in indoor scenes. In Nancy M. Amato, segmentation. In 2014 IEEE Conference on
Siddhartha S. Srinivasa, Nora Ayanian, Computer Vision and Pattern Recognition,
and Scott Kuindersma, editors, Robotics: CVPR 2014, Columbus, OH, USA, June 23-
Science and Systems XIII, Massachusetts 28, 2014, pages 580–587, 2014.

61
[106] Ross B. Girshick, Forrest N. Iandola, Trevor [112] Ankush Gupta, Andrea Vedaldi, and An-
Darrell, and Jitendra Malik. Deformable part drew Zisserman. Synthetic Data for Text
models are convolutional neural networks. In Localisation in Natural Images. In 2016
IEEE Conference on Computer Vision and IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2015, Boston, Pattern Recognition, CVPR 2016, Las Ve-
MA, USA, June 7-12, 2015, 2015. gas,NV, USA, June 27-30, 2016, pages 2315–
2324, June 2016.
[107] Georgia Gkioxari and Jitendra Malik. Find-
ing action tubes. In IEEE Conference on [113] Saurabh Gupta, Bharath Hariharan, and Ji-
Computer Vision and Pattern Recognition, tendra Malik. Exploring person context
CVPR 2015, Boston, MA, USA, June 7-12, and local scene context for object detection.
2015, pages 759–768, 2015. doi: 10.1109/ CoRR, abs/1511.08177, 2015. URL http:
CVPR.2015.7298676. //arxiv.org/abs/1511.08177.
[114] Song Han, Huizi Mao, and William J. Dally.
[108] Abel Gonzalez-Garcia, Davide Modolo, and Deep compression: Compressing deep neural
Vittorio Ferrari. Objects as context for network with pruning, trained quantization
part detection. CoRR, abs/1703.09529, 2017. and huffman coding. CoRR, abs/1510.00149,
URL http://arxiv.org/abs/1703.09529. 2015. URL http://arxiv.org/abs/1510.
00149.
[109] Ian Goodfellow, Jean Pouget-Abadie, Mehdi
Mirza, Bing Xu, David Warde-Farley, Sher- [115] Wei Han, Pooya Khorrami, Tom Le
jil Ozair, Aaron Courville, and Yoshua Ben- Paine, Prajit Ramachandran, Mohammad
gio. Generative adversarial nets. In Ad- Babaeizadeh, Honghui Shi, Jianan Li,
vances in Neural Information Processing Sys- Shuicheng Yan, and Thomas S. Huang. Seq-
tems 27: Annual Conference on Neural In- nms for video object detection. CoRR,
formation Processing Systems 2014, Decem- abs/1602.08465, 2016. URL http://arxiv.
ber 8-13 2014, Montreal, Quebec, Canada, org/abs/1602.08465.
pages 2672–2680, 2014.
[116] Kaiming He, Xiangyu Zhang, Shaoqing Ren,
[110] Ian J. Goodfellow, David Warde-Farley, and Jian Sun. Spatial pyramid pooling in
Mehdi Mirza, Aaron C. Courville, and deep convolutional networks for visual recog-
Yoshua Bengio. Maxout networks. In nition. IEEE Transactions on Pattern Anal-
Proceedings of the 30th International ysis and Machine Intelligence, 37(9):1904–
Conference on Machine Learning, ICML 1916, 2015.
2013, Atlanta, GA, USA, 16-21 June [117] Kaiming He, Xiangyu Zhang, Shaoqing Ren,
2013, pages 1319–1327, 2013. URL and Jian Sun. Deep residual learning for im-
http://jmlr.org/proceedings/papers/ age recognition. In 2016 IEEE Conference
v28/goodfellow13.html. on Computer Vision and Pattern Recogni-
tion, CVPR 2016, Las Vegas,NV, USA, June
[111] Priya Goyal, Piotr Dollár, Ross B. Girshick, 27-30, 2016, pages 770–778, 2016.
Pieter Noordhuis, Lukasz Wesolowski, Aapo
Kyrola, Andrew Tulloch, Yangqing Jia, and [118] Kaiming He, Georgia Gkioxari, Piotr Dollár,
Kaiming He. Accurate, large minibatch and Ross Girshick. Mask r-cnn. In IEEE
SGD: training imagenet in 1 hour. CoRR, International Conference on Computer Vi-
abs/1706.02677, 2017. URL http://arxiv. sion, ICCV 2017, Venice, Italy, October 22-
org/abs/1706.02677. 29, 2017, pages 2980–2988, 2017.

62
[119] Tong He, Zhi Tian, Weilin Huang, Chunhua [126] Erik Hjelmås and Boon Kee Low. Face De-
Shen, Yu Qiao, and Changming Sun. An end- tection: A Survey. Computer Vision and Im-
to-end textspotter with explicit alignment age Understanding (CVIU), 83(3):236–274,
and attention. In Computer Vision and Pat- September 2001.
tern Recognition (CVPR), 2018 IEEE Con-
ference on, 2018. [127] Judy Hoffman, Sergio Guadarrama, Eric S
Tzeng, Ronghang Hu, Jeff Donahue, Ross
[120] Geremy Heitz and Daphne Koller. Learn- Girshick, Trevor Darrell, and Kate Saenko.
ing Spatial Context - Using Stuff to Find Lsda: Large scale detection through adap-
Things. In Computer Vision - ECCV 2008, tation. In Advances in Neural Information
10th European Conference on Computer Vi- Processing Systems 27: Annual Conference
sion, Marseille, France, October 12-18, 2008, on Neural Information Processing Systems
Berlin, Heidelberg, 2008. 2014, December 8-13 2014, Montreal, Que-
bec, Canada, pages 3536–3544, 2014.
[121] Paul Henderson and Vittorio Ferrari. End-
to-end training of object class detectors for [128] Judy Hoffman, Deepak Pathak, Trevor Dar-
mean average precision. In Computer Vision rell, and Kate Saenko. Detector discovery in
- ACCV 2016 - 13th Asian Conference on the wild: Joint multiple instance and repre-
Computer Vision, Taipei, Taiwan, November sentation learning. In IEEE Conference on
20-24, 2016, pages 198–213, 2016. Computer Vision and Pattern Recognition,
CVPR 2015, Boston, MA, USA, June 7-12,
[122] João F. Henriques and Andrea Vedaldi. 2015, pages 2883–2891, 2015.
Warped Convolutions - Efficient Invariance
[129] Derek Hoiem, Yodsawalai Chodpathumwan,
to Spatial Transformations. International
and Qieyun Dai. Diagnosing error in ob-
Conference on Machine Learning (ICML),
ject detectors. In Computer Vision - ECCV
2017.
2012 - 12th European Conference on Com-
puter Vision, Florence, Italy, October 7-13,
[123] Congrui Hetang, Hongwei Qin, Shaohui Liu,
2012, pages 340–353, 2012.
and Junjie Yan. Impression network for video
object detection. CoRR, abs/1712.05896, [130] Jan Hosang, Rodrigo Benenson, and Bernt
2017. URL http://arxiv.org/abs/1712. Schiele. A convnet for non-maximum sup-
05896. pression. In German Conference on Pattern
Recognition, pages 192–204, 2016.
[124] Michael Himmelsbach, Andre Mueller,
Thorsten Lüttel, and Hans-Joachim [131] Jan Hendrik Hosang, Rodrigo Benenson, and
Wünsche. Lidar-based 3d object per- Bernt Schiele. How good are detection pro-
ception. In Proceedings of 1st international posals, really?. In British Machine Vision
workshop on cognition for technical systems, Conference, BMVC 2014, Nottingham, UK,
volume 1, 2008. September 1-5, 2014, 2014.

[125] Stefan Hinterstoisser, Vincent Lepetit, Paul [132] Jan Hendrik Hosang, Rodrigo Benenson, and
Wohlhart, and Kurt Konolige. On pre- Bernt Schiele. Learning non-maximum sup-
trained image features and synthetic images pression. In 2017 IEEE Conference on Com-
for deep learning. CoRR, abs/1710.10710, puter Vision and Pattern Recognition, CVPR
2017. URL http://arxiv.org/abs/1710. 2017, Honolulu, HI, USA, July 21-26, 2017,
10710. pages 6469–6477, 2017.

63
[133] Sebastian Houben, Johannes Stallkamp, Jan [140] Jonathan Huang, Vivek Rathod, Chen
Salmen, Marc Schlipsing, and Christian Igel. Sun, Menglong Zhu, Anoop Korattikara,
Detection of traffic signs in real-world images: Alireza Fathi, Ian Fischer, Zbigniew Wo-
The German Traffic Sign Detection Bench- jna, Yang Song, Sergio Guadarrama, et al.
mark. In International Joint Conference on Speed/accuracy trade-offs for modern con-
Neural Networks, number 1288, 2013. volutional object detectors. In 2017 IEEE
Conference on Computer Vision and Pat-
[134] Andrew G. Howard, Menglong Zhu, tern Recognition, CVPR 2017, Honolulu, HI,
Bo Chen, Dmitry Kalenichenko, Wei- USA, July 21-26, 2017, 2017.
jun Wang, Tobias Weyand, Marco
Andreetto, and Hartwig Adam. Mo- [141] Qiangui Huang, Shaohua Kevin Zhou, Suya
bilenets: Efficient convolutional neural You, and Ulrich Neumann. Learning to prune
networks for mobile vision applications. filters in convolutional neural networks. In
CoRR, abs/1704.04861, 2017. URL 2018 IEEE Winter Conference on Applica-
http://arxiv.org/abs/1704.04861. tions of Computer Vision, WACV 2018, Lake
Tahoe, NV, USA, March 12-15, 2018, pages
[135] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng
709–718. IEEE Computer Society, 2018.
Dai, and Yichen Wei. Relation networks for
object detection. In Computer Vision and [142] Xun Huang, Ming-Yu Liu, Serge J. Be-
Pattern Recognition (CVPR), 2018 IEEE longie, and Jan Kautz. Multimodal unsu-
Conference on, June 2018. pervised image-to-image translation. CoRR,
[136] Jie Hu, Li Shen, and Gang Sun. Squeeze-and- abs/1804.04732, 2018. URL http://arxiv.
excitation networks. CoRR, abs/1709.01507, org/abs/1804.04732.
2017. URL http://arxiv.org/abs/1709. [143] Itay Hubara, Matthieu Courbariaux, Daniel
01507. Soudry, Ran El-Yaniv, and Yoshua Ben-
[137] Peiyun Hu and Deva Ramanan. Finding tiny gio. Binarized neural networks. In Advances
faces. In 2017 IEEE Conference on Com- in Neural Information Processing Systems
puter Vision and Pattern Recognition, CVPR 29: Annual Conference on Neural Informa-
2017, Honolulu, HI, USA, July 21-26, 2017, tion Processing Systems 2016, December 5-
pages 1522–1530. IEEE Computer Society, 10, 2016, Barcelona, Spain, pages 4107–4115,
2017. 2016.

[138] Gao Huang, Shichen Liu, Laurens van der [144] Itay Hubara, Matthieu Courbariaux, Daniel
Maaten, and Kilian Q. Weinberger. Con- Soudry, Ran El-Yaniv, and Yoshua Bengio.
densenet: An efficient densenet using learned Quantized neural networks: Training neural
group convolutions. CoRR, abs/1711.09224, networks with low precision weights and ac-
2017. URL http://arxiv.org/abs/1711. tivations. The Journal of Machine Learning
09224. Research, 18(1):6869–6898, 2017.

[139] Gao Huang, Zhuang Liu, Laurens Van [145] Ahmad Humayun, Fuxin Li, and James M
Der Maaten, and Kilian Q Weinberger. Rehg. Rigor: Reusing inference in graph
Densely connected convolutional networks. cuts for generating object regions. In 2014
In 2017 IEEE Conference on Computer Vi- IEEE Conference on Computer Vision and
sion and Pattern Recognition, CVPR 2017, Pattern Recognition, CVPR 2014, Columbus,
Honolulu, HI, USA, July 21-26, 2017, vol- OH, USA, June 23-28, 2014, pages 336–343,
ume 1, page 3, 2017. 2014.

64
[146] Brody Huval, Adam Coates, and Andrew Y. Conference on Machine Learning, ICML
Ng. Deep learning for class-generic object 2015, Lille, France, 6-11 July 2015, pages
detection. CoRR, abs/1312.6885, 2013. URL 448–456, 2015. URL http://jmlr.org/
http://arxiv.org/abs/1312.6885. proceedings/papers/v37/ioffe15.html.
[147] Forrest N. Iandola, Matthew W. Moskewicz, [153] Max Jaderberg, Karen Simonyan, Andrea
Sergey Karayev, Ross B. Girshick, Trevor Vedaldi, and Andrew Zisserman. Syn-
Darrell, and Kurt Keutzer. Densenet: Im- thetic data and artificial neural networks
plementing efficient convnet descriptor pyra- for natural scene text recognition. CoRR,
mids. CoRR, abs/1404.1869, 2014. URL abs/1406.2227, 2014. URL http://arxiv.
http://arxiv.org/abs/1404.1869. org/abs/1406.2227.

[148] Forrest N Iandola, Song Han, Matthew W [154] Max Jaderberg, Karen Simonyan, and An-
Moskewicz, Khalid Ashraf, William J Dally, drew Zisserman. Spatial transformer net-
and Kurt Keutzer. Squeezenet: Alexnet-level works. In Advances in Neural Information
accuracy with 50x fewer parameters and¡ 0.5 Processing Systems 28: Annual Conference
mb model size. CoRR, abs/1602.07360v3, on Neural Information Processing Systems
2016. URL http://arxiv.org/abs/1602. 2015, December 7-12, 2015, Montreal, Que-
07360v3. bec, Canada, 2015.

[149] Hiroshi Inoue. Data augmentation by pair- [155] Vidit Jain and Erik Learned-Miller. FDDB:
ing samples for images classification. CoRR, A Benchmark for Face Detection in Uncon-
abs/1801.02929, 2018. URL http://arxiv. strained Settings. UM-CS-2010-009, Univer-
org/abs/1801.02929. sity of Massachusetts Amherst, 2010.

[150] Naoto Inoue, Ryosuke Furuta, Toshihiko Ya- [156] Jisoo Jeong, Hyojin Park, and Nojun Kwak.
masaki, and Kiyoharu Aizawa. Cross-domain Enhancement of SSD by concatenating fea-
weakly-supervised object detection through ture maps for object detection. CoRR,
progressive domain adaptation. CoRR, abs/1705.09587, 2017. URL http://arxiv.
abs/1803.11365, 2018. URL http://arxiv. org/abs/1705.09587.
org/abs/1803.11365. [157] Saurav Jha, Nikhil Agarwal, and Suneeta
Agarwal. Towards improved cartoon face
[151] Sergey Ioffe. Batch renormalization: Towards
detection and recognition systems. CoRR,
reducing minibatch dependence in batch-
abs/1804.01753, 2018. URL http://arxiv.
normalized models. In Advances in Neural
org/abs/1804.01753.
Information Processing Systems 30: Annual
Conference on Neural Information Process- [158] Borui Jiang, Ruixuan Luo, Jiayuan Mao,
ing Systems 2017, 4-9 December 2017, Long Tete Xiao, and Yuning Jiang. Acquisition
Beach, CA, USA, pages 1942–1950, 2017. of localization confidence for accurate ob-
URL http://papers.nips.cc/paper/ ject detection. CoRR, abs/1807.11590, 2018.
6790-batch-renormalization-towards- URL http://arxiv.org/abs/1807.11590.
reducing-minibatch-dependence-in-
batch-normalized-models. [159] Yingying Jiang, Xiangyu Zhu, Xiaobing
Wang, Shuli Yang, Wei Li, Hua Wang, Pei
[152] Sergey Ioffe and Christian Szegedy. Batch Fu, and Zhenbo Luo. R2CNN: rotational re-
normalization: Accelerating deep network gion CNN for orientation robust scene text
training by reducing internal covariate shift. detection. CoRR, abs/1706.09579, 2017.
In Proceedings of the 32nd International URL http://arxiv.org/abs/1706.09579.

65
[160] Alexis Joly and Olivier Buisson. Logo re- [166] Tero Karras, Timo Aila, Samuli Laine,
trieval with a contrario visual query expan- and Jaakko Lehtinen. Progressive grow-
sion. In Wen Gao, Yong Rui, Alan Hanjalic, ing of GANs for improved quality, stabil-
Changsheng Xu, Eckehard G. Steinbach, Ab- ity, and variation. In International Con-
dulmotaleb El-Saddik, and Michelle X. Zhou, ference on Learning Representations, 2018.
editors, Proceedings of the 17th International URL https://openreview.net/forum?id=
Conference on Multimedia 2009, Vancouver, Hk99zCeAb.
British Columbia, Canada, October 19-24,
2009, pages 581–584. ACM, 2009. [167] Harish Katti, Marius V. Peelen, and S. P.
Arun. Object detection can be improved us-
[161] Kinjal A Joshi and Darshak G Thakore. ing human-derived contextual expectations.
A Survey on Moving Object Detection and CoRR, abs/1611.07218, 2016. URL http:
Tracking in Video Surveillance System. In- //arxiv.org/abs/1611.07218.
ternational Journal of Soft Computing and
[168] Gil Keren, Maximilian Schmitt, Thomas
Engineering (IJSCE), 2(3):5, 2012.
Kehrenberg, and Björn W. Schuller. Weakly
[162] Hongwen Kang, Martial Hebert, Alexei A supervised one-shot detection with attention
Efros, and Takeo Kanade. Data-driven ob- siamese networks. CoRR, abs/1801.03329,
jectness. IEEE Transactions on Pattern 2018. URL http://arxiv.org/abs/1801.
Analysis and Machine Intelligence, (1):189– 03329.
195, 2015. [169] Aditya Khosla, Tinghui Zhou, Tomasz Mal-
[163] Kai Kang, Wanli Ouyang, Hongsheng Li, and isiewicz, Alexei A Efros, and Antonio Tor-
Xiaogang Wang. Object detection from video ralba. Undoing the damage of dataset bias.
tubelets with convolutional neural networks. In Computer Vision - ECCV 2012 - 12th Eu-
In 2016 IEEE Conference on Computer Vi- ropean Conference on Computer Vision, Flo-
sion and Pattern Recognition, CVPR 2016, rence, Italy, October 7-13, 2012, pages 158–
Las Vegas,NV, USA, June 27-30, 2016, pages 171, 2012.
817–825, 2016. [170] Kye-Hyeon Kim, Yeongjae Cheon, Sanghoon
Hong, Byung-Seok Roh, and Minje Park.
[164] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu
PVANET: deep but lightweight neural net-
Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe
works for real-time object detection. CoRR,
Wang, Ruohui Wang, Xiaogang Wang, and
abs/1608.08021, 2016. URL http://arxiv.
Wanli Ouyang. T-CNN: Tubelets with Con-
org/abs/1608.08021.
volutional Neural Networks for Object De-
tection from Videos. IEEE Transactions on [171] Diederik P. Kingma and Jimmy Ba. Adam:
Circuits and Systems for Video Technology, A method for stochastic optimization. CoRR,
pages 1–1, 2017. abs/1412.6980, 2014. URL http://arxiv.
org/abs/1412.6980.
[165] D. Karatzas, L. Gomez-Bigorda, A. Nico-
laou, S. Ghosh, A. Bagdanov, M. Iwamura, [172] Brendan F Klare, Ben Klein, Emma
J. Matas, L. Neumann, V. R. Chandrasekhar, Taborsky, Austin Blanton, Jordan Cheney,
S. Lu, F. Shafait, S. Uchida, and E. Valveny. Kristen Allen, Patrick Grother, Alan Mah,
Icdar 2015 competition on robust reading. In and Anil K Jain. Pushing the frontiers of
2015 13th International Conference on Doc- unconstrained face detection and recognition:
ument Analysis and Recognition (ICDAR), Iarpa janus benchmark a. In IEEE Confer-
pages 1156–1160, Aug 2015. ence on Computer Vision and Pattern Recog-

66
nition, CVPR 2015, Boston, MA, USA, June Belongie, Victor Gomes, Abhinav Gupta,
7-12, 2015, pages 1931–1939, 2015. Chen Sun, Gal Chechik, David Cai, Zheyun
Feng, Dhyanesh Narayanan, and Kevin
[173] Iasonas Kokkinos. Ubernet: Training a uni- Murphy. Openimages: A public dataset
versal convolutional neural network for low- for large-scale multi-label and multi-class
, mid-, and high-level vision using diverse image classification. Dataset available from
datasets and limited memory. In 2017 IEEE https://storage.googleapis.com/openimages/web/index.html,
Conference on Computer Vision and Pat- 2017.
tern Recognition, CVPR 2017, Honolulu, HI,
USA, July 21-26, 2017, pages 5454–5463. [179] Ranjay Krishna, Yuke Zhu, Oliver Groth,
IEEE Computer Society, 2017. Justin Johnson, Kenji Hata, Joshua Kravitz,
[174] Tao Kong, Anbang Yao, Yurong Chen, and Stephanie Chen, Yannis Kalantidis, Li-Jia
Fuchun Sun. HyperNet: Towards Accurate Li, David A. Shamma, Michael S. Bernstein,
Region Proposal Generation and Joint Ob- and Li Fei-Fei. Visual genome: Connect-
ject Detection. In 2016 IEEE Conference on ing language and vision using crowdsourced
Computer Vision and Pattern Recognition, dense image annotations. International Jour-
CVPR 2016, Las Vegas,NV, USA, June 27- nal of Computer Vision (IJCV), 123(1):32–
30, 2016, April 2016. 73, 2017.

[175] Tao Kong, Fuchun Sun, Anbang Yao, Huap- [180] Alex Krizhevsky. One weird trick for
ing Liu, Ming Lu, and Yurong Chen. RON: parallelizing convolutional neural networks.
reverse connection with objectness prior net- CoRR, abs/1404.5997, 2014. URL http:
works for object detection. In 2017 IEEE //arxiv.org/abs/1404.5997.
Conference on Computer Vision and Pat-
tern Recognition, CVPR 2017, Honolulu, HI, [181] Alex Krizhevsky, Ilya Sutskever, and Ge-
USA, July 21-26, 2017, pages 5244–5252. offrey E. Hinton. Imagenet classification
IEEE Computer Society, 2017. with deep convolutional neural networks. In
Peter L. Bartlett, Fernando C. N. Pereira,
[176] Tao Kong, Fuchun Sun, Wen-bing Huang, Christopher J. C. Burges, Léon Bottou,
and Huaping Liu. Deep feature pyramid re- and Kilian Q. Weinberger, editors, Ad-
configuration for object detection. CoRR, vances in Neural Information Processing
abs/1808.07993, 2018. URL http://arxiv. Systems 25: 26th Annual Conference on
org/abs/1808.07993. Neural Information Processing Systems
[177] Martin Kostinger, Paul Wohlhart, Peter M. 2012. Proceedings of a meeting held De-
Roth, and Horst Bischof. Annotated Fa- cember 3-6, 2012, Lake Tahoe, Nevada,
cial Landmarks in the Wild: A large-scale, United States, pages 1106–1114, 2012. URL
real-world database for facial landmark local- http://papers.nips.cc/paper/4824-
ization. In First IEEE International Work- imagenet-classification-with-deep-
shop on Benchmarking Facial Image Analysis convolutional-neural-networks.
Technologies, pages 2144–2151, 2011.
[182] Jason Ku, Melissa Mozifian, Jungwook Lee,
[178] Ivan Krasin, Tom Duerig, Neil Alldrin, Ali Harakeh, and Steven Lake Waslander.
Vittorio Ferrari, Sami Abu-El-Haija, Alina Joint 3d proposal generation and object
Kuznetsova, Hassan Rom, Jasper Uijlings, detection from view aggregation. CoRR,
Stefan Popov, Shahab Kamali, Matteo Mal- abs/1712.02294, 2017. URL http://arxiv.
loci, Jordi Pont-Tuset, Andreas Veit, Serge org/abs/1712.02294.

67
[183] Krishna Kumar Singh, Fanyi Xiao, and Yong 2016, Las Vegas, NV, USA, June 27-30,
Jae Lee. Track and transfer: Watching videos 2016, pages 289–297. IEEE Computer Soci-
to simulate strong human supervision for ety, 2016.
weakly-supervised object detection. In 2016
IEEE Conference on Computer Vision and [189] Hei Law and Jia Deng. Cornernet: Detecting
Pattern Recognition, CVPR 2016, Las Ve- objects as paired keypoints. In Computer Vi-
gas,NV, USA, June 27-30, 2016, pages 3548– sion - ECCV 2018 - 15th European Confer-
3556, 2016. ence, Munich, Germany, September 8 - 14,
2018, 2018.
[184] Weicheng Kuo, Bharath Hariharan, and Ji-
tendra Malik. Deepbox: Learning objectness [190] Yann LeCun, Léon Bottou, Genevieve B.
with convolutional networks. In IEEE In- Orr, and Klaus-Robert Müller. Effi-
ternational Conference on Computer Vision, cient backprop. In Grégoire Montavon,
ICCV 2015, Santiago, Chile, December 7-13, Genevieve B. Orr, and Klaus-Robert Müller,
2015, pages 2479–2487, 2015. editors, Neural Networks: Tricks of the Trade
- Second Edition, volume 7700 of Lecture
[185] John D. Lafferty, Andrew McCallum, and Notes in Computer Science, pages 9–48.
Fernando C. N. Pereira. Conditional ran- Springer, 2012. URL https://doi.org/10.
dom fields: Probabilistic models for segment- 1007/978-3-642-35289-8_3.
ing and labeling sequence data. In Pro-
ceedings of the Eighteenth International Con- [191] Byungjae Lee, Enkhbayar Erdenee, Song-
ference on Machine Learning, ICML ’01, Guo Jin, Mi Young Nam, Young Giu Jung,
pages 282–289, San Francisco, CA, USA, and Phill-Kyu Rhee. Multi-class multi-object
2001. URL http://dl.acm.org/citation. tracking using changing point detection. In
cfm?id=645530.655813. Gang Hua and Hervé Jégou, editors, Com-
puter Vision - ECCV 2016 - 14th European
[186] Darius Lam, Richard Kuzma, Kevin McGee, Conference, Amsterdam, The Netherlands,
Samuel Dooley, Michael Laielli, Matthew October 11-14, 2016, volume 9914 of Lec-
Klaric, Yaroslav Bulatov, and Brendan Mc- ture Notes in Computer Science, pages 68–
Cord. xview: Objects in context in overhead 83, 2016. URL https://doi.org/10.1007/
imagery. CoRR, abs/1802.07856, 2018. URL 978-3-319-48881-3_6.
http://arxiv.org/abs/1802.07856.
[192] Kyoungmin Lee, Jaeseok Choi, Jisoo Jeong,
[187] Christoph H. Lampert, Matthew B. and Nojun Kwak. Residual features and uni-
Blaschko, and Thomas Hofmann. Be- fied prediction network for single stage de-
yond sliding windows: Object localization tection. CoRR, abs/1707.05031, 2017. URL
by efficient subwindow search. In 2008 IEEE http://arxiv.org/abs/1707.05031.
Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR [193] Youngwan Lee, Huieun Kim, Eunsoo Park,
2008), 24-26 June 2008, Anchorage, Alaska, Xuenan Cui, and Hakil Kim. Wide-residual-
USA, 2008. inception networks for real-time object detec-
tion. In Intelligent Vehicles Symposium (IV),
[188] Dmitry Laptev, Nikolay Savinov, Joachim M. 2017 IEEE, pages 758–764, 2017.
Buhmann, and Marc Pollefeys. TI-
POOLING: transformation-invariant pooling [194] Joseph Lemley, Shabab Bazrafkan, and Peter
for feature learning in convolutional neural Corcoran. Smart augmentation learning an
networks. In 2016 IEEE Conference on Com- optimal data augmentation strategy. IEEE
puter Vision and Pattern Recognition, CVPR Access, 5:5858–5869, 2017.

68
[195] Bo Li. 3D Fully Convolutional Network for Conference on Computer Vision and Pat-
Vehicle Detection in Point Cloud. In IROS, tern Recognition, CVPR 2017, Honolulu, HI,
2017. USA, July 21-26, 2017, pages 1951–1959.
IEEE Computer Society, 2017.
[196] Bo Li, Tianfu Wu, Shuai Shao, Lun Zhang,
and Rufeng Chu. Object detection via end- [202] Xiaofei Li, Fabian Flohr, Yue Yang, Hui
to-end integration of aspect ratio and con- Xiong, Markus Braun, Shuyue Pan, Keqiang
text aware part-based models and fully con- Li, and Dariu M Gavrila. A new benchmark
volutional networks. CoRR, abs/1612.00534, for vision-based cyclist detection. In Intelli-
2016. URL http://arxiv.org/abs/1612. gent Vehicles Symposium (IV), 2016 IEEE,
00534. pages 1028–1033, 2016.

[197] Bo Li, Tianlei Zhang, and Tian Xia. [203] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang
Vehicle detection from 3d lidar using Ji, and Yichen Wei. Fully convolutional
fully convolutional network. In David instance-aware semantic segmentation. In
Hsu, Nancy M. Amato, Spring Berman, 2017 IEEE Conference on Computer Vision
and Sam Ade Jacobs, editors, Robotics: and Pattern Recognition, CVPR 2017, Hon-
Science and Systems XII, University of olulu, HI, USA, July 21-26, 2017, pages
Michigan, Ann Arbor, Michigan, USA, 4438–4446, 2017. doi: 10.1109/CVPR.
June 18 - June 22, 2016, 2016. URL 2017.472. URL https://doi.org/10.1109/
http://www.roboticsproceedings.org/ CVPR.2017.472.
rss12/p42.html.
[204] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun
[198] Haoxiang Li, Zhe Lin, Xiaohui Shen, Wang, and Xiaogang Wang. Scene graph gen-
Jonathan Brandt, and Gang Hua. A convolu- eration from objects, phrases and caption re-
tional neural network cascade for face detec- gions. CoRR, abs/1707.09700, 2017. URL
tion. In IEEE Conference on Computer Vi- http://arxiv.org/abs/1707.09700.
sion and Pattern Recognition, CVPR 2015, [205] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo
Boston, MA, USA, June 7-12, 2015, pages Li. Tiny-DSOD: Lightweight Object Detec-
5325–5334, 2015. tion for Resource-Restricted Usages. In Pro-
ceedings of the British Machine Vision Con-
[199] Hongyang Li, Yu Liu, Wanli Ouyang, and
ference 2018, BMVC 2018, Newcastle, UK,
Xiaogang Wang. Zoom out-and-in network
September 3-6, 2018, July 2018.
with recursive training for object proposal.
CoRR, abs/1702.05711, 2017. URL http: [206] Zeming Li, Chao Peng, Gang Yu, Xiangyu
//arxiv.org/abs/1702.05711. Zhang, Yangdong Deng, and Jian Sun. Light-
head R-CNN: in defense of two-stage object
[200] Jianan Li, Xiaodan Liang, ShengMei Shen, detector. CoRR, abs/1711.07264, 2017. URL
Tingfa Xu, Jiashi Feng, and Shuicheng Yan. http://arxiv.org/abs/1711.07264.
Scale-aware fast r-cnn for pedestrian detec-
tion. IEEE Transactions on Multimedia, [207] Zeming Li, Yilun Chen, Gang Yu, and Yang-
2017. dong Deng. R-FCN++: Towards Accurate
Region-Based Fully Convolutional Networks
[201] Jianan Li, Xiaodan Liang, Yunchao Wei, for Object Detection. In AAAI, page 8, 2018.
Tingfa Xu, Jiashi Feng, and Shuicheng Yan.
Perceptual generative adversarial networks [208] Zeming Li, Chao Peng, Gang Yu, Xiangyu
for small object detection. In 2017 IEEE Zhang, Yangdong Deng, and Jian Sun. Det-

69
net: A backbone network for object detec- [216] Tsung-Yi Lin, Priya Goyal, Ross B. Gir-
tion. CoRR, abs/1804.06215, 2018. URL shick, Kaiming He, and Piotr Dollár. Fo-
http://arxiv.org/abs/1804.06215. cal loss for dense object detection. In IEEE
International Conference on Computer Vi-
[209] Zhizhong Li and Derek Hoiem. Learning
sion, ICCV 2017, Venice, Italy, October 22-
without Forgetting. IEEE Transactions on
29, 2017, pages 2999–3007. IEEE Computer
Pattern Analysis and Machine Intelligence,
Society, 2017.
(to appear), 2018.
[217] Zhe Lin, Larry S. Davis, David S. Doer-
[210] Zuoxin Li and Fuqiang Zhou. FSSD: feature mann, and Daniel DeMenthon. Hierarchi-
fusion single shot multibox detector. CoRR, cal part-template matching for human detec-
abs/1712.00960, 2017. URL http://arxiv. tion and segmentation. In IEEE 11th In-
org/abs/1712.00960. ternational Conference on Computer Vision,
[211] Minghui Liao, Baoguang Shi, and Xiang Bai. ICCV 2007, Rio de Janeiro, Brazil, October
Textboxes++: A single-shot oriented scene 14-20, 2007, pages 1–8, 2007.
text detector. CoRR, abs/1801.02765, 2018. [218] Zhouhan Lin, Matthieu Courbariaux, Roland
URL http://arxiv.org/abs/1801.02765. Memisevic, and Yoshua Bengio. Neural
[212] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui- networks with few multiplications. CoRR,
Song Xia, and Xiang Bai. Rotation-sensitive abs/1510.03009, 2015. URL http://arxiv.
regression for oriented scene text detection. org/abs/1510.03009.
CoRR, abs/1803.05265, 2018. URL http:// [219] Kang Liu and Gellert Mattyus. Fast multi-
arxiv.org/abs/1803.05265. class vehicle detection on aerial images. IEEE
[213] Yuan Liao, Xiaoqing Lu, Chengcui Zhang, Geoscience and Remote Sensing Letters, 12
Yongtao Wang, and Zhi Tang. Mutual En- (9):1938–1942, 2015.
hancement for Detection of Multiple Logos in [220] Li Liu, Wanli Ouyang, Xiaogang Wang, Paul
Sports Videos. In IEEE International Con- Fieguth, Jie Chen, Xinwang Liu, and Matti
ference on Computer Vision, ICCV 2017, Pietikäinen. Deep learning for generic object
Venice, Italy, October 22-29, 2017, pages detection: A survey. CoRR, abs/1809.02165,
4856–4865, October 2017. 2018. URL https://arxiv.org/abs/1809.
02165.
[214] Tsung-Yi Lin, Michael Maire, Serge Be-
longie, James Hays, Pietro Perona, Deva Ra- [221] Wei Liu, Dragomir Anguelov, Dumitru Er-
manan, Piotr Dollár, and C Lawrence Zit- han, Christian Szegedy, Scott Reed, Cheng-
nick. Microsoft coco: Common objects in Yang Fu, and Alexander C Berg. Ssd: Sin-
context. In Computer Vision - ECCV 2014 - gle shot multibox detector. In Computer Vi-
13th European Conference, Zurich, Switzer- sion - ECCV 2016 - 14th European Confer-
land, September 6-12, 2014, pages 740–755, ence, Amsterdam, The Netherlands, October
2014. 11-14, 2016, pages 21–37, 2016.
[215] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, [222] Yuliang Liu and Lianwen Jin. Deep matching
Kaiming He, Bharath Hariharan, and Serge prior network: Toward tighter multi-oriented
Belongie. Feature pyramid networks for ob- text detection. In 2017 IEEE Conference on
ject detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition,
Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-
CVPR 2017, Honolulu, HI, USA, July 21-26, 26, 2017, pages 3454–3461. IEEE Computer
2017, volume 1, page 4, 2017. Society, 2017.

70
[223] Zikun Liu, Liu Yuan, Lubin Weng, and Yip- Society, 2018. doi: 10.1109/CVPR.2018.
ing Yang. A high resolution optical satellite 00071.
image dataset for ship recognition and some
new baselines. In ICPRAM, pages 324–331, [231] Jiayuan Mao, Tete Xiao, Yuning Jiang, and
2017. Zhimin Cao. What can help pedestrian de-
tection? In 2017 IEEE Conference on
[224] David G Lowe. Object recognition from local Computer Vision and Pattern Recognition
scale-invariant features. In Computer vision, (CVPR), pages 6034–6043, 2017.
1999. The proceedings of the seventh IEEE
international conference on, volume 2, pages [232] V.Y. Mariano, Junghye Min, Jin-Hyeong
1150–1157, 1999. Park, R. Kasturi, D. Mihalcik, Huiping Li,
D. Doermann, and T. Drayer. Performance
[225] David G Lowe. Distinctive image features
evaluation of object detection algorithms. In
from scale-invariant keypoints. International
International Conference on Pattern Recog-
Journal of Computer Vision (IJCV), 60(2):
nition (ICPR), volume 3, pages 965–969,
91–110, 2004.
2002.
[226] Jiajun Lu, Hussein Sibai, Evan Fabry, and
David A. Forsyth. Standard detectors aren’t [233] Oded Maron and Tomás Lozano-Pérez. A
(currently) fooled by physical adversarial framework for multiple-instance learning. In
stop signs. CoRR, abs/1710.03337, 2017. Michael I. Jordan, Michael J. Kearns, and
URL http://arxiv.org/abs/1710.03337. Sara A. Solla, editors, Advances in Neural
Information Processing Systems 10, [NIPS
[227] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, Conference, Denver, Colorado, USA, 1997],
S. Wong, and R. Young. Icdar 2003 robust pages 570–576. The MIT Press, 1997. URL
reading competitions. In Seventh Interna- http://papers.nips.cc/paper/1346-
tional Conference on Document Analysis and a-framework-for-multiple-instance-
Recognition, 2003. Proceedings., pages 682– learning.
687, Aug 2003.
[234] Marc Masana, Joost van de Weijer, and
[228] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang,
Andrew D. Bagdanov. On-the-fly net-
Hong Wang, Yingbin Zheng, and Xiangyang
work pruning for object detection. CoRR,
Xue. Arbitrary-Oriented Scene Text Detec-
abs/1605.03477, 2016. URL http://arxiv.
tion via Rotation Proposals. IEEE Transac-
org/abs/1605.03477.
tions on Multimedia, pages 1–1, 2018.
[229] Santiago Manen, Matthieu Guillaumin, and [235] Marc Masana, Joost van de Weijer, Luis Her-
Luc Van Gool. Prime object proposals with ranz, Andrew D. Bagdanov, and Jose M.
randomized prim’s algorithm. In IEEE In- Álvarez. Domain-adaptive deep network
ternational Conference on Computer Vision, compression. In IEEE International Con-
ICCV 2013, Sydney, Australia, December 1- ference on Computer Vision, ICCV 2017,
8, 2013, pages 2536–2543, 2013. Venice, Italy, October 22-29, 2017, pages
4299–4307. IEEE Computer Society, 2017.
[230] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi
Pont-Tuset, and Luc Van Gool. Deep ex- [236] Francisco Massa, Bryan C. Russell, and
treme cut: From extreme points to object Mathieu Aubry. Deep Exemplar 2D-3D De-
segmentation. In Computer Vision and Pat- tection by Adapting from Real to Rendered
tern Recognition (CVPR), 2018 IEEE Con- Views. 2016 IEEE Conference on Com-
ference on, pages 616–625. IEEE Computer puter Vision and Pattern Recognition, CVPR

71
2016, Las Vegas,NV, USA, June 27-30, 2016, USA, June 7-12, 2015, pages 3593–3602,
pages 6024–6033, 2016. June 2015.
[237] Ofer Matan, Henry S. Baird, Jane Bromley, [244] Chaitanya Mitash, Kun Wang, Kostas E
Christopher J. C. Burges, John S. Denker, Bekris, and Abdeslam Boularias. Physics-
Lawrence D. Jackel, Yann Le Cun, Ed- aware Self-supervised Training of CNNs for
win P. D. Pednault, William D Satterfield, Object Detection. In IEEE International
Charles E. Stenard, et al. Reading handwrit- Conference on Robotics and Automation
ten digits: A zip code recognition system. (ICRA), 2017.
IEEE Computer, 25(7):59–63, 1992.
[245] T M Mitchell. Never-Ending Learning. Com-
[238] Brianna Maze, Jocelyn Adams, James A mun. ACM, 61(5):103–115, 2018.
Duncan, Nathan Kalka, Tim Miller, Charles
[246] A. Mogelmose, M. M. Trivedi, and T. B.
Otto, Anil K Jain, W Tyler Niggel, Janet An-
Moeslund. Vision-Based Traffic Sign De-
derson, Jordan Cheney, and Patrick Grother.
tection and Analysis for Intelligent Driver
IARPA Janus Benchmark – C: Face Dataset
Assistance Systems: Perspectives and Sur-
and Protocol. In ICB, page 8, 2018.
vey. IEEE Transactions on Intelligent Trans-
[239] John McCormac, Ankur Handa, Stefan portation Systems, 13:1484–1497, November
Leutenegger, and Andrew J. Davison. 2012.
Scenenet RGB-D: can 5m synthetic images
[247] Taylor Mordan, Nicolas Thome, Matthieu
beat generic imagenet pre-training on indoor
Cord, and Gilles Henaff. Deformable Part-
segmentation? In IEEE International Con-
based Fully Convolutional Network for Ob-
ference on Computer Vision, ICCV 2017,
ject Detection. In Proceedings of the British
Venice, Italy, October 22-29, 2017, pages
Machine Vision Conference 2017, BMVC
2697–2706. IEEE Computer Society, 2017.
2017, London, UK, September 4-7, 2017,
[240] Kazuki Minemura, Hengfui Liau, Abraham 2017.
Monrroy, and Shinpei Kato. Lmnet: Real-
[248] Taylor Mordan, Nicolas Thome, Gilles
time multiclass object detection on CPU us-
Henaff, and Matthieu Cord. End-to-End
ing 3d lidar. CoRR, abs/1805.04902, 2018.
Learning of Latent Deformable Part-Based
URL http://arxiv.org/abs/1805.04902.
Representations for Object Detection. Inter-
[241] A. Mishra, S. Nandan Rai, A. Mishra, and national Journal of Computer Vision, 2018.
C. V. Jawahar. IIIT-CFW: A Benchmark doi: 10.1007/s11263-018-1109-z.
Database of Cartoon Faces in the Wild. In
[249] Arsalan Mousavian, Dragomir Anguelov,
VASE ECCVW, 2016.
John Flynn, and Jana Kosecka. 3d bound-
[242] Anand Mishra, Karteek Alahari, and ing box estimation using deep learning and
CV Jawahar. Scene text recognition using geometry. In 2017 IEEE Conference on Com-
higher order language priors. In British puter Vision and Pattern Recognition, CVPR
Machine Vision Conference, BMVC 2012, 2017, Honolulu, HI, USA, July 21-26, 2017,
Surrey, UK, September 3-7, 2012, 2012. pages 5632–5640. IEEE Computer Society,
2017.
[243] I. Misra, A. Shrivastava, and M. Hebert.
Watch and learn: Semi-supervised learning [250] Damian Mrowca, Marcus Rohrbach, Judy
of object detectors from videos. In IEEE Hoffman, Ronghang Hu, Kate Saenko, and
Conference on Computer Vision and Pat- Trevor Darrell. Spatial semantic regularisa-
tern Recognition, CVPR 2015, Boston, MA, tion for large scale object detection. In IEEE

72
International Conference on Computer Vi- neural networks for graphs. In International
sion, ICCV 2015, Santiago, Chile, December conference on machine learning, pages 2014–
7-13, 2015, pages 2003–2011, 2015. 2023, 2016.
[251] Seongkyu Mun, Sangwook Park, David K [258] Steven J Nowlan and John C Platt. A convo-
Han, and Hanseok Ko. Generative adversar- lutional neural network hand tracker. In Ad-
ial network based acoustic scene training set vances in Neural Information Processing Sys-
augmentation and selection using svm hyper- tems 8, NIPS, Denver, CO, USA, November
plane. Proc. DCASE, pages 93–97, 2017. 27-30, 1995, pages 901–908, 1995.
[252] T Nathan Mundhenk, Goran Konjevod, We- [259] Jean Ogier Du Terrail and Frédéric Ju-
sam A Sakla, and Kofi Boakye. A large con- rie. ON THE USE OF DEEP NEURAL
textual dataset for classification, detection NETWORKS FOR THE DETECTION OF
and counting of cars with deep learning. In SMALL VEHICLES IN ORTHO-IMAGES.
Computer Vision - ECCV 2016 - 14th Eu- In IEEE International Conference on Im-
ropean Conference, Amsterdam, The Nether- age Processing, Beijing, China, Septem-
lands, October 11-14, 2016, pages 785–800, ber 2017. URL https://hal.archives-
2016. ouvertes.fr/hal-01527906.
[253] Hajime Nada, Vishwanath A. Sindagi, [260] Kemal Oksuz, Baris Can Cam, Emre Ak-
He Zhang, and Vishal M. Patel. Push- bas, and Sinan Kalkan. Localization Recall
ing the limits of unconstrained face detec- Precision (LRP): A New Performance Metric
tion: a challenge dataset and baseline re- for Object Detection. In Computer Vision
sults. CoRR, abs/1804.10275, 2018. URL - ECCV 2018 - 15th European Conference,
http://arxiv.org/abs/1804.10275. Munich, Germany, September 8 - 14, 2018,
[254] Mahyar Najibi, Mohammad Rastegari, and July 2018.
Larry S. Davis. G-CNN: An Iterative Grid [261] M. Oquab, L. Bottou, I. Laptev, and J. Sivic.
Based Object Detector. In 2016 IEEE Con- Weakly supervised object recognition with
ference on Computer Vision and Pattern convolutional neural networks. In Advances
Recognition, CVPR 2016, Las Vegas,NV, in Neural Information Processing Systems
USA, June 27-30, 2016, 2016. 27: Annual Conference on Neural Informa-
[255] Mahyar Najibi, Pouya Samangouei, Rama tion Processing Systems 2014, December 8-13
Chellappa, and Larry Davis. SSH: Single 2014, Montreal, Quebec, Canada, 2014.
Stage Headless Face Detector. In IEEE In-
[262] Maxime Oquab, Léon Bottou, Ivan Laptev,
ternational Conference on Computer Vision,
and Josef Sivic. Is object localization for
ICCV 2017, Venice, Italy, October 22-29,
free? - weakly-supervised learning with con-
2017, 2017.
volutional neural networks. In IEEE Confer-
[256] Alejandro Newell, Kaiyu Yang, and Jia Deng. ence on Computer Vision and Pattern Recog-
Stacked hourglass networks for human pose nition, CVPR 2015, Boston, MA, USA, June
estimation. In Computer Vision - ECCV 7-12, 2015, pages 685–694, 2015.
2016 - 14th European Conference, Amster-
[263] Margarita Osadchy, Yann Le Cun, and
dam, The Netherlands, October 11-14, 2016,
Matthew L Miller. Synergistic face detection
pages 483–499, 2016.
and pose estimation with energy-based mod-
[257] Mathias Niepert, Mohamed Ahmed, and els. Journal of Machine Learning Research,
Konstantin Kutzkov. Learning convolutional 8(May):1197–1215, 2007.

73
[264] W. Ouyang, X. Wang, and C. Zhang. Fac- class detectors using only human verifica-
tors in finetuning deep model for object de- tion. In 2016 IEEE Conference on Com-
tection with long-tail distribution. In 2016 puter Vision and Pattern Recognition, CVPR
IEEE Conference on Computer Vision and 2016, Las Vegas,NV, USA, June 27-30, 2016,
Pattern Recognition, CVPR 2016, Las Ve- February 2016.
gas,NV, USA, June 27-30, 2016, 2016.
[271] Dim P. Papadopoulos, Jasper R. R. Uijlings,
[265] Wanli Ouyang and Xiaogang Wang. Joint Frank Keller, and Vittorio Ferrari. Training
deep learning for pedestrian detection. In object class detectors with click supervision.
IEEE International Conference on Computer In 2017 IEEE Conference on Computer Vi-
Vision, ICCV 2013, Sydney, Australia, De- sion and Pattern Recognition, CVPR 2017,
cember 1-8, 2013, 2013. Honolulu, HI, USA, July 21-26, 2017, pages
180–189. IEEE Computer Society, 2017.
[266] Wanli Ouyang and Xiaogang Wang. Single-
pedestrian detection aided by multi- [272] Constantine Papageorgiou and Tomaso Pog-
pedestrian detection. In 2013 IEEE gio. A trainable system for object detec-
Conference on Computer Vision and Pattern tion. International Journal of Computer Vi-
Recognition, Portland, OR, USA, June sion (IJCV), 38(1):15–33, 2000.
23-28, 2013, pages 3198–3205, 2013. [273] Bo Peng, Wenming Tan, Zheyang Li, Shun
Zhang, Di Xie, and Shiliang Pu. Extreme
[267] Wanli Ouyang, Xiaogang Wang, Xingyu
network compression via filter group approx-
Zeng, Shi Qiu, Ping Luo, Yonglong Tian,
imation. CoRR, abs/1807.11254, 2018. URL
Hongsheng Li, Shuo Yang, Zhe Wang, Chen-
http://arxiv.org/abs/1807.11254.
Change Loy, and Xiaoou Tang. DeepID-
Net: Deformable deep convolutional neural [274] Chao Peng, Tete Xiao, Zeming Li, Yuning
networks for object detection. In Advances Jiang, Xiangyu Zhang, Kai Jia, Gang Yu,
in Neural Information Processing Systems and Jian Sun. Megdet: A large mini-batch
28: Annual Conference on Neural Informa- object detector. CoRR, abs/1711.07240,
tion Processing Systems 2015, December 7- 2017. URL http://arxiv.org/abs/1711.
12, 2015, Montreal, Quebec, Canada, 2015. 07240.
[268] Wanli Ouyang, Ku Wang, Xin Zhu, and Xi- [275] Chao Peng, Xiangyu Zhang, Gang Yu, Guim-
aogang Wang. Learning chained deep fea- ing Luo, and Jian Sun. Large kernel mat-
tures and classifiers for cascade in object de- ters???improve semantic segmentation by
tection. CoRR, abs/1702.07054, 2017. URL global convolutional network. In 2017 IEEE
http://arxiv.org/abs/1702.07054. Conference on Computer Vision and Pat-
tern Recognition, CVPR 2017, Honolulu, HI,
[269] Xi Ouyang, Yu Cheng, Yifan Jiang, Chun- USA, July 21-26, 2017, pages 1743–1751,
Liang Li, and Pan Zhou. Pedestrian- 2017.
synthesis-gan: Generating pedestrian data
in real scene and beyond. CoRR, [276] Xingchao Peng and Kate Saenko. Synthetic
abs/1804.02047, 2018. URL http://arxiv. to real adaptation with generative correlation
org/abs/1804.02047. alignment networks. In 2018 IEEE Winter
Conference on Applications of Computer Vi-
[270] Dim P. Papadopoulos, Jasper R. R. Uijlings, sion, WACV 2018, Lake Tahoe, NV, USA,
Frank Keller, and Vittorio Ferrari. We don’t March 12-15, 2018, pages 1982–1991. IEEE
need no bounding-boxes: Training object Computer Society, 2018.

74
[277] Xingchao Peng, Baochen Sun, Karim Ali Convolutional Networks. In IEEE Confer-
0002, and Kate Saenko. Learning Deep Ob- ence on Computer Vision and Pattern Recog-
ject Detectors from 3D Models. In IEEE In- nition, CVPR 2015, Boston, MA, USA, June
ternational Conference on Computer Vision, 7-12, 2015, 2015.
ICCV 2015, Santiago, Chile, December 7-13,
2015, 2015. [284] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Col-
lobert, and Piotr Dollár. Learning to re-
[278] Alex Pentland, Baback Moghaddam, and fine object segments. In Computer Vision
Thad Starner. View-based and modular - ECCV 2016 - 14th European Conference,
eigenspaces for face recognition. In Confer- Amsterdam, The Netherlands, October 11-
ence on Computer Vision and Pattern Recog- 14, 2016, pages 75–91, 2016.
nition, CVPR 1994, 21-23 June, 1994, Seat-
tle, WA, USA, pages 84–91, 1994. [285] Alex D. Pon, Oles Andrienko, Ali Harakeh,
and Steven L. Waslander. A Hierarchical
[279] Bojan Pepik, Rodrigo Benenson, Tobias Deep Architecture and Mini-Batch Selection
Ritschel, and Bernt Schiele. What is hold- Method For Joint Traffic Sign and Light De-
ing back convnets for detection? In German tection. In IEEE Conference on Computer
Conference on Pattern Recognition, pages and Robot Vision, June 2018.
517–528, 2015.
[286] Jordi Pont-Tuset, Pablo Arbelaez,
[280] Luis Perez and Jason Wang. The ef- Jonathan T Barron, Ferran Marques, and
fectiveness of data augmentation in image Jitendra Malik. Multiscale combinatorial
classification using deep learning. CoRR, grouping for image segmentation and object
abs/1712.04621, 2017. URL http://arxiv. proposal generation. IEEE Transactions on
org/abs/1712.04621. Pattern Analysis and Machine Intelligence,
39(1):128–140, 2017.
[281] Phuoc Pham, Duy Nguyen, Tien Do,
Thanh Duc Ngo, and Duy-Dinh Le. Eval- [287] Fatih Murat Porikli. Integral histogram: A
uation of Deep Models for Real-Time Small fast way to extract histograms in cartesian
Object Detection. ICONIP, 10636:516–526, spaces. In 2005 IEEE Computer Society
2017. Conference on Computer Vision and Pattern
Recognition (CVPR 2005), 20-26 June 2005,
[282] Pedro H. O. Pinheiro, Ronan Collobert, San Diego, CA, USA, pages 829–836, 2005.
and Piotr Dollár. Learning to segment ob-
ject candidates. In Corinna Cortes, Neil D. [288] Charles R. Qi, Hao Su, Kaichun Mo, and
Lawrence, Daniel D. Lee, Masashi Sugiyama, Leonidas J. Guibas. Pointnet: Deep learning
and Roman Garnett, editors, Advances in on point sets for 3d classification and segmen-
Neural Information Processing Systems 28: tation. In 2017 IEEE Conference on Com-
Annual Conference on Neural Informa- puter Vision and Pattern Recognition, CVPR
tion Processing Systems 2015, December 2017, Honolulu, HI, USA, July 21-26, 2017,
7-12, 2015, Montreal, Quebec, Canada, July 2017.
pages 1990–1998, 2015. URL http://
papers.nips.cc/paper/5852-learning- [289] Charles Ruizhongtai Qi, Wei Liu, Chenxia
to-segment-object-candidates. Wu, Hao Su, and Leonidas J. Guibas. Frus-
tum pointnets for 3d object detection from
[283] Pedro O. Pinheiro and Ronan Collobert. RGB-D data. CoRR, abs/1711.08488, 2017.
From Image-level to Pixel-level Labeling with URL http://arxiv.org/abs/1711.08488.

75
[290] Charles Ruizhongtai Qi, Li Yi, Hao Su, and [295] Rakesh N. Rajaram, Eshed Ohn-Bar, and
Leonidas J. Guibas. Pointnet++: Deep Mohan M. Trivedi. RefineNet: Iterative
hierarchical feature learning on point sets in refinement for accurate object localization.
a metric space. In Isabelle Guyon, Ulrike von In IEEE 19th International Conference on
Luxburg, Samy Bengio, Hanna M. Wallach, Intelligent Transportation Systems (ITSC),
Rob Fergus, S. V. N. Vishwanathan, and pages 1528–1533, November 2016.
Roman Garnett, editors, Advances in Neural
Information Processing Systems 30: Annual [296] Param S. Rajpura, Ravi S. Hegde, and
Conference on Neural Information Process- Hristo Bojinov. Object detection using deep
ing Systems 2017, 4-9 December 2017, Long cnns trained on synthetic images. CoRR,
Beach, CA, USA, pages 5105–5114, 2017. abs/1706.06782, 2017. URL http://arxiv.
URL http://papers.nips.cc/paper/ org/abs/1706.06782.
7095-pointnet-deep-hierarchical-
feature-learning-on-point-sets-in-a- [297] Rajeev Ranjan, Vishal M. Patel, and Rama
metric-space. Chellappa. A deep pyramid deformable part
model for face detection. In IEEE 7th In-
[291] Weichao Qiu and Alan L. Yuille. Unrealcv: ternational Conference on Biometrics The-
Connecting computer vision to unreal en- ory, Applications and Systems, BTAS 2015,
gine. In Gang Hua and Hervé Jégou, editors, Arlington, VA, USA, September 8-11, 2015,
Computer Vision - ECCV 2016 - 14th Eu- pages 1–8. IEEE, 2015.
ropean Conference, Amsterdam, The Nether-
lands, October 11-14, 2016, volume 9915 of [298] Pekka Rantalankila, Juho Kannala, and Esa
Lecture Notes in Computer Science, pages Rahtu. Generating object segmentation pro-
909–916, 2016. URL https://doi.org/10. posals using global and local search. In 2014
1007/978-3-319-49409-8_75. IEEE Conference on Computer Vision and
Pattern Recognition, CVPR 2014, Columbus,
[292] Shafin Rahman, Salman Hameed Khan,
OH, USA, June 23-28, 2014, pages 2417–
and Fatih Porikli. Zero-shot object de-
2424, 2014.
tection: Learning to simultaneously recog-
nize and localize novel concepts. CoRR,
[299] Mohammad Rastegari, Vicente Ordonez,
abs/1803.06049, 2018. URL http://arxiv.
Joseph Redmon, and Ali Farhadi. Xnor-net:
org/abs/1803.06049.
Imagenet classification using binary convo-
[293] Esa Rahtu, Juho Kannala, and Matthew lutional neural networks. In Computer Vi-
Blaschko. Learning a category independent sion - ECCV 2016 - 14th European Confer-
object detection cascade. In IEEE Inter- ence, Amsterdam, The Netherlands, October
national Conference on Computer Vision, 11-14, 2016, pages 525–542, 2016.
ICCV 2011, Barcelona, Spain, November 6-
13, 2011, pages 1052–1059, 2011. [300] Alexander J Ratner, Henry Ehrenberg, Ze-
shan Hussain, Jared Dunnmon, and Christo-
[294] Anant Raj, Vinay P. Namboodiri, and Tinne pher Ré. Learning to compose domain-
Tuytelaars. Subspace Alignment Based Do- specific transformations for data augmenta-
main Adaptation for RCNN Detector. In tion. In Advances in Neural Information
Proceedings of the British Machine Vision Processing Systems 30: Annual Conference
Conference 2015, BMVC 2015, Swansea, on Neural Information Processing Systems
UK, September 7-10, 2015, pages 166.1– 2017, 4-9 December 2017, Long Beach, CA,
166.11, Swansea, 2015. USA, pages 3236–3246, 2017.

76
[301] Kumar S. Ray, Vijayan K. Asari, and Soma Unified, real-time object detection. In 2016
Chakraborty. Object detection by spatio- IEEE Conference on Computer Vision and
temporal analysis and tracking of the de- Pattern Recognition, CVPR 2016, Las Ve-
tected objects in a video with variable back- gas,NV, USA, June 27-30, 2016, pages 779–
ground. CoRR, abs/1705.02949, 2017. URL 788, 2016.
http://arxiv.org/abs/1705.02949.
[309] Shaoqing Ren, Kaiming He, Ross Girshick,
[302] Sébastien Razakarivony and Frédéric Jurie. and Jian Sun. Faster r-cnn: Towards real-
Vehicle detection in aerial imagery: A small time object detection with region proposal
target detection benchmark. Journal of Vi- networks. In Advances in Neural Informa-
sual Communication and Image Representa- tion Processing Systems 28: Annual Confer-
tion, 34:187–203, 2016. ence on Neural Information Processing Sys-
tems 2015, December 7-12, 2015, Montreal,
[303] Esteban Real, Jonathon Shlens, Stefano Quebec, Canada, pages 91–99, 2015.
Mazzocchi, Xin Pan, and Vincent Van-
houcke. Youtube-boundingboxes: A large [310] Shaoqing Ren, Kaiming He, Ross B. Gir-
high-precision human-annotated data set for shick, Xiangyu Zhang, and Jian Sun. Ob-
object detection in video. In 2017 IEEE ject detection networks on convolutional fea-
Conference on Computer Vision and Pat- ture maps. IEEE Transactions on Pattern
tern Recognition, CVPR 2017, Honolulu, HI, Analysis and Machine Intelligence, 39(7):
USA, July 21-26, 2017, pages 7464–7473. 1476–1481, 2017. URL https://doi.org/
IEEE Computer Society, 2017. 10.1109/TPAMI.2016.2601099.

[304] Sashank J Reddi, Satyen Kale, and Sanjiv [311] M. Rochan and Yang Wang. Weakly super-
Kumar. On the convergence of adam and be- vised localization of novel objects using ap-
yond. In International Conference on Learn- pearance transfer. In IEEE Conference on
ing Representations (ICLR), 2018. Computer Vision and Pattern Recognition,
CVPR 2015, Boston, MA, USA, June 7-12,
[305] Joseph Redmon and Anelia Angelova. Real- 2015, 2015.
time grasp detection using convolutional neu-
ral networks. In IEEE International Confer- [312] Mikel Rodriguez, Ivan Laptev, Josef Sivic,
ence on Robotics and Automation (ICRA), and Jean-Yves Audibert. Density-aware
2015. person detection and tracking in crowds.
In IEEE International Conference on Com-
[306] Joseph Redmon and Ali Farhadi. puter Vision, ICCV 2011, Barcelona, Spain,
YOLO9000: better, faster, stronger. In November 6-13, 2011, pages 2423–2430,
2017 IEEE Conference on Computer Vision 2011.
and Pattern Recognition, CVPR 2017,
Honolulu, HI, USA, July 21-26, 2017, pages [313] Stefan Romberg, Lluis Garcia Pueyo, Rainer
6517–6525. IEEE Computer Society, 2017. Lienhart, and Roelof Van Zwol. Scalable logo
recognition in real-world images. In Proceed-
[307] Joseph Redmon and Ali Farhadi. Yolov3: ings of the 1st ACM International Confer-
An incremental improvement. CoRR, ence on Multimedia Retrieval, page 25, 2011.
abs/1804.02767, 2018. URL http://arxiv.
org/abs/1804.02767. [314] Amir Rosenfeld, Richard Zemel, and John K.
Tsotsos. The elephant in the room. CoRR,
[308] Joseph Redmon, Santosh Divvala, Ross Gir- abs/1808.03305, 2018. URL http://arxiv.
shick, and Ali Farhadi. You only look once: org/abs/1808.03305.

77
[315] Rasmus Rothe, Matthieu Guillaumin, and Springs, CO, USA, 20-25 June 2011, pages
Luc Van Gool. Non-maximum suppression 1745–1752, 2011.
for object detection by passing messages be-
tween windows. In Computer Vision - ACCV [322] Mohammad Amin Sadeghi and David A.
2014 - 12th Asian Conference on Computer Forsyth. 30hz object detection with DPM
Vision, Singapore, Singapore, November 1-5, V5. In David J. Fleet, Tomás Pajdla,
2014, pages 290–306, 2014. Bernt Schiele, and Tinne Tuytelaars, edi-
tors, Computer Vision - ECCV 2014 - 13th
[316] Soumya Roy, Vinay P. Namboodiri, and European Conference, Zurich, Switzerland,
Arijit Biswas. Active learning with ver- September 6-12, 2014, volume 8689 of Lec-
sion spaces for object detection. CoRR, ture Notes in Computer Science, pages 65–
abs/1611.07285, 2016. URL http://arxiv. 79. Springer, 2014. URL https://doi.org/
org/abs/1611.07285. 10.1007/978-3-319-10590-1_5.

[317] Sitapa Rujikietgumjorn and Robert T [323] Wesam A. Sakla, Goran Konjevod, and
Collins. Optimized pedestrian detection for T. Nathan Mundhenk. Deep multi-modal ve-
multiple and occluded people. In 2013 IEEE hicle detection in aerial ISR imagery. In 2017
Conference on Computer Vision and Pattern IEEE Winter Conference on Applications of
Recognition, Portland, OR, USA, June 23- Computer Vision, WACV 2017, Santa Rosa,
28, 2013, pages 3690–3697, 2013. CA, USA, March 24-31, 2017, pages 916–
923. IEEE, 2017.
[318] David E Rumelhart, Geoffrey E Hinton, and
Ronald J Williams. Learning internal rep- [324] Mark Sandler, Andrew Howard, Menglong
resentations by error propagation. Technical Zhu, Andrey Zhmoginov, and Liang-Chieh
report, California Univ San Diego La Jolla Chen. Mobilenetv2: Inverted residuals and
Inst for Cognitive Science, 1985. linear bottlenecks. In Computer Vision and
Pattern Recognition (CVPR), 2018 IEEE
[319] Olga Russakovsky, Jia Deng, Hao Su, Conference on, pages 4510–4520, 2018.
Jonathan Krause, Sanjeev Satheesh, Sean
[325] P. A. Savalle and S. Tsogkas. Deformable
Ma, Zhiheng Huang, Andrej Karpathy,
part models with cnn features. In SAICSIT
Aditya Khosla, Michael Bernstein, Alexan-
Conf., 2014.
der C. Berg, and Li Fei-Fei. ImageNet Large
Scale Visual Recognition Challenge. Interna- [326] Henry Schneiderman and Takeo Kanade. Ob-
tional Journal of Computer Vision (IJCV), ject detection using the statistics of parts.
115(3):211–252, 2015. International Journal of Computer Vision
(IJCV), 56(3):151–177, 2004.
[320] Payam Sabzmeydani and Greg Mori. Detect-
ing pedestrians by learning shapelet features. [327] Pierre Sermanet, David Eigen, Xiang Zhang,
In 2007 IEEE Computer Society Conference Michaël Mathieu, Rob Fergus, and Yann Le-
on Computer Vision and Pattern Recognition Cun. Overfeat: Integrated recognition, lo-
(CVPR 2007), 18-23 June 2007, Minneapo- calization and detection using convolutional
lis, Minnesota, USA, 2007. networks. CoRR, abs/1312.6229, 2013. URL
http://arxiv.org/abs/1312.6229.
[321] Mohammad Amin Sadeghi and Ali Farhadi.
Recognition using visual phrases. In The 24th [328] Pierre Sermanet, Koray Kavukcuoglu,
IEEE Conference on Computer Vision and Soumith Chintala, and Yann LeCun.
Pattern Recognition, CVPR 2011, Colorado Pedestrian detection with unsupervised

78
multi-stage feature learning. In 2013 IEEE text in the wild (RCTW-17). CoRR,
Conference on Computer Vision and Pattern abs/1708.09585, 2017. URL http://arxiv.
Recognition, Portland, OR, USA, June org/abs/1708.09585.
23-28, 2013, pages 3626–3633, 2013.
[335] Xuepeng Shi, Shiguang Shan, Meina Kan,
[329] Mohammad Javad Shafiee, Brendan Chywl, Shuzhe Wu, and Xilin Chen. Real-time
Francis Li, and Alexander Wong. Fast rotation-invariant face detection with pro-
YOLO: A fast you only look once system gressive calibration networks. In Computer
for real-time embedded object detection in Vision and Pattern Recognition (CVPR),
video. CoRR, abs/1709.05943, 2017. URL 2018 IEEE Conference on, June 2018.
http://arxiv.org/abs/1709.05943.
[336] Konstantin Shmelkov, Cordelia Schmid, and
[330] Yunhan Shen, Rongrong Ji, Shengchuan Karteek Alahari. Incremental learning of ob-
Zhang, Wangmeng Zuo, and Yan Wang. ject detectors without catastrophic forget-
Generative adversarial learning towards fast ting. In IEEE International Conference on
weakly supervised detection. In Computer Computer Vision, ICCV 2017, Venice, Italy,
Vision and Pattern Recognition (CVPR), October 22-29, 2017, pages 3420–3429, 2017.
2018 IEEE Conference on, June 2018.
[337] Abhinav Shrivastava, Abhinav Gupta, and
[331] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-
Ross Girshick. Training region-based object
Gang Jiang, Yurong Chen, and Xiangyang
detectors with online hard example mining.
Xue. Dsod: Learning deeply supervised ob-
In 2016 IEEE Conference on Computer Vi-
ject detectors from scratch. In IEEE In-
sion and Pattern Recognition, CVPR 2016,
ternational Conference on Computer Vision,
Las Vegas,NV, USA, June 27-30, 2016, pages
ICCV 2017, Venice, Italy, October 22-29,
761–769, 2016.
2017, volume 3, page 7, 2017.
[332] Zhiqiang Shen, Honghui Shi, [338] Abhinav Shrivastava, Rahul Sukthankar, Ji-
Rogério Schmidt Feris, Liangliang Cao, tendra Malik, and Abhinav Gupta. Beyond
Shuicheng Yan, Ding Liu, Xinchao Wang, skip connections: Top-down modulation for
Xiangyang Xue, and Thomas S. Huang. object detection. CoRR, abs/1612.06851,
Learning object detectors from scratch 2016. URL http://arxiv.org/abs/1612.
with gated recurrent feature pyramids. 06851.
CoRR, abs/1712.00886, 2017. URL
[339] Ashish Shrivastava, Tomas Pfister, Oncel
http://arxiv.org/abs/1712.00886.
Tuzel, Joshua Susskind, Wenda Wang, and
[333] Baoguang Shi, Xiang Bai, and Serge J. Be- Russell Webb. Learning from Simulated
longie. Detecting oriented text in natural and Unsupervised Images through Adversar-
images by linking segments. In 2017 IEEE ial Training. 2017 IEEE Conference on Com-
Conference on Computer Vision and Pat- puter Vision and Pattern Recognition, CVPR
tern Recognition, CVPR 2017, Honolulu, HI, 2017, Honolulu, HI, USA, July 21-26, 2017,
USA, July 21-26, 2017, pages 3482–3490. pages 2242–2251, 2017.
IEEE Computer Society, 2017.
[340] Shai Silberstein, Dan Levi, Victoria Kogan,
[334] Baoguang Shi, Cong Yao, Minghui Liao, and Ran Gazit. Vision-based pedestrian de-
Mingkun Yang, Pei Xu, Linyan Cui, Serge J. tection for rear-view cameras. In Intelli-
Belongie, Shijian Lu, and Xiang Bai. IC- gent Vehicles Symposium Proceedings, 2014
DAR2017 competition on reading chinese IEEE, pages 853–860, 2014.

79
[341] Daniel L Silver, Qiang Yang, and Lianghao Transactions on Pattern Analysis and Ma-
Li. Lifelong Machine Learning Systems: Be- chine Intelligence, 22(12):32, 2000.
yond Learning Algorithms. In 2013 AAAI
Spring Symposium, page 7, 2013. [350] Lars W. Sommer, Tobias Schuchert, Jur-
gen Beyerer, Firooz A. Sadjadi, and Abhi-
[342] Martin Simon, Stefan Milz, Karl Amende, jit Mahalanobis. Deep learning based multi-
and Horst-Michael Gross. Complex-yolo: category object detection in aerial images. In
Real-time 3d object detection on point SPIE Defense+ Security, May 2017.
clouds. CoRR, abs/1803.06199, 2018. URL
http://arxiv.org/abs/1803.06199. [351] Lars Wilko Sommer, Tobias Schuchert, and
Jürgen Beyerer. Fast deep vehicle detection
[343] Karen Simonyan and Andrew Zisser- in aerial images. In 2017 IEEE Winter Con-
man. Very deep convolutional net- ference on Applications of Computer Vision,
works for large-scale image recognition. WACV 2017, Santa Rosa, CA, USA, March
CoRR, abs/1409.1556, 2014. URL 24-31, 2017, pages 311–319. IEEE, 2017.
http://arxiv.org/abs/1409.1556.
[352] Lars Wilko Sommer, Arne Schumann, Tobias
[344] Karen Simonyan, Andrea Vedaldi, and An- Schuchert, and Jürgen Beyerer. Multi fea-
drew Zisserman. Deep inside convolu- ture deconvolutional faster R-CNN for pre-
tional networks: Visualising image classifi- cise vehicle detection in aerial imagery. In
cation models and saliency maps. CoRR, 2018 IEEE Winter Conference on Applica-
abs/1312.6034, 2013. URL http://arxiv. tions of Computer Vision, WACV 2018, Lake
org/abs/1312.6034. Tahoe, NV, USA, March 12-15, 2018, pages
635–642. IEEE Computer Society, 2018.
[345] Bharat Singh and Larry S Davis. An analysis
of scale invariance in object detection-snip. [353] Hyun Oh Song, Ross B. Girshick, Ste-
In 2017 IEEE Conference on Computer Vi- fanie Jegelka, Julien Mairal, Zaı̈d Har-
sion and Pattern Recognition, CVPR 2017, chaoui, and Trevor Darrell. On learning
Honolulu, HI, USA, July 21-26, 2017, 2018. to localize objects with minimal supervi-
sion. In Proceedings of the 31th Inter-
[346] Bharat Singh, Hengduo Li, Abhishek national Conference on Machine Learning,
Sharma, and Larry S. Davis. R-FCN-3000 ICML 2014, Beijing, China, 21-26 June
at 30fps: Decoupling detection and classifi- 2014, volume 32 of JMLR Workshop and
cation. CoRR, abs/1712.01802, 2017. URL Conference Proceedings, pages 1611–1619.
http://arxiv.org/abs/1712.01802. JMLR.org, 2014. URL http://jmlr.org/
proceedings/papers/v32/songb14.html.
[347] Bharat Singh, Mahyar Najibi, and Larry S.
Davis. SNIPER: efficient multi-scale training. [354] Hyun Oh Song, Yong Jae Lee, Stefanie
CoRR, abs/1805.09300, 2018. URL http:// Jegelka, and Trevor Darrell. Weakly-
arxiv.org/abs/1805.09300. supervised discovery of visual pattern con-
figurations. In Advances in Neural Informa-
[348] Leon Sixt, Benjamin Wild, and Tim Land- tion Processing Systems 27: Annual Confer-
graf. Rendergan: Generating realistic labeled ence on Neural Information Processing Sys-
data. Front. Robotics and AI, 2018, 2018. tems 2014, December 8-13 2014, Montreal,
Quebec, Canada, pages 1637–1645, 2014.
[349] Arnold W M Smeulders, Amarnath Gupta,
and Ramesh Jain. Content-Based Image Re- [355] Jost Tobias Springenberg, Alexey Dosovit-
trieval at the End of the Early Years. IEEE skiy, Thomas Brox, and Martin A. Ried-

80
miller. Striving for simplicity: The all con- Pattern Recognition, CVPR 2016, Las Ve-
volutional net. CoRR, abs/1412.6806, 2014. gas,NV, USA, June 27-30, 2016, 2016.
URL http://arxiv.org/abs/1412.6806.
[363] Christian Szegedy, Scott E. Reed, Dumitru
[356] Siddharth Srivastava, Gaurav Sharma, and Erhan, and Dragomir Anguelov. Scal-
Brejesh Lall. Large scale novel object discov- able, high-quality object detection. CoRR,
ery in 3d. In 2018 IEEE Winter Conference abs/1412.1441, 2014. URL http://arxiv.
on Applications of Computer Vision, WACV org/abs/1412.1441.
2018, Lake Tahoe, NV, USA, March 12-15,
2018, pages 179–188. IEEE Computer Soci- [364] Christian Szegedy, Wei Liu, Yangqing Jia,
ety, 2018. Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Van-
[357] Russell Stewart, Mykhaylo Andriluka, and houcke, Andrew Rabinovich, et al. Going
Andrew Y Ng. End-to-end people detection deeper with convolutions. In IEEE Confer-
in crowded scenes. In 2016 IEEE Conference ence on Computer Vision and Pattern Recog-
on Computer Vision and Pattern Recogni- nition, CVPR 2015, Boston, MA, USA, June
tion, CVPR 2016, Las Vegas,NV, USA, June 7-12, 2015, pages 1–9, 2015.
27-30, 2016, pages 2325–2333, 2016.
[365] Christian Szegedy, Vincent Vanhoucke,
[358] Hang Su, Shaogang Gong, and Xiatian Zhu. Sergey Ioffe, Jonathon Shlens, and Zbigniew
WebLogo-2M: Scalable Logo Detection by Wojna. Rethinking the inception architec-
Deep Learning from the Web. In ICCB ture for computer vision. In 2016 IEEE
Workshops, pages 270–279, October 2017. Conference on Computer Vision and Pattern
Recognition, CVPR 2016, Las Vegas, NV,
[359] Hang Su, Xiatian Zhu, and Shaogang Gong. USA, June 27-30, 2016, pages 2818–2826.
Deep Learning Logo Detection with Data Ex- IEEE Computer Society, 2016.
pansion by Synthesising Context. IEEE Win-
ter Conf. on Applications of Computer Vi- [366] Christian Szegedy, Sergey Ioffe, Vincent Van-
sion (WACV), pages 530–539, 2017. houcke, and Alexander A Alemi. Inception-
v4, inception-resnet and the impact of resid-
[360] Hang Su, Xiatian Zhu, and Shaogang Gong. ual connections on learning. In AAAI, vol-
Open Logo Detection Challenge. In Pro- ume 4, page 12, 2017.
ceedings of the British Machine Vision Con-
ference 2018, BMVC 2018, Newcastle, UK, [367] Mingxing Tan, Bo Chen, Ruoming Pang, Vi-
September 3-6, 2018, 2018. jay Vasudevan, and Quoc V. Le. Mnasnet:
Platform-aware neural architecture search for
[361] Baochen Sun and Kate Saenko. From vir- mobile. CoRR, abs/1807.11626, 2018. URL
tual to reality: Fast adaptation of virtual ob- http://arxiv.org/abs/1807.11626.
ject detectors to real domains. In British
Machine Vision Conference, BMVC 2014, [368] Kevin D. Tang, Vignesh Ramanathan, Fei-
Nottingham, UK, September 1-5, 2014, vol- Fei Li, and Daphne Koller. Shifting Weights:
ume 1, page 3, 2014. Adapting Object Detectors from Image to
Video. In Advances in Neural Information
[362] Chen Sun, Manohar Paluri, Ronan Col- Processing Systems 25: 26th Annual Confer-
lobert, Ram Nevatia, and Lubomir Bourdev. ence on Neural Information Processing Sys-
ProNet: Learning to Propose Object-Specific tems 2012. Proceedings of a meeting held
Boxes for Cascaded Neural Networks. In 2016 December 3-6, 2012, Lake Tahoe, Nevada,
IEEE Conference on Computer Vision and United States, 2012.

81
[369] Peng Tang, Xinggang Wang, Xiang Bai, and 2009 IEEE Applied Imagery Pattern Recog-
Wenyu Liu. Multiple instance detection net- nition Workshop (AIPR 2009), pages 1–8,
work with online instance classifier refine- 2009.
ment. In 2017 IEEE Conference on Com-
puter Vision and Pattern Recognition, CVPR [376] Luke Taylor and Geoff Nitschke. Improving
2017, Honolulu, HI, USA, July 21-26, 2017, deep learning using generic data augmenta-
2017. tion. CoRR, abs/1708.06020, 2017. URL
http://arxiv.org/abs/1708.06020.
[370] Siyu Tang, Mykhaylo Andriluka, and Bernt
Schiele. Detection and tracking of occluded [377] Yonglin Tian, Xuan Li, Kunfeng Wang, and
people. International Journal of Computer Fei-Yue Wang. Training and testing ob-
Vision (IJCV), 110(1):58–69, 2014. ject detectors with virtual images. CoRR,
abs/1712.08470, 2017. URL http://arxiv.
[371] Siyu Tang, Bjoern Andres, Miykhaylo An- org/abs/1712.08470.
driluka, and Bernt Schiele. Subgraph de-
composition for multi-target tracking. In [378] Tijmen Tieleman and Geoffrey Hinton. Lec-
IEEE Conference on Computer Vision and ture 6.5-rmsprop: Divide the gradient by
Pattern Recognition, CVPR 2015, Boston, a running average of its recent magnitude.
MA, USA, June 7-12, 2015, pages 5033– COURSERA: Neural networks for machine
5041, 2015. learning, 4(2):26–31, 2012.

[372] Tianyu Tang, Shilin Zhou, Zhipeng Deng, [379] Radu Timofte, Karel Zimmermann, and Luc
Lin Lei, and Huanxin Zou. Arbitrary- Van Gool. Multi-view traffic sign detection,
Oriented Vehicle Detection in Aerial Imagery recognition, and 3d localisation. Machine vi-
with Single Convolutional Neural Networks. sion and applications, 25(3):633–647, 2014.
Remote Sensing, 9:1170–17, November 2017.
[380] Tatiana Tommasi, Novi Patricia, Barbara
[373] Tianyu Tang, Shilin Zhou, Zhipeng Deng, Caputo, and Tinne Tuytelaars. A deeper
Huanxin Zou, and Lin Lei. Vehicle Detection look at dataset bias. In Gabriela Csurka,
in Aerial Images Based on Region Convolu- editor, Domain Adaptation in Computer Vi-
tional Neural Networks and Hard Negative sion Applications., Advances in Computer
Example Mining. Sensors, 17:336–17, Febru- Vision and Pattern Recognition, pages 37–
ary 2017. 55. Springer, 2017. URL https://doi.org/
10.1007/978-3-319-58347-1_2.
[374] Y. Tang, J. K. Wang, B. Gao, and E. Del-
landréa. Large Scale Semi-supervised Object [381] Antonio Torralba and Alexei A Efros. Un-
Detection using Visual and Semantic Knowl- biased look at dataset bias. In The 24th
edge Transfer. In 2016 IEEE Conference on IEEE Conference on Computer Vision and
Computer Vision and Pattern Recognition, Pattern Recognition, CVPR 2011, Colorado
CVPR 2016, Las Vegas,NV, USA, June 27- Springs, CO, USA, 20-25 June 2011, pages
30, 2016, 2016. 1521–1528, 2011.

[375] Franklin Tanner, Brian Colder, Craig Pullen, [382] Toan Tran, Trung Pham, Gustavo Carneiro,
David Heagy, Michael Eppolito, Veronica Lyle Palmer, and Ian Reid. A bayesian data
Carlan, Carsten Oertel, and Phil Sallee. augmentation approach for learning deep
Overhead imagery research data set???an an- models. In Advances in Neural Information
notated data library & tools to aid in the de- Processing Systems 30: Annual Conference
velopment of computer vision algorithms. In on Neural Information Processing Systems

82
2017, 4-9 December 2017, Long Beach, CA, [389] Andras Tüzkö, Christian Herrmann, Daniel
USA, pages 2797–2806, 2017. Manger, and Jürgen Beyerer. Open set logo
detection and retrieval. In Francisco H. Imai,
[383] Jonathan Tremblay, Aayush Prakash, David Alain Trémeau, and José Braz, editors, Pro-
Acuna, Mark Brophy, Varun Jampani, Cem ceedings of the 13th International Joint Con-
Anil, Thang To, Eric Cameracci, Shaad ference on Computer Vision, Imaging and
Boochoon, and Stan Birchfield. Training Computer Graphics Theory and Applications
deep networks with synthetic data: Bridg- (VISIGRAPP 2018) - Volume 5: VISAPP,
ing the reality gap by domain randomization. Funchal, Madeira, Portugal, January 27-29,
In The IEEE Conference on Computer Vi- 2018., pages 284–292. SciTePress, 2018.
sion and Pattern Recognition (CVPR) Work-
shops, June 2018. [390] Lachlan Tychsen-Smith and Lars Petersson.
Improving object localization with fitness
[384] Jonathan Tremblay, Thang To, and Stan NMS and bounded iou loss. In Computer Vi-
Birchfield. Falling things: A synthetic sion and Pattern Recognition (CVPR), 2018
dataset for 3d object detection and pose esti- IEEE Conference on, pages 6877–6885, 2018.
mation. CoRR, abs/1804.06534, 2018. URL doi: 10.1109/CVPR.2018.00719.
http://arxiv.org/abs/1804.06534.
[391] Jasper RR Uijlings, Koen EA Van De Sande,
[385] Subarna Tripathi, Zachary C. Lipton, Theo Gevers, and Arnold WM Smeul-
Serge J. Belongie, and Truong Q. Nguyen. ders. Selective search for object recogni-
Context matters: Refining object detec- tion. International Journal of Computer Vi-
tion in video with recurrent neural net- sion (IJCV), 104(2):154–171, 2013.
works. In Richard C. Wilson, Edwin R.
Hancock, and William A. P. Smith, edi- [392] Régis Vaillant, Christophe Monrocq, and
tors, Proceedings of the British Machine Vi- Yann Le Cun. Original approach for the
sion Conference 2016, BMVC 2016, York, localisation of objects in images. IEE
UK, September 19-22, 2016. BMVA Press, Proceedings-Vision, Image and Signal Pro-
2016. URL http://www.bmva.org/bmvc/ cessing, 141(4):245–250, 1994.
2016/papers/paper044/index.html.
[393] Koen EA Van de Sande, Jasper RR Uijlings,
[386] Zhuowen Tu and Xiang Bai. Auto-context Theo Gevers, and Arnold WM Smeulders.
and its application to high-level vision tasks Segmentation as selective search for object
and 3d brain image segmentation. IEEE recognition. In IEEE International Con-
Transactions on Pattern Analysis and Ma- ference on Computer Vision, ICCV 2011,
chine Intelligence, 32(10):1744–1757, 2010. Barcelona, Spain, November 6-13, 2011,
pages 1879–1886, 2011.
[387] Zhuowen Tu, Yi Ma, Wenyu Liu, Xiang Bai,
and Cong Yao. Detecting texts of arbitrary [394] Grant Van Horn, Oisin Mac Aodha, Yang
orientations in natural images. In 2012 IEEE Song, Yin Cui, Chen Sun, Alex Shepard,
Conference on Computer Vision and Pattern Hartwig Adam, Pietro Perona, and Serge Be-
Recognition, pages 1083–1090, 2012. longie. The iNaturalist Species Classifica-
tion and Detection Dataset. In Computer Vi-
[388] Oncel Tuzel, Fatih Porikli, and Peter Meer. sion and Pattern Recognition (CVPR), 2018
Pedestrian detection via classification on rie- IEEE Conference on, 2018.
mannian manifolds. IEEE Transactions on
Pattern Analysis and Machine Intelligence, [395] Gül Varol, Javier Romero, Xavier Martin,
30(10):1713–1727, 2008. Naureen Mahmood, Michael J. Black, Ivan

Laptev, and Cordelia Schmid. Learning from on Computer Vision and Pattern Recogni-
synthetic humans. In 2017 IEEE Conference tion, CVPR 2015, Boston, MA, USA, June
on Computer Vision and Pattern Recogni- 7-12, 2015, pages 851–859. IEEE Computer
tion, CVPR 2017, Honolulu, HI, USA, July Society, 2015.
21-26, 2017, pages 4627–4635. IEEE Com-
puter Society, 2017. [402] Chong Wang, Weiqiang Ren, Kaiqi Huang,
and Tieniu Tan. Weakly Supervised Object
[396] Andreas Veit, Tomas Matera, Lukas Neu- Localization with Latent Category Learning.
mann, Jiri Matas, and Serge J. Belongie. In Computer Vision - ECCV 2014 - 13th
Coco-text: Dataset and benchmark for text European Conference, Zurich, Switzerland,
detection and recognition in natural images. September 6-12, 2014, 2014.
CoRR, abs/1601.07140, 2016. URL http:
//arxiv.org/abs/1601.07140. [403] Kai Wang and Serge Belongie. Word spot-
ting in the wild. In Computer Vision -
[397] Alexander Vezhnevets and Vittorio Ferrari. ECCV 2010, 11th European Conference on
Object localization in imagenet by look- Computer Vision, Heraklion, Crete, Greece,
ing out of the window. In Xianghua Xie, September 5-11, 2010, pages 591–604, 2010.
Mark W. Jones, and Gary K. L. Tam, editors,
Proceedings of the British Machine Vision [404] Li Wang, Yao Lu, Hong Wang, Yingbin
Conference 2015, BMVC 2015, Swansea, Zheng, Hao Ye, and Xiangyang Xue. Evolv-
UK, September 7-10, 2015, pages 27.1–27.12. ing boxes for fast vehicle detection. ICME,
BMVA Press, 2015. pages 1135–1140, 2017.

[398] Paul A. Viola, Michael J. Jones, and Daniel [405] Robert J. Wang, Xiang Li, Shuang Ao, and
Snow. Detecting pedestrians using patterns Charles X. Ling. Pelee: A Real-Time Ob-
of motion and appearance. International ject Detection System on Mobile Devices. In
Journal of Computer Vision (IJCV), 63(2): International Conference on Learning Repre-
153–161, 2005. sentations (ICLR), 2018.

[399] Stefan Walk, Nikodem Majer, Konrad [406] Xiaolong Wang, Ross B. Girshick, Abhinav
Schindler, and Bernt Schiele. New fea- Gupta, and Kaiming He. Non-local neu-
tures and insights for pedestrian detection. ral networks. CoRR, abs/1711.07971, 2017.
In The Twenty-Third IEEE Conference on URL http://arxiv.org/abs/1711.07971.
Computer Vision and Pattern Recognition,
CVPR 2010, San Francisco, CA, USA, 13- [407] Xiaolong Wang, Abhinav Shrivastava, and
18 June 2010, pages 1030–1037, 2010. Abhinav Gupta. A-fast-rcnn: Hard positive
generation via adversary for object detection.
[400] Fang Wan, Pengxu Wei, Jianbin Jiao, Zhen- In 2017 IEEE Conference on Computer Vi-
jun Han, and Qixiang Ye. Min-entropy latent sion and Pattern Recognition, CVPR 2017,
model for weakly supervised object detection. Honolulu, HI, USA, July 21-26, 2017, pages
In Computer Vision and Pattern Recognition 3039–3048. IEEE Computer Society, 2017.
(CVPR), 2018 IEEE Conference on, June
2018. [408] Xiaoyu Wang, Tony X. Han, and Shuicheng
Yan. An HOG-LBP human detector with
[401] Li Wan, David Eigen, and Rob Fergus. partial occlusion handling. In IEEE 12th In-
End-to-end integration of a convolutional ternational Conference on Computer Vision,
network, deformable parts model and non- ICCV 2009, Kyoto, Japan, September 27 -
maximum suppression. In IEEE Conference October 4, 2009, pages 32–39, 2009.

[409] Xinlong Wang, Tete Xiao, Yuning Jiang, Tahoe, NV, USA, March 12-15, 2018, pages
Shuai Shao, Jian Sun, and Chunhua Shen. 1093–1102. IEEE Computer Society, 2018.
Repulsion Loss: Detecting Pedestrians in a
Crowd. In Computer Vision and Pattern [416] Bichen Wu, Forrest N. Iandola, Peter H.
Recognition (CVPR), 2018 IEEE Conference Jin, and Kurt Keutzer. Squeezedet: Unified,
on, 2018. small, low power fully convolutional neural
networks for real-time object detection for
[410] Maurice Weiler, Fred A. Hamprecht, and autonomous driving. In 2017 IEEE Confer-
Martin Storath. Learning steerable filters for ence on Computer Vision and Pattern Recog-
rotation equivariant cnns. In Computer Vi- nition Workshops, CVPR Workshops, Hon-
sion and Pattern Recognition (CVPR), 2018 olulu, HI, USA, July 21-26, 2017, pages 446–
IEEE Conference on, June 2018. 454. IEEE Computer Society, 2017.

[411] Longyin Wen, Dawei Du, Zhaowei Cai, Zhen [417] Bo Wu and Ram Nevatia. Cluster boosted
Lei, Ming-Ching Chang, Honggang Qi, Jong- tree classifier for multi-view, multi-pose
woo Lim, Ming-Hsuan Yang, and Siwei object detection. In IEEE 11th Inter-
Lyu. DETRAC: A new benchmark and national Conference on Computer Vision,
protocol for multi-object tracking. CoRR, ICCV 2007, Rio de Janeiro, Brazil, October
abs/1511.04136, 2015. URL http://arxiv. 14-20, 2007, pages 1–8, 2007.
org/abs/1511.04136.
[418] Bo Wu and Ramakant Nevatia. Detection
[412] Cameron Whitelam, Emma Taborsky, of multiple, partially occluded humans in
Austin Blanton, Brianna Maze, Jocelyn a single image by bayesian combination of
Adams, Tim Miller, Nathan Kalka, Anil K edgelet part detectors. In 10th IEEE In-
Jain, James A Duncan, Kristen Allen, et al. ternational Conference on Computer Vision
Iarpa janus benchmark-b face dataset. In (ICCV 2005), 17-20 October 2005, Beijing,
CVPR Workshop on Biometrics, 2017. China, pages 90–97, 2005.

[413] Christian Wojek, Gyuri Dorkó, André [419] Tianfu Wu, Bo Li, and Song-Chun Zhu.
Schulz, and Bernt Schiele. Sliding-windows Learning and-or model to represent context
for rapid object class localization: A paral- and occlusion for car detection and view-
lel technique. In Joint Pattern Recognition point estimation. IEEE Transactions on Pat-
Symposium, pages 71–81, 2008. tern Analysis and Machine Intelligence, 38
(9):1829–1843, 2016.
[414] Christian Wojek, Stefan Walk, and Bernt
Schiele. Multi-cue onboard pedestrian de- [420] Yue Wu and Qiang Ji. Facial Landmark De-
tection. In 2009 IEEE Computer Society tection: A Literature Survey. International
Conference on Computer Vision and Pattern Journal of Computer Vision (IJCV), To ap-
Recognition (CVPR 2009), 20-25 June 2009, pear, May 2018.
Miami, Florida, USA, pages 794–801. IEEE
Computer Society, 2009. [421] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen
Zhu, Serge J. Belongie, Jiebo Luo, Mi-
[415] Sanghyun Woo, Soonmin Hwang, and In So hai Datcu, Marcello Pelillo, and Liangpei
Kweon. Stairnet: Top-down semantic aggre- Zhang. DOTA: A large-scale dataset for
gation for accurate one shot detection. In object detection in aerial images. CoRR,
2018 IEEE Winter Conference on Applica- abs/1711.10398, 2017. URL http://arxiv.
tions of Computer Vision, WACV 2018, Lake org/abs/1711.10398.

[422] Wei Xiang, Dong-Qing Zhang, Heather Yu, [429] Zhaozhuo Xu, Xin Xu, Lei Wang, Rui Yang,
and Vassilis Athitsos. Context-aware single- and Fangling Pu. Deformable ConvNet with
shot detector. pages 1784–1793, 2018. doi: Aspect Ratio Constrained NMS for Object
10.1109/WACV.2018.00198. Detection in Remote Sensing Imagery. Re-
mote Sensing, 9:1312–19, December 2017.
[423] Yu Xiang and S. Savarese. Estimating the
aspect layout of object categories. In 2012 [430] Junjie Yan, Xuzong Zhang, Zhen Lei, and
IEEE Conference on Computer Vision and Stan Z. Li. Face detection by structural mod-
Pattern Recognition, Providence, RI, USA, els. Image and Vision Computing, 32(10):
June 16-21, 2012, 2012. 790–799, October 2014.

[424] Yu Xiang, Wongun Choi, Yuanqing Lin, and [431] Fan Yang, Wongun Choi, and Yuanqing Lin.
Silvio Savarese. Data-driven 3d voxel pat- Exploit all the layers: Fast and accurate cnn
terns for object category recognition. In object detector with scale dependent pooling
IEEE Conference on Computer Vision and and cascaded rejection classifiers. In 2016
Pattern Recognition, CVPR 2015, Boston, IEEE Conference on Computer Vision and
MA, USA, June 7-12, 2015, pages 1903– Pattern Recognition, CVPR 2016, Las Ve-
1911. IEEE Computer Society, 2015. gas,NV, USA, June 27-30, 2016, pages 2129–
2137, 2016.
[425] Yao Xiao, Cewu Lu, E. Tsougenis, Yongyi
[432] Shuo Yang, Ping Luo, Chen Change Loy,
Lu, and Chi-Keung Tang. Complexity-
and Xiaoou Tang. From facial parts re-
adaptive distance metric for object propos-
sponses to face detection: A deep learning
als generation. In IEEE Conference on Com-
approach. In 2015 IEEE International Con-
puter Vision and Pattern Recognition, CVPR
ference on Computer Vision, ICCV 2015,
2015, Boston, MA, USA, June 7-12, 2015,
Santiago, Chile, December 7-13, 2015, pages
2015.
3676–3684. IEEE Computer Society, 2015.
[426] Saining Xie, Ross Girshick, Piotr Dollár, [433] Shuo Yang, Ping Luo, Chen-Change Loy, and
Zhuowen Tu, and Kaiming He. Aggregated Xiaoou Tang. Wider face: A face detection
residual transformations for deep neural net- benchmark. In 2016 IEEE Conference on
works. In 2017 IEEE Conference on Com- Computer Vision and Pattern Recognition,
puter Vision and Pattern Recognition, CVPR CVPR 2016, Las Vegas,NV, USA, June 27-
2017, Honolulu, HI, USA, July 21-26, 2017, 30, 2016, pages 5525–5533, 2016.
pages 5987–5995, 2017.
[434] Zhenheng Yang and Ramakant Nevatia. A
[427] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou multi-scale cascade fully convolutional net-
Ren, and Rama Chellappa. Deep regionlets work face detector. In 23rd International
for object detection. CoRR, abs/1712.02408, Conference on Pattern Recognition, ICPR
2017. URL http://arxiv.org/abs/1712. 2016, Cancún, Mexico, December 4-8, 2016,
02408. pages 633–638. IEEE, 2016.

[428] Jiaolong Xu, Sebastian Ramos, David [435] Cong Yao, Xiang Bai, Nong Sang, Xinyu
Vázquez, and Antonio M López. Domain Zhou, Shuchang Zhou, and Zhimin Cao.
adaptation of deformable part-based mod- Scene text detection via holistic, multi-
els. IEEE Transactions on Pattern Analysis channel prediction. CoRR, abs/1606.09002,
and Machine Intelligence, 36(12):2367–2380, 2016. URL http://arxiv.org/abs/1606.
2014. 09002.

[436] Ryota Yoshihashi, Tu Tuan Trinh, Rei the British Machine Vision Conference 2016,
Kawakami, Shaodi You, Makoto Iida, and BMVC 2016, York, UK, September 19-22,
Takeshi Naemura. Learning multi-frame 2016, September 2016.
visual representation for joint detection
and tracking of small objects. CoRR, [443] Yuan Yuan, Xiaodan Liang, Xiaolong Wang,
abs/1709.04666, 2017. URL http://arxiv. Dit-Yan Yeung, and Abhinav Gupta. Tem-
org/abs/1709.04666. poral dynamic graph lstm for action-driven
video object detection. In IEEE Inter-
[437] Yang You, Zhao Zhang, Cho-Jui Hsieh, national Conference on Computer Vision,
James Demmel, and Kurt Keutzer. Ima- ICCV 2017, Venice, Italy, October 22-29,
genet training in minutes. In Proceedings of 2017, Oct 2017.
the 47th International Conference on Parallel
Processing, ICPP 2018, Eugene, OR, USA, [444] Mehmet Kerim Yucel, Yunus Can Bilge,
August 13-16, 2018, pages 1:1–1:10. ACM, Oguzhan Oguz, Nazli Ikizler-Cinbis, Pinar
2018. Duygulu, and Ramazan Gokberk Cinbis.
Wildest faces: Face detection and recognition
[438] Fisher Yu and Vladlen Koltun. Multi-scale in violent settings. CoRR, abs/1805.07566,
context aggregation by dilated convolutions. 2018. URL http://arxiv.org/abs/1805.
CoRR, abs/1511.07122, 2015. URL http:// 07566.
arxiv.org/abs/1511.07122.
[445] Sergey Zagoruyko and Nikos Komodakis.
[439] Fisher Yu, Vladlen Koltun, and Thomas A. Wide residual networks. In Richard C. Wil-
Funkhouser. Dilated residual networks. In son, Edwin R. Hancock, and William A. P.
2017 IEEE Conference on Computer Vision Smith, editors, Proceedings of the British
and Pattern Recognition, CVPR 2017, Hon- Machine Vision Conference 2016, BMVC
olulu, HI, USA, July 21-26, 2017, pages 636– 2016, York, UK, September 19-22, 2016.
644. IEEE Computer Society, 2017. doi: BMVA Press. URL http://www.bmva.org/
10.1109/CVPR.2017.75. bmvc/2016/papers/paper087/index.html.
[440] Fisher Yu, Wenqi Xian, Yingying Chen, [446] Sergey Zagoruyko, Adam Lerer, Tsung-Yi
Fangchen Liu, Mike Liao, Vashisht Madha- Lin, Pedro Oliveira Pinheiro, Sam Gross,
van, and Trevor Darrell. BDD100K: A di- Soumith Chintala, and Piotr Dollár. A
verse driving video database with scalable multipath network for object detection. In
annotation tooling. CoRR, abs/1805.04687, Richard C. Wilson, Edwin R. Hancock, and
2018. URL http://arxiv.org/abs/1805. William A. P. Smith, editors, Proceedings of
04687. the British Machine Vision Conference 2016,
BMVC 2016, York, UK, September 19-22,
[441] Jiahui Yu, Yuning Jiang, Zhangyang Wang, 2016, 2016. URL http://www.bmva.org/
Zhimin Cao, and Thomas S. Huang. Unitbox: bmvc/2016/papers/paper015/index.html.
An advanced object detection network. In
Proceedings of the 2016 ACM Conference on [447] Matthew D. Zeiler. ADADELTA: an
Multimedia Conference, MM 2016, Amster- adaptive learning rate method. CoRR,
dam, The Netherlands, October 15-19, 2016, abs/1212.5701, 2012. URL http://arxiv.
pages 516–520, 2016. org/abs/1212.5701.
[442] Ruichi Yu, Xi Chen, Vlad I. Morariu, and [448] Matthew D. Zeiler and Rob Fergus. Vi-
Larry S. Davis. The Role of Context Selec- sualizing and understanding convolutional
tion in Object Detection. In Proceedings of networks. In Computer Vision - ECCV

2014 - 13th European Conference, Zurich, ECCV 2016 - 14th European Conference,
Switzerland, September 6-12, 2014, pages Amsterdam, The Netherlands, October 11-
818–833, 2014. URL https://doi.org/10. 14, 2016, volume 9906 of Lecture Notes in
1007/978-3-319-10590-1_53. Computer Science, pages 443–457. Springer,
2016. URL https://doi.org/10.1007/
[449] Matthew D Zeiler and Rob Fergus. Visu- 978-3-319-46475-6_28.
alizing and understanding convolutional net-
works. In Computer Vision - ECCV 2014 - [456] Shanshan Zhang, Rodrigo Benenson, and
13th European Conference, Zurich, Switzer- Bernt Schiele. Citypersons: A diverse dataset
land, September 6-12, 2014, pages 818–833, for pedestrian detection. In 2017 IEEE
2014. Conference on Computer Vision and Pat-
tern Recognition, CVPR 2017, Honolulu, HI,
[450] Xingyu Zeng, Wanli Ouyang, Bin Yang, Jun-
USA, July 21-26, 2017, pages 4457–4465.
jie Yan, and Xiaogang Wang. Gated Bi-
IEEE Computer Society, 2017.
directional CNN for Object Detection. In
Computer Vision - ECCV 2016 - 14th Eu-
[457] Shanshan Zhang, Jian Yang, and Bernt
ropean Conference, Amsterdam, The Nether-
Schiele. Occluded Pedestrian Detection
lands, October 11-14, 2016, October 2016.
Through Guided Attention in CNNs. In
[451] Xingyu Zeng, Wanli Ouyang, Junjie Yan, Computer Vision and Pattern Recognition
Hongsheng Li, Tong Xiao, Kun Wang, (CVPR), 2018 IEEE Conference on, page 9,
Yu Liu, Yucong Zhou, Bin Yang, Zhe Wang, 2018.
et al. Crafting gbd-net for object detection.
IEEE Transactions on Pattern Analysis and [458] Shifeng Zhang, Xiangyu Zhu, Zhen Lei,
Machine Intelligence, 2017. Hailin Shi, Xiaobo Wang, and Stan Z. Li.
S^3FD: Single Shot Scale-invariant Face De-
[452] Yao Zhai, Jingjing Fu, Yan Lu, and Houqiang tector. In IEEE International Conference on
Li. Feature selective networks for object de- Computer Vision, ICCV 2017, Venice, Italy,
tection. In Computer Vision and Pattern October 22-29, 2017, 2017.
Recognition (CVPR), 2018 IEEE Conference
on, June 2018. [459] Shifeng Zhang, Longyin Wen, Xiao Bian,
Zhen Lei, and Stan Z. Li. Occlusion-aware
[453] Cha Zhang and Zhengyou Zhang. A survey of
R-CNN: detecting pedestrians in a crowd.
recent advances in face detection. Technical
CoRR, abs/1807.08407, 2018. URL http:
report, Tech. rep., Microsoft Research, 2010.
//arxiv.org/abs/1807.08407.
[454] Dongqing Zhang, Jiaolong Yang,
Dongqiangzi Ye, and Gang Hua. Lq- [460] Shifeng Zhang, Longyin Wen, Xiao Bian,
nets: Learned quantization for highly Zhen Lei, and Stan Z. Li. Single-shot re-
accurate and compact deep neural net- finement neural network for object detection.
works. CoRR, abs/1807.10029, 2018. URL In Computer Vision and Pattern Recognition
http://arxiv.org/abs/1807.10029. (CVPR), 2018 IEEE Conference on, 2018.

[455] Liliang Zhang, Liang Lin, Xiaodan Liang, [461] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin,
and Kaiming He. Is faster R-CNN do- and Jian Sun. Shufflenet: An extremely effi-
ing well for pedestrian detection? In cient convolutional neural network for mobile
Bastian Leibe, Jiri Matas, Nicu Sebe, and devices. CoRR, abs/1707.01083, 2017. URL
Max Welling, editors, Computer Vision - http://arxiv.org/abs/1707.01083.

[462] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, [468] Fan Zhao, Yao Yang, Hai-yan Zhang, Lin-
Yi Yang, and Thomas S. Huang. Adversar- lin Yang, and Lin Zhang. Sign text detec-
ial complementary learning for weakly super- tion in street view images using an integrated
vised object localization. In Computer Vi- feature. Multimedia Tools and Applications,
sion and Pattern Recognition (CVPR), 2018 April 2018.
IEEE Conference on, June 2018.
[469] Xiangyun Zhao, Shuang Liang, and Yichen
[463] Xiaopeng Zhang, Jiashi Feng, Hongkai Wei. Pseudo mask augmented object detec-
Xiong, and Qi Tian. Zigzag learning for tion. In Computer Vision and Pattern Recog-
weakly supervised object detection. In nition (CVPR), 2018 IEEE Conference on,
Computer Vision and Pattern Recognition June 2018.
(CVPR), 2018 IEEE Conference on, June [470] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu,
2018. and Xindong Wu. Object detection with deep
learning: A review. CoRR, abs/1807.05511,
[464] Yongqiang Zhang, Yancheng Bai, Min- 2018. URL http://arxiv.org/abs/1807.
gli Ding, Yongqiang Li, and Bernard 05511.
Ghanem. W2f: A weakly-supervised to fully-
supervised framework for object detection. [471] Liwen Zheng, Canmiao Fu, and Yong Zhao.
In Computer Vision and Pattern Recognition Extend the shallow part of single shot multi-
(CVPR), 2018 IEEE Conference on, June box detector via convolutional neural net-
2018. work. CoRR, abs/1801.05918, 2018. URL
http://arxiv.org/abs/1801.05918.
[465] Yuting Zhang, Kihyuk Sohn, R. Villegas,
Gang Pan, and Honglak Lee. Improving ob- [472] Bolei Zhou, Aditya Khosla, Àgata
ject detection with deep convolutional net- Lapedriza, Aude Oliva, and Antonio Tor-
works via Bayesian optimization and struc- ralba. Object detectors emerge in deep scene
tured prediction. In IEEE Conference on cnns. In IEEE Conference on Computer Vi-
Computer Vision and Pattern Recognition, sion and Pattern Recognition, CVPR 2015,
CVPR 2015, Boston, MA, USA, June 7-12, Boston, MA, USA, June 7-12, 2015, 2015.
2015, 2015. [473] Bolei Zhou, Àgata Lapedriza, Jianxiong
Xiao, Antonio Torralba, and Aude Oliva.
[466] Zheng Zhang, Chengquan Zhang, Wei Shen, Learning deep features for scene recognition
Cong Yao, Wenyu Liu, and Xiang Bai. Multi- using places database. In Advances in Neu-
oriented text detection with fully convolu- ral Information Processing Systems 27: An-
tional networks. In 2016 IEEE Conference on nual Conference on Neural Information Pro-
Computer Vision and Pattern Recognition, cessing Systems 2014, December 8-13 2014,
CVPR 2016, Las Vegas, NV, USA, June 27- Montreal, Quebec, Canada, 2014.
30, 2016, pages 4159–4167. IEEE Computer
Society, 2016. [474] Bolei Zhou, Aditya Khosla, Àgata Lapedriza,
Aude Oliva, and Antonio Torralba. Learning
[467] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, deep features for discriminative localization.
Wei Shen, Bo Wang, and Alan L. Yuille. In 2016 IEEE Conference on Computer Vi-
Single-shot object detection with enriched se- sion and Pattern Recognition, CVPR 2016,
mantics. In Computer Vision and Pattern Las Vegas, NV, USA, June 27-30, 2016,
Recognition (CVPR), 2018 IEEE Conference pages 2921–2929. IEEE Computer Society,
on, June 2018. 2016.

[475] Peng Zhou, Bingbing Ni, Cong Geng, Jian- [482] Pengkai Zhu, Hanxiao Wang, Tolga Boluk-
guo Hu, and Yi Xu. Scale-Transferrable Ob- basi, and Venkatesh Saligrama. Zero-shot de-
ject Detection. In Computer Vision and Pat- tection. CoRR, abs/1803.07113, 2018. URL
tern Recognition (CVPR), 2018 IEEE Con- http://arxiv.org/abs/1803.07113.
ference on, page 10, 2018.
[483] Xiangxin Zhu and Deva Ramanan. Face de-
[476] Shuchang Zhou, Zekun Ni, Xinyu Zhou, tection, pose estimation, and landmark lo-
He Wen, Yuxin Wu, and Yuheng Zou. calization in the wild. In 2012 IEEE Confer-
Dorefa-net: Training low bitwidth convolu- ence on Computer Vision and Pattern Recog-
tional neural networks with low bitwidth gra- nition, Providence, RI, USA, June 16-21,
dients. CoRR, abs/1606.06160, 2016. URL 2012, pages 2879–2886. IEEE Computer So-
http://arxiv.org/abs/1606.06160. ciety, 2012.
[477] Xinyu Zhou, Cong Yao, He Wen, Yuzhi [484] Xizhou Zhu, Yujie Wang, Jifeng Dai,
Wang, Shuchang Zhou, Weiran He, and Ji- Lu Yuan, and Yichen Wei. Flow-guided
ajun Liang. East: An efficient and accurate feature aggregation for video object detec-
scene text detector. In 2017 IEEE Confer- tion. In IEEE International Conference on
ence on Computer Vision and Pattern Recog- Computer Vision, ICCV 2017, Venice, Italy,
nition, CVPR 2017, Honolulu, HI, USA, October 22-29, 2017, pages 408–417. IEEE
July 21-26, 2017, July 2017. Computer Society, 2017.
[478] Yin Zhou and Oncel Tuzel. Voxelnet: End-
[485] Xizhou Zhu, Yujie Wang, Jifeng Dai,
to-end learning for point cloud based 3d ob-
Lu Yuan, and Yichen Wei. Flow-guided fea-
ject detection. CoRR, abs/1711.06396, 2017.
ture aggregation for video object detection.
URL http://arxiv.org/abs/1711.06396.
In IEEE International Conference on Com-
[479] Haigang Zhu, Xiaogang Chen, Weiqun Dai, puter Vision, ICCV 2017, Venice, Italy, Oc-
Kun Fu, Qixiang Ye, and Jianbin Jiao. Ori- tober 22-29, 2017, pages 408–417, 2017. doi:
entation robust object detection in aerial im- 10.1109/ICCV.2017.52.
ages using deep convolutional neural net-
work. In Image Processing (ICIP), 2015 [486] Xizhou Zhu, Yuwen Xiong, Jifeng Dai,
IEEE International Conference on, pages Lu Yuan, and Yichen Wei. Deep feature
3735–3739, 2015. flow for video recognition. In 2017 IEEE
Conference on Computer Vision and Pat-
[480] Jun-Yan Zhu, Taesung Park, Phillip Isola, tern Recognition, CVPR 2017, Honolulu, HI,
and Alexei A. Efros. Unpaired image-to- USA, July 21-26, 2017, volume 2, page 7,
image translation using cycle-consistent ad- 2017.
versarial networks. In IEEE International
Conference on Computer Vision, ICCV [487] Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen
2017, Venice, Italy, October 22-29, 2017, Wei, and Lu Yuan. Towards high perfor-
pages 2242–2251. IEEE Computer Society, mance video object detection for mobiles.
2017. CoRR, abs/1804.05830, 2018. URL http:
//arxiv.org/abs/1804.05830.
[481] Pengfei Zhu, Longyin Wen, Xiao Bian,
Haibin Ling, and Qinghua Hu. Vision meets [488] Yousong Zhu, Chaoyang Zhao, Jinqiao
drones: A challenge. CoRR, abs/1804.07437, Wang, Xu Zhao, Yi Wu, and Hanqing Lu.
2018. URL http://arxiv.org/abs/1804. Couplenet: Coupling global structure with
07437. local parts for object detection. In IEEE

International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4126–4134, 2017. doi: 10.1109/ICCV.2017.444.

[489] Yukun Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.

[490] Zhe Zhu, Dun Liang, Songhai Zhang, Xiaolei Huang, Baoli Li, and Shimin Hu. Traffic-sign detection and classification in the wild. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2110–2118, 2016.

[491] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, 2014.

[492] Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, and Yushi Chen. Learning Contextual Dependence With Convolutional Hierarchical Recurrent Neural Networks. IEEE Transactions on Image Processing, 2016.

A Datasets and Results

Most of the influential ideas, concepts and literature of object detection having now been reviewed, the rest of the article dives into the datasets used to train and evaluate these detectors.

Public datasets play an essential role: they not only allow to measure and compare the performance of object detectors, they also provide the resources needed to learn object models from examples. In the age of deep learning these resources are crucial, as it has been clearly shown that deep convolutional neural networks are designed to benefit from, and learn on, massive amounts of data [473]. This section discusses the main datasets used in the recent object detection literature and presents state-of-the-art methods for each of them.

A.1 Classical Datasets with Common Objects

We first present the datasets containing everyday-life objects captured with consumer cameras. This category contains the most important datasets of the domain, attracting the largest part of the community. We then discuss, in a second section, the datasets devoted to specific detection tasks (e.g., face detection, pedestrian detection, etc.).

A.1.1 Pascal-VOC

Pascal-VOC [88] is the most iconic object detection dataset. It has changed over the years, but the format everyone is familiar with is the one that emerged in 2007, with 20 classes (Person: person; Animal: bird, cat, cow, dog, horse, sheep; Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train; Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor). It is now used as a test bed for most new algorithms. As it is quite small, there have been claims that the community is starting to overfit on its test set; therefore, MS COCO (see next section) is preferred nowadays to demonstrate the quality of a new algorithm. The 0.5 IoU based metric this dataset introduced has now become the de facto standard for every single detection problem. Overall, this dataset's impact on the development of innovative methods in object detection cannot be overstated. It is quite hard to find all relevant literature, but we have tried to be as thorough as possible regarding the best performing methods. The URL of the dataset is http://host.robots.ox.ac.uk/pascal/VOC/.

Two versions of Pascal-VOC are commonly used in the literature, namely VOC2007 and VOC2012:

• VOC07, with 9,963 images containing 24,640 annotated objects, is small. For this reason, papers using VOC07 often train on the union of the VOC07 and VOC12 trainvals (VOC07+12). The Average Precision (AP) averaged across the 20 classes is saturating at around 80 points @0.5 IoU. A few methods gained some extra points, but it seems hard to go above roughly 85 points without pre-training on MS COCO; using MS COCO data in addition, one can get up to 86.3 AP (see [207]). We chose to display only the methods with mAP over 80 points in Table 6. We do not distinguish between methods that use multiple inference tricks and methods that report results as is; for each method we report the highest published result we could find.

• VOC12 is a little harder than its 2007 counterpart, and the community has only recently passed the 80-point mark. As it is harder, most of the literature uses, this time, the union of the whole VOC2007 data (trainval+test) and the VOC2012 trainval; it is referred to as 07++12. Again, better results are obtained with pre-training on COCO data (83.8 points in [117]). Results above 75 points are presented in Table 7.

Method   Backbone      mAP
[247]    ResNeXt-101   83.1
[427]    ResNet-101    83.1
[452]    ResNet-101    82.9
[63]     ResNet-101    82.6
[176]    ResNet-101    82.4
[207]    ResNet-101    82.1
[467]    VGG-16        81.7
[92]     ResNet-101    81.5
[475]    DenseNet-169  80.9
[469]    ResNet-101    80.7
[62]     ResNet-101    80.5

Table 6: State-of-the-art methods on the VOC07 test set (trained on VOC07+12).

Method   Backbone      mAP
[427]    ResNet-101    81.2
[176]    ResNet-101    81.1
[247]    ResNeXt-101   80.9
[207]    ResNet-101    80.6
[452]    ResNet-101    80.5
[467]    VGG-16        80.3
[92]     ResNet-101    80.0
[221]    ResNet-101    78.5
[62]     ResNet-101    77.6

Table 7: State-of-the-art methods on the VOC12 test set (trained on VOC07++12).

On both splits, all the leaders of the board rely on heavy backbones with more than a hundred layers, except for [467], which gets close to the state of the art using only VGG-16.
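To make this protocol concrete, the snippet below is a minimal NumPy sketch (with function names of our own choosing, not taken from the official development kit) of its two ingredients: the IoU test used to decide whether a detection matches a ground-truth box, and the per-class Average Precision computed from the ranked detections, here with the 11-point interpolation of the VOC07 toolkit. The official evaluation additionally handles 'difficult' objects and, from VOC2010 onwards, uses all-point interpolation.

    import numpy as np

    def iou(box_a, box_b):
        # Boxes are [x1, y1, x2, y2] in pixel coordinates.
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def voc07_average_precision(scores, is_true_positive, num_ground_truth):
        # scores: confidence of every detection of one class on the test set.
        # is_true_positive: 1 when the detection matches a not-yet-matched
        # ground-truth box with IoU >= 0.5 (greedy matching, highest scores
        # first), 0 otherwise. num_ground_truth: number of annotated boxes.
        order = np.argsort(-np.asarray(scores, dtype=float))
        hits = np.asarray(is_true_positive, dtype=float)[order]
        tp, fp = np.cumsum(hits), np.cumsum(1.0 - hits)
        recall = tp / max(num_ground_truth, 1)
        precision = tp / np.maximum(tp + fp, 1e-12)
        ap = 0.0
        for t in np.linspace(0.0, 1.0, 11):     # 11-point interpolation (VOC07)
            mask = recall >= t
            ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
        return ap

The mAP reported in Tables 6 and 7 is this per-class AP averaged over the 20 classes.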
A.1.2 MS COCO

MS COCO [214] is the most challenging object detection dataset available today. It consists of 118,000 training images, 5,000 validation images and 41,000 testing images. The authors have also released 120K unlabeled images that follow the same class distribution as the labeled images; they may be useful for semi-supervised learning on COCO. The MS COCO challenge has been ongoing since 2015. There are 80 object categories, over 4 times more than Pascal-VOC. MS COCO is a fine replacement for Pascal-VOC, which has arguably started to age a little. Like ImageNet in its time, MS COCO has become the de facto standard of the object detection community, and any method reaching the state of the art on it is assured to gain much traction and visibility. The AP is calculated similarly to Pascal-VOC, but it is averaged over multiple IoU thresholds between 0.5 and 0.95.
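In other words, the same matching and AP computation is repeated at the IoU thresholds 0.50, 0.55, ..., 0.95, and the resulting values are averaged over thresholds and over the 80 classes. The sketch below only illustrates this averaging; ap_fn is an assumed callable wrapping a per-class, per-threshold AP routine such as the one sketched for Pascal-VOC above, whereas the official COCO API (pycocotools) additionally breaks results down by object size and caps the number of detections per image.

    import numpy as np

    IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)     # 0.50, 0.55, ..., 0.95

    def coco_style_map(ap_fn, class_ids):
        # ap_fn(class_id, iou_threshold) -> AP of one class at one threshold.
        per_class = [np.mean([ap_fn(c, float(t)) for t in IOU_THRESHOLDS])
                     for c in class_ids]
        return float(np.mean(per_class))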
Most available alternatives stemmed from Faster R-CNN [309], which in its first iteration won the first challenge with 37.3 mAP using a ResNet-101 backbone. In the second iteration of the challenge, the mAP went up to 41.5 with an ensemble of Faster R-CNNs [309] that used a different implementation of RoI-Pooling, which maybe inspired the RoI-Align of Mask R-CNN [118]. Tao Kong claimed that a single Faster R-CNN with HyperNet features [174] can reach 42.0 mAP. The best published single-model method [274] is nowadays around 50.5 mAP (52.5 with an ensemble) and relies on different techniques already mentioned in this survey, among which FPN [215], large batch training [274] and GCN [275]. Ensembling Mask R-CNNs [118] gives about the same performance as [274], at around 50.3 mAP. Deformable R-FCN [63] is not lagging too far behind, with a 48.5 mAP single-model performance (50.4 mAP with an ensemble) using Soft-NMS [21] and the "mandatory" FPN [215]. Other entries were based mostly on Mask R-CNN [118]. We display in Figure 22 the main ideas behind the winning entries of all the past challenges; the current leaderboard is visible at http://cocodataset.org/#detection-leaderboard. The URL of the dataset is http://cocodataset.org.

A.1.3 ImageNet Detection Task

ImageNet is a dataset organized according to the nouns of the WordNet hierarchy. Each node of the hierarchy is depicted by hundreds to thousands of images, with an average of over 5,000 images per node. Since 2010, the Large Scale Visual Recognition Challenge has been organized each year and contains a detection challenge using ImageNet images. The detection task, in which each object instance has to be detected, has 200 categories. There is also a classification and localization task, with 1,000 categories, in which algorithms have to produce 5 labels (and 5 bounding boxes) only, so as not to penalize the detection of objects that are present but not included in the ground truth. In the 2017 contest, the top detector was proposed by a team from Nanjing University of Information Science and Imperial College London. It ranked first on 85 categories with an overall AP of 73.13. As far as we know, there is no paper describing the approach precisely (but some slides are available at the workshop page). The 2nd ranked method was from Bae et al. [8], who observed that modern convolutional detectors behave differently for each object class. The authors consequently built an ensemble detector by finding the best detector for each object class. They obtained an AP of 59.30 points and won 10 categories. ImageNet is available at http://image-net.org.

A.1.4 VisualGenome

VisualGenome [179] is a very peculiar dataset focusing on object relationships. It contains over 100,000 images. Each image has bounding boxes but also complete scene graphs. Over 17,000 categories of objects are present; the most represented ones, by far, are man and woman, followed by trees and sky. On average there are 21 objects per image. It is unclear whether it qualifies as an object detection dataset, as the paper does not include clear object detection metrics or evaluation: its focus is on scene graphs and visual relationships. However, it is undoubtedly an enormous source of strongly supervised images to train object detectors. The Visual Genome dataset has a huge number of classes, most of them being small and hard to detect; the mAP reported in the literature is therefore much smaller than on the previous datasets. One of the best performing approaches is that of Li et al. [204], which reached 7.43 mAP by linking object detection, scene graph generation and region captioning. Faster R-CNN [104] has a mAP of 6.72 points on this dataset. The URL of the dataset is https://visualgenome.org.
Figure 22: This plot displays the performance advances in the bounding boxes detection COCO challenge
over the years. For each year we present the main ideas behind the three best performing entries in terms
of mmAP. In 2015 the main frameworks were Fast R-CNN [104], DeepMask [282] and Faster R-CNN [309]
supported by the new Deep ResNets [117]. In 2016, the same pipelines won the competition with the
addition of AttractioNet [101] and LocNet [103] for better proposals and localization accuracy. In 2017
Mask R-CNN [118], FPN [215] and MegDet [274] proved that more complex ideas could allow to go over
the 50% mark. In 2018 the same pipelines, as in 2017 (namely Mask R-CNN), were enriched with the
multi-stages of Cascade R-CNN [27], a new RPN and backbones that were for the first time specifically
designed for the detection task. The last entry of 2018 reached 53% mmAP and we can extrapolate the two
first entries to be around 55% bbox mmAP based on their ranking for instance segmentation.

A.1.5 OpenImages

The OpenImagesV4 challenge [178], organized for the first time at ECCV2018, offers the largest common objects detection dataset to date, with up to 500 classes (including the familiar ones from Pascal-VOC) on 1,743,000 images and more than 12,000,000 bounding boxes (an average of 7 objects per image) for training, and 125,436 images for testing (41,620 for validation). The object detection metric is the [email protected] averaged across classes, taking into account the hierarchical structure of the classes, with some technical subtleties on how to deal with groups of objects closely packed together. This is the first detection dataset to have so many classes and images, and it will surely require some new breakthrough to get it right. At the time of writing there are no published results on it, although an Inception-ResNet Faster R-CNN baseline is reported on their site at 37 mAP. The URL of the project is https://storage.googleapis.com/openimages/web/index.html.

For industrial applications, more often than not, the objects to detect do not come from the categories present in VOC or MS COCO. Furthermore, they do not share the same variances; rotation variance, for instance, is a property of several application domains but is not present in any classical common object dataset. That is why, pushed by industry needs, several other object detection domains have appeared, each with its own literature. The most famous of them are listed in the following sections.

A.2 Specialized datasets

To find interesting domains one has to find interesting products or applications that drive them. The industry has given birth to many sub-fields in object detection: they wanted to have self-driving cars so we built pedestrian detection and traffic sign detection datasets; they wanted to monitor traffic so we had to have aerial imagery datasets; they wanted to be able to read text for blind persons or automatically translate foreign languages so we constructed text detection datasets; some people wanted to do personalized advertising (arguably not a good idea) so we engineered logo datasets. They all have their place in this specialized dataset section.

A.2.1 Aerial Imagery

The detection of small vehicles in aerial imagery is an old problem that has gained much traction in recent times. However, it is only in the last years that large datasets have been made publicly available, making the topic even more popular. The following paragraphs take inventory of these datasets and of the best performing methods.

Google Earth [120] comprises 30 images of the city of Bruxelles with 1,319 small cars and vertical bounding boxes. Its variability is not enormous, but it is still widely used in the literature. There are 5 folds. The best CNN result is [52], with 94.6 AP. It was later augmented with angle annotations by Henriques and Vedaldi [122]. The data can be found on Geremy Heitz's webpage (http://ai.stanford.edu/~gaheitz/Research/TAS/).

OIRDS [375], with only 180 vehicles, is not much used by the community.

DLR 3k Munich Dataset [219] is one of the most used datasets in the small vehicle detection literature, with 20 extra-large images: 10 training images with up to 3,500 cars and 70 trucks, and 10 test images with 5,800 cars and 90 trucks. Other classes are also available, such as car or truck trails and dashed lines. The state of the art seems to belong to [373], at 83% F1 on both cars and trucks, and [372], at 82%, which provides oriented boxes. Some relevant articles that compare on this dataset are [67, 350, 351]. The data can be downloaded by asking the provided contact on https://www.dlr.de/eoc/en/desktopdefault.aspx/tabid-5431/9230_read-42467/.

VeDAI [302] is dedicated to vehicle detection in aerial images. The vehicles contained in the database, in addition to being small, exhibit different variabilities such as multiple orientations, lighting/shadowing changes, occlusions, etc. Furthermore, each image is available in several spectral bands and resolutions; the same images are provided in two resolutions, 512x512 and 1024x1024. There are a total of 9 classes and 1,200 images, with an average of 5.5 instances per image. It is one of the few datasets to have 10 folds, and the metric is based on an ellipse-based distance between the center of the ground truth and the centers of the detections. The state of the art is currently held by [259]. Many recent articles, however, used their own metrics, which makes them difficult to compare [323, 351, 352, 372, 373]. VeDAI is available at https://downloads.greyc.fr/vedai/.
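Such a center-based criterion replaces the IoU test of the previous protocols: a detection is accepted when its center falls close enough to the center of a ground-truth vehicle. The snippet below is only a generic illustration of the idea, with a function name and semi-axes of our own choosing; the exact ellipse parameters and matching rules are those defined in [302].

    def center_hit(det_box, gt_center, semi_axes=(16.0, 8.0)):
        # det_box: [x1, y1, x2, y2]; gt_center: (cx, cy).
        # The semi-axes are illustrative placeholders, not the values of the
        # official VeDAI protocol.
        dcx = 0.5 * (det_box[0] + det_box[2]) - gt_center[0]
        dcy = 0.5 * (det_box[1] + det_box[3]) - gt_center[1]
        a, b = semi_axes
        return (dcx / a) ** 2 + (dcy / b) ** 2 <= 1.0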
COWC [252], introduced at ECCV2016, is a very large dataset with regions from all over the world and more than 32,000 cars. It also contains almost 60,000 hand-picked hard negative patches, which is a blessing when training detectors that do not include hard-example mining strategies. Unfortunately, no test data annotations are available, so detection methods cannot yet be properly tested on it. COWC is available at https://gdo152.llnl.gov/cowc/.

DOTA [421], released this year at CVPR, is the first mainstream dataset to change its metric to incorporate rotated bounding boxes, similarly to the text detection datasets. The images are of very different resolutions and zoom factors. There are 2,800 images with almost 200,000 instances and 15 categories. This dataset will surely become one of the important ones in the near future. The leaderboard (https://captain-whu.github.io/DOTA/results.html) shows that Mask R-CNN structures are the best at this task for the moment, with the winner culminating at 76.2 oriented mAP, but there is no other published method apart from [421] yet. UCAS-AOD [479], NWPU VHR10 [54] and HRSC2016 [223] also provide oriented annotations, but they are hard to find and very few articles actually use them. DOTA is available at https://captain-whu.github.io/DOTA/dataset.html.
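Oriented annotations mainly change the matching step of the evaluation: overlaps are computed between rotated quadrilaterals instead of axis-aligned rectangles. The sketch below relies on the shapely library for the polygon intersection, which is an implementation choice of ours (the official DOTA toolkit ships its own evaluation code); the rest of the pipeline, greedy matching and AP computation, is unchanged.

    from shapely.geometry import Polygon

    def rotated_iou(quad_a, quad_b):
        # Each quadrilateral is a list of four (x, y) corners given in order,
        # e.g. [(x1, y1), (x2, y2), (x3, y3), (x4, y4)].
        poly_a, poly_b = Polygon(quad_a), Polygon(quad_b)
        if not (poly_a.is_valid and poly_b.is_valid):
            return 0.0
        inter = poly_a.intersection(poly_b).area
        union = poly_a.area + poly_b.area - inter
        return inter / union if union > 0.0 else 0.0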
xView [186] is a very large scale dataset gathered by the Pentagon, containing 60 classes and 1 million instances. It is split into three parts: train, val and test. xView is available at http://xviewdataset.org. The first challenge will end in August 2018; no results are available yet.

VisDrone [481] is the most recent dataset including aerial images. Images were captured by different drones flying over 14 different cities separated by thousands of kilometers in China, in different scenarios and under various weather and lighting conditions. The dataset consists of 263 video sequences formed by 179,264 frames and 10,209 static images, and contains different objects such as pedestrians, vehicles, bicycles, etc., with varying density (sparse and crowded scenes). Frames are manually annotated with more than 2.5 million bounding boxes, and some attributes, e.g., scene visibility, object class and occlusion, are provided. VisDrone is very recent and no results are available yet. VisDrone is available at http://www.aiskyeye.com.

A.2.2 Text Detection in Images

Text detection in images or videos is a common way to extract content from images and opens the door to image retrieval or automatic text translation applications. We inventory, in the following, the main datasets as well as the best practices to address this problem.
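Most of the numbers quoted below are F-measures, i.e., the harmonic mean of the precision and recall obtained once detected and ground-truth text regions have been matched. The sketch below assumes a simple one-to-one matching at IoU >= 0.5 and is only meant to fix the definition; the actual ICDAR evaluation protocols are more involved (some of them also credit one-to-many and many-to-one matches).

    def f_measure(num_matched, num_detections, num_ground_truth):
        # num_matched: detections paired one-to-one with a ground-truth
        # region under the benchmark's criterion (e.g. IoU >= 0.5).
        precision = num_matched / num_detections if num_detections else 0.0
        recall = num_matched / num_ground_truth if num_ground_truth else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2.0 * precision * recall / (precision + recall)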
ICDAR 2003 [227] was one of the first public datasets for text detection. The dataset contains 509 scene images, and the scene text is mostly centered and iconic. Delakis and Garcia [65] were among the first to use a CNN on this dataset.

Street View Text (SVT) [403], taken from Google StreetView, is a dataset mostly filled with business names seen from outdoor streets. There are 350 images and 725 instances. One of the best performing methods on SVT is [468], with an F-measure of 83%. SVT can be downloaded from http://tc11.cvc.uab.es/datasets/SVT_1.

MSRA-TD500 [387] contains 500 natural images taken from indoor (office and mall) and outdoor (street) scenes. The resolutions of the images vary from 1296 × 864 to 1920 × 1280. It contains Chinese and English texts, as well as mixed ones. The training set contains 300 images randomly selected from the original dataset, and the remaining 200 images constitute the test set. The best performing method on MSRA-TD500 is [212], with an F-measure of 79%. Shi et al. [333], Yao et al. [435], Ma et al. [228] and Zhang et al. [466] also performed very well (F-measures of 77%, 76%, 75% and 75%, respectively). The dataset is available at http://www.iapr-tc11.org/mediawiki/index.php/MSRA_Text_Detection_500_Database_(MSRA-TD500).

IIIT 5k-word [242] has 1,120 images and 5,000 words from both street scene texts and born-digital images. 380 images are used for training and the remaining ones for testing. Each text also has a category label, easy or hard. [212] is the state of the art, as for MSRA-TD500. IIIT 5k-word is available at http://cvit.iiit.ac.in/projects/SceneTextUnderstanding/IIIT5K.html.

Synth90K [153] is a completely generated grayscale text dataset, with multiple fonts and words well blended into scenes: it contains 9 million images built from a 90,000-word vocabulary. It can be found on the VGG page at http://www.robots.ox.ac.uk/~vgg/data/text/.

ICDAR 2015 [165] is another popular iteration of the ICDAR challenge, following ICDAR 2013. Busta et al. [26] obtained a state-of-the-art F-measure of 87%, compared to the 83.8% of Liao et al. [212] and the 82.54% of Jiang et al. [159]. TextBoxes++ [211] reached 81.7% and Shi et al. [333] is at 75%.

COCO Text [396], based on MS COCO, is the biggest dataset for text detection. It has 63,000 images with 173,000 annotations. [212] and [477] are, so far, the only published results that differ from the baselines implemented in the dataset paper [396], so there must still be room for improvement. The very recent [211] outperformed [477]. COCO Text is available at https://bgshih.github.io/cocotext/.

RCTW-17 (ICDAR 2017) [334] is the latest ICDAR database. It is a large line-based dataset with mostly Chinese text. Liao et al. [212] achieved the state of the art on this one too, with an F-measure of 67.0%. The dataset is available at http://www.icdar2017chinese.site/dataset/.

A.2.3 Face Detection

Face detection is one of the most widely addressed detection tasks. Even if the detection of frontal faces in high-resolution images is an almost solved problem, there is room for improvement when the conditions are harder (non-frontal images, small faces, etc.). These harder conditions are reflected by the following recent datasets. The main characteristics of the different face datasets are summarized in Table 8.

Dataset              #Images  #Faces   Source              Type
FDDB [155]           2,845    5,171    Yahoo! News         Images
AFLW [177]           21,997   25,993   Flickr              Images
AFW [483]            205      473      Flickr              Images
PASCAL Faces [430]   851      1,335    Pascal-VOC          Images
MALF [20]            5,250    11,931   Flickr, Baidu Inc.  Images
IJB-A [172]          24,327   67,183   Google, Bing, etc.  Images/Videos
IIIT-CFW [241]       8,927    8,928    Google              Images
Wider Face [433]     32,203   393,703  Google, Bing        Images
IJB-B [412]          76,824   125,474  Freebase            Images/Videos
IJB-C [238]          148,876  540,630  Freebase            Images/Videos
Wildest Faces [444]  67,889   109,771  YouTube             Videos
UFDD [253]           6,424    10,895   Google, Bing, etc.  Images

Table 8: Datasets for face detection.

Face Detection Data Set and Benchmark (FDDB) [155] is built using Yahoo!, with 2,845 images and a total of 5,171 faces; it has a wide range of difficulties such as occlusions, strong pose changes, low resolution and out-of-focus faces, with both grayscale and color images. Zhang et al. [458] obtained an AUR of 98.3%, which is currently the state of the art on this dataset. Najibi et al. [255] obtained 98.1%. The dataset can be downloaded at http://vis-www.cs.umass.edu/fddb/index.html.

Annotated Facial Landmarks in the Wild (AFLW) [177] is made from a collection of images gathered on Flickr, with a large variety in face appearance (pose, expression, ethnicity, age, gender) and environmental conditions. It has the particularity of not being aimed at face detection only; it is more oriented towards landmark detection and face alignment. In total, 25,993 faces in 21,997 real-world images are annotated. Annotations come with rich facial landmark information (21 landmarks per face). The dataset can be downloaded from https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/.

Annotated Face in-the-Wild (AFW) [483] is a dataset containing faces in real conditions, with their associated annotations (bounding box, facial landmarks and pose angle labels). Each image contains multiple, non-frontal faces. The dataset contains 205 images with 468 faces. Zhang et al. [458] obtained an AP of 99.85% on this dataset, which is currently the state of the art.

PASCAL Faces [430] contains images selected from PASCAL VOC [88] in which the faces have been annotated. [458] obtained an AP of 98.49% on this dataset, which is currently the state of the art.

Multi-Attribute Labeled Faces (MALF) [20] incorporates richer semantic annotations such as pose, gender and occlusion information, as well as expression information. It contains 5,250 images collected from the Internet and approximately 12,000 labeled faces. The dataset and up-to-date results of the evaluation can be found at http://www.cbsr.ia.ac.cn/faceevaluation/.
Wider Face [433] is one of the largest datasets for face detection. Each annotation includes information such as scale, occlusion, pose, overall difficulty and events, which makes in-depth analyses possible. This dataset is very challenging, especially its 'hard' set. Najibi et al. [255] obtained APs of 93.1% (easy), 92.1% (medium) and 84.5% (hard), which is currently the state of the art on this dataset. Zhang et al. [458] also perform very well, with APs of 92.8% (easy), 91.3% (medium) and 84.0% (hard). The dataset and results can be downloaded at http://mmlab.ie.cuhk.edu.hk/projects/WIDERFace/.

IARPA Janus Benchmark A (IJB-A) [172] contains images and videos from 500 subjects captured in 'in the wild' environments, and contains annotations for both recognition and detection tasks. All labeled faces are localized with bounding boxes as well as with landmarks (center of the two eyes, base of the nose). IJB-B [412] extended this dataset with 1,845 subjects, for 21,798 still images and 55,026 frames from 7,011 videos. IJB-C [238], which is the new extended version of the IARPA Janus Benchmarks A and B, adds 1,661 new subjects to the 1,870 subjects released in IJB-B. The NIST Face Challenges are at https://www.nist.gov/programs-projects/face-challenges.

Un-constrained Face Detection Dataset (UFDD) [253] was built after noting that, in many challenges, large variations in scale, pose and appearance are successfully addressed, but that there remains a gap between the performance of state-of-the-art detectors and real-world requirements, not captured by existing methods or datasets. UFDD aims at identifying the next set of challenges and collects a new dataset of face images that involve variations such as weather-based degradations, motion blur and focus blur. The authors also provide an in-depth analysis of the results and failure cases of these methods. This dataset is very recent and has not been used specifically yet. However, Nada et al. [253] reported the performances (in terms of AP) of Faster R-CNN [309] (52.1%), SSH [255] (69.5%), S3FD [458] (72.5%) and HR-ER [137] (74.2%). The dataset and results can be downloaded at http://www.ufdd.info/.

Figure 23: Number of images vs number of faces in each dataset (Table 8), on a log scale. The size of the bubble indicates the average number of faces per image, which can be used as an estimate of the complexity of the dataset.

IIIT-Cartoon Faces in the Wild (IIIT-CFW) [241] contains 8,927 annotated images of cartoon faces belonging to 100 famous personalities, harvested from Google image search, with annotations including attributes such as age group, view, expression, pose, etc. The benchmark includes 7 challenges: cartoon face recognition, cartoon face verification, cartoon gender identification, photo2cartoon and cartoon2photo, face detection, pose estimation and landmark detection, relative attributes in cartoons, and attribute-based cartoon search. Jha et al. [157] published state-of-the-art detection results using a Haar features-based detector, with an F-measure of 84%. The dataset can be downloaded from http://cvit.iiit.ac.in/research/projects/cvit-projects/cartoonfaces.

Wildest Faces [444] is a dataset where the emphasis is put on violent scenes in unconstrained scenarios. It contains images of diverse quality, resolution and motion blur. It includes 68K images (aka video frames) and 2,186 shots of 64 fighting celebrities. All of the video frames are manually annotated to foster research on both detection and recognition. The dataset had not been released at the time this survey was written.

A.2.4 Pedestrian Detection

Pedestrian detection is one of the specific tasks abundantly studied in the literature, especially since research on autonomous vehicles has intensified.

MIT [272] is one of the first pedestrian datasets. It is puny in size (509 training and 200 testing images). The images were extracted from the LabelMe database. It can be found at http://cbcl.mit.edu/software-datasets/PedestrianData.html.
FDDB ETH [87] was captured from a stroller. There are
500000 AFLW

AFW
490 training frames with 1578 annotations. There
100000
PASCAL Faces are three test sets. The first test set has 999
50000
MALF

IJB-A
frames with 5193 annotations, the second one 450
IIIT-CFW and 2359 and the third one 354 and 1828 respec-
Faces

10000 Wider Face

IJB-B
tively. The stereo cues are available. It is a diffi-
5000
IJB-C cult dataset where the state-of-the-art from Zhang
1000
Wildest Faces

UFDD
et al. [459] trained on CityPersons still remains at
500 24.5% log average miss rate. The boosted forest
of Zhang et al. [455] gets 30.2% only. It is avail-
0

00

00

00
50

00

00
10

50

00
10

50

10

Images
able at https://data.vision.ee.ethz.ch/cvl/
aess/iccv2007/
Daimler DB [84] is an old dataset captured in an
Figure 23: Number of images vs number of faces in urban setting, builds on DaimlerChrysler datasets
each dataset (Table 8) on a log scale. The size of with only grayscale images. It has been recently
the bubble indicates average number of faces per extended with Cyclist annotations into the Ts-
image which can be used as an estimate of com- inghua Daimler Cyclist (TDC) dataset [202] with
plexity of the dataset. color images. The dataset is available at http:
//www.gavrila.net/Datasets/datasets.html.
TUD-Brussels [414] is from the TU Darmstadt
from the LabelMe database. You can find University and contains image pairs recorded in a
it at http://cbcl.mit.edu/software-datasets/ crowded urban setting with an on-board camera
PedestrianData.html from a car. There are 1092 image pairs with
INRIA [64] is currently one of the most popu- 1776 annotations in the training set. The test set
lar static pedestrian detection datasets introduced contains 508 image pairs with 1326 pedestrians.
in the seminal HOG paper [64]. It uses obvi- The evaluation is measured from the recall at 90%
ously the Caltech metric. Zhang et al. [459] gained precision, somehow reminiscent of KITTI dataset.
state-of-the-art with 6.4% log average miss rate. TUD-Brussels is available at https://www.mpi-
Method at the second position is [455] with 6.9% inf.mpg.de/departments/computer-vision-
using the RPN from Faster R-CNN and boosted and-multimodal-computing/research/people-
forests on extracted features. The others are detection-pose-estimation-and-tracking/
not CNN methods (the third one using pooling multi-cue-onboard-pedestrian-detection/.
with HOG, LBP and covariance matrices). It can Caltech USA [71] contains images are captured
be found at http://pascal.inrialpes.fr/data/ in the Greater Los Angeles area by an independent
human/. Similarly, PASCAL Persons dataset is a driver to simulate real-life conditions without any
subset of the aforementioned Pascal-VOC dataset. bias. 192,000 pedestrian instances are available for
training. 155,000 for testing. The evaluation use
CVC-ADAS [100] is a collection of datasets in- Pascal-VOC criteria at 0.5 IoU. The performance
cluding videos acquired on board, virtual-world measure is the log average miss rate as application
pedestrians and real pedestrians. It can be found wise one cannot have too many False Positive per
at following http://adas.cvc.uab.es/site/. Image (FPPI). It is computed by averaging miss
USC [417] is an old small pedestrian rates at 9 FPPIs from 10−2 to 1 uniformly in log
dataset taken largely from surveillance scale. State-of-the-art algorithms are at around 4%
videos. It is still downloadable at http: log average miss rate. Wang et al. [409] got 4.0%
//iris.usc.edu/Vision-Users/OldUsers/ by using a novel bounding box regression loss. Fol-
bowu/DatasetWebpage/dataset.html lowing it, we have Zhang et al. [459] at 4.1% using
Caltech USA [71] contains images captured in the Greater Los Angeles area by an independent driver, to simulate real-life conditions without any bias. 192,000 pedestrian instances are available for training and 155,000 for testing. The evaluation uses the Pascal-VOC criteria at 0.5 IoU. The performance measure is the log average miss rate, as application-wise one cannot have too many False Positives per Image (FPPI). It is computed by averaging the miss rates at 9 FPPI values from 10^-2 to 1, spaced uniformly in log scale. State-of-the-art algorithms are at around 4% log average miss rate. Wang et al. [409] got 4.0% by using a novel bounding box regression loss. Following it, we have Zhang et al. [459] at 4.1%, using a novel RoI-Pooling of parts helping with occlusions and pre-training on CityPersons. Mao et al. [231] is lagging behind with 5.5%, using a Faster R-CNN with additional aggregated features. There also exists a CalTech Japan dataset. The benchmark is hosted at http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/.
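For illustration, the following minimal sketch (a simplified Python re-implementation of the protocol described above, with a function name of our own choosing; it is not the official evaluation code of the benchmark) computes the log average miss rate from a miss rate vs. FPPI curve:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Caltech-style log average miss rate (simplified sketch).

    fppi, miss_rate: arrays describing the detector's miss rate vs.
    false positives per image curve (same length, fppi increasing).
    The metric averages the miss rate sampled at `num_points` FPPI
    values spread uniformly in log space between 1e-2 and 1.
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    # Reference FPPI values: 9 points from 10^-2 to 10^0 in log scale.
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        # Take the miss rate at the largest FPPI not exceeding the
        # reference; if the curve starts above it, use its first value.
        idx = np.where(fppi <= ref)[0]
        sampled.append(miss_rate[idx[-1]] if idx.size else miss_rate[0])
    # Average in log space (geometric mean of the sampled miss rates).
    return float(np.exp(np.mean(np.log(np.maximum(sampled, 1e-10)))))
```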
KITTI [98] is one of the most famous datasets in Computer Vision, taken over the city of Karlsruhe in Germany. There are 100,000 instances of pedestrians, with around 6000 identities and one person on average per image. The preferred metric is the AP (Average Precision) on the moderate set (persons who are less than 25 pixels tall are left out of the ranking). Li et al. [200] got 65.01 AP on moderate by using an adapted version of Fast R-CNN with different heads to deal with different scales. The state-of-the-art of Chen et al. [45] had to rely on stereo information to get good object proposals and reached 67.47 AP. All KITTI related datasets are found at http://www.cvlibs.net/datasets/kitti/index.php.

GM-ATCI [340] is a dataset captured from a fisheye-lens camera that uses the CalTech evaluation system. We could not find any CNN detection results on it, possibly because the state-of-the-art using multiple cues is already pretty good, with 3.5% log average miss rate. The sequences can be downloaded at https://sites.google.com/site/rearviewpeds1/.

CityPersons [456] is a relatively new dataset that builds upon CityScapes [58], a semantic segmentation dataset recorded in 27 different cities in Germany. There are 19,744 persons in the training set and around 11,000 in the test set. There are way more identities present than in CalTech even though there are fewer instances (1300 identities in CalTech w.r.t. 19000 in CityPersons). Therefore, it is more diverse and thus more challenging. The metric is the same as CalTech, with some subsets like the Reasonable one: pedestrians that are more than 50 pixels tall and less than 35% occluded. Again, Zhang et al. [459] and Wang et al. [409] take the lead with 11.32% and 11.48% respectively on the reasonable set, w.r.t. the baseline of an adapted Faster R-CNN that stands at 12.97% log average miss rate. The dataset is available at https://bitbucket.org/shanshanzhang/citypersons.
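As an illustration of how such evaluation subsets are defined, a minimal sketch of a Reasonable-style filter could look as follows (a hypothetical helper of our own, assuming each annotation carries a pixel height and an occlusion fraction; it is not code from the benchmark):

```python
def reasonable_subset(annotations, min_height=50, max_occlusion=0.35):
    """Keep only 'Reasonable' pedestrians in the CityPersons sense:
    at least `min_height` pixels tall and at most `max_occlusion`
    (fraction of the box that is hidden) occluded."""
    return [a for a in annotations
            if a["height"] >= min_height and a["occlusion"] <= max_occlusion]
```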
EuroCity [25] is the largest pedestrian detection dataset ever released, with 238,300 instances in 47,300 images. Images are taken over 31 cities in 12 different European countries. The metric is the same as CalTech. Three baselines were tested (Faster R-CNN, R-FCN and YOLOv3). Faster R-CNN dominated on the reasonable set with 8.1%, followed by YOLOv3 with 8.5% and R-FCN lagging behind with 12.1%. On other subsets, with heavily occluded or small pedestrians, the ranking is not the same. We refer the reader to the dataset paper [25].

A.2.5 Logo Detection

Logo detection was attracting a lot of attention in the past, due to the specificity of the task. At the moment we write this survey, there are fewer papers on this topic and most of the logo detection pipelines are direct applications of Faster RCNN [309].

BelgaLogos [160] images come from the BELGA press agency. The dataset is composed of 10,000 images covering all aspects of life and current affairs: politics and economics, finance and social affairs, sports, culture and personalities. All images are in JPEG format and have been re-sized with a maximum value of height and width equal to 800 pixels, preserving the aspect ratio. There are 26 different logos. Only a few images are annotated with bounding boxes. The dataset can be downloaded at https://www-sop.inria.fr/members/Alexis.Joly/BelgaLogos/BelgaLogos.html.

FlickrLogos [80, 313] consists of real-world images collected from Flickr, depicting company logos in various situations. The dataset comes in two versions: the original FlickrLogos-32 dataset and the FlickrLogos-47 [80] dataset. In FlickrLogos-32 the annotations for object detection were often incomplete, since only the most prominent logo instances were labeled. FlickrLogos-47 uses the same image corpus as FlickrLogos-32, but new classes were introduced (logo and text as separate classes) and missing object instances have been annotated.
FlickrLogos-47 contains 833 training and 1402 testing images. The dataset can be downloaded at http://www.multimedia-computing.de/flickrlogos/.

Logo32plus [17] is an extension of the training set of FlickrLogos-32 [80]. It has the same classes of objects but many more training instances (12,312 instances). The dataset can be downloaded at http://www.ivl.disco.unimib.it/activities/logorecognition.

WebLogo-2M [358] is very large, but annotated at image level only and does not contain bounding boxes. It contains 194 logo classes and over 2 million logo images. Labels are noisy as the annotations are automatically generated. Therefore, this dataset is designed for large-scale logo detection model learning from noisy training data. For performance evaluation, the dataset includes 6,569 test images with manually labeled logo bounding boxes for all the 194 logo classes. The dataset can be downloaded at http://www.eecs.qmul.ac.uk/%7Ehs308/WebLogo-2M.html/.

SportsLogo [213], in the absence of a public video logo dataset, was collected on a set of tennis videos containing 20 different tennis video clips with camera motions (blurring) and occlusions. The logos can appear on the background as well as on players' and staff's clothes. 20 logos are annotated, with about 100 images for each logo.

Logos in the Wild [389] contains images collected from the web with logo annotations provided in Pascal-VOC style. It contains large varieties of brands in-the-wild. The latest version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It contains from 4 to 608 images per searched brand, and 238 brands occur at least 10 times. It has up to 118 logos in one image. Only the links to the images are released, which is problematic as numerous images have already disappeared, making exact comparisons impossible. The dataset can be downloaded from https://www.iosb.fraunhofer.de/servlet/is/78045/.

Open Logo Detection Challenge [360]. This dataset assumes that only a small proportion of logo classes are annotated whilst the remaining classes have no labeled training data. It contrasts with previous logo datasets, which assumed all the logo classes are annotated. The OpenLogo challenge contains 27,189 images from 309 logo classes, built by aggregating/refining 7 existing datasets and establishing an open logo detection evaluation protocol. The dataset can be downloaded at https://qmul-openlogo.github.io.

Dataset                    #Classes   #Images
BelgaLogos [160]           26         10,000
FlickrLogos-32 [313]       32         8,240
FlickrLogos-47 [80]        47         8,240
Logo32plus [17]            32         7,830
WebLogo-2M [358]           194        2,190,757
SportsLogo [213]           20         1,978
Logos in the Wild [389]    871        11,054
OpenLogos [360]            309        27,189

Table 9: Datasets for logo detection.

A.2.6 Traffic Signs Detection

This section reviews the 4 main datasets and benchmarks for evaluating traffic sign detectors [133, 246, 379, 490], as well as the Bosch Small Traffic Lights dataset [13]. The most challenging one is the Tsinghua Tencent 100k (TTK100) [490], on which Faster RCNN-like detectors such as [285] have an overall precision/recall of 44%/68%, which shows the difficulty of the dataset.

LISA Traffic Sign Dataset [246] was among the first datasets for traffic sign detection. It contains 47 US signs and 7,855 annotations on 6,610 video frames. Sign sizes vary from 6x6 to 167x168 pixels. Each sign is annotated with sign type, position, size, occluded (yes/no), on side road (yes/no). The URL for this dataset is http://cvrr.ucsd.edu/LISA/lisa-traffic-sign-dataset.html.

The German Traffic Sign Detection Benchmark (GTSDB) [133] is one of the most popular traffic sign detection benchmarks. It introduced a dataset with evaluation metrics, baseline results, and a web interface for comparing approaches. The dataset provides a total of 900 images with 1,206 traffic signs.
The traffic sign sizes vary between 16 and 128 pixels w.r.t. the longest edge. The image resolution is 1360 × 800; images capture different scenarios (urban, rural, highway) during daytime and at dusk, featuring various weather conditions. It can be found at http://benchmark.ini.rub.de/?section=gtsdb&subsection=news.

Belgian TSD [379] consists of 7,356 still images for training, with a total of 11,219 annotations, corresponding to 2,459 traffic signs visible at less than 50 meters in at least one view. The test set contains 4 sequences, captured by 8 roof-mounted cameras on a van, with a total of 121,632 frames and 269 different traffic signs for evaluating the detectors. For each sign, the type and 3D location are given. The dataset can be downloaded at https://btsd.ethz.ch/shareddata/.

Tsinghua Tencent 100k (TTK100) [490] provides 2048 × 2048 images for traffic sign detection and classification, with various illumination and weather conditions. It's the largest dataset for traffic sign detection, with 100,000 images out of which 16,787 contain traffic sign instances, for a total of 30,000 traffic instances. There are a total of 128 classes. Each instance is annotated with class label, bounding box and pixel mask. It has small objects in abundance and huge scale variations. Some signs which are naturally rare, e.g. signs warning the driver to be cautious on mountain roads, have a quite low number of instances. There are 45 classes with at least 100 instances present. The dataset can be obtained at http://cg.cs.tsinghua.edu.cn/traffic%2Dsign/.

Bosch Small Traffic Lights [13] is made for benchmarking traffic light detectors. It contains 13,427 images of size 1280 × 720 pixels with around 24,000 annotated traffic lights, annotated with bounding boxes and states (active light). The best performing algorithm is [285], which obtained a mAP of 53 on this dataset. Bosch Small Traffic Lights can be downloaded at https://hci.iwr.uni-heidelberg.de/node/6132.

A.2.7 Other Datasets

Some datasets do not fit in any of the previously mentioned categories but deserve to be mentioned because of the interest the community has for them.

iNaturalist Species Classification and Detection Dataset [394] contains 859,000 images from over 5,000 different species of plants and animals. The goal of this dataset is to encourage the development of algorithms for 'in the wild' data featuring large numbers of imbalanced, fine-grained categories. The dataset can be downloaded at https://github.com/visipedia/inat_comp/tree/master/2017.

Below we give all known datasets that can be used to tackle object detection with the different modalities that we presented in Sec. 4.1.

A.3 3D Datasets

KITTI object detection benchmark [98] is the most widely used dataset for evaluating detection in 3D point clouds. It contains 3 main categories (namely 2D, 3D and birds-eye-view objects), 3 object categories (cars, pedestrians and cyclists), and 3 difficulty levels (easy, moderate and hard, considering the object size, distance, occlusion and truncation). The dataset is public and contains 7,481 images for training and 7,518 for testing, comprising a total of 80,256 labeled objects. The 3D point clouds are acquired with a Velodyne laser scanner. 3D object detection performance is evaluated using the PASCAL criteria also used for 2D object detection. For cars a 3D bounding box overlap of 70% is required, while for pedestrians and cyclists a 3D bounding box overlap of 50% is required. For evaluation, precision-recall curves are computed and the methods are ranked according to average precision. The algorithms can use the following sources of information: i) Stereo: the method uses left and right (stereo) images; ii) Flow: the method uses optical flow (2 temporally adjacent images); iii) Multiview: the method uses more than 2 temporally adjacent images; iv) Laser Points: the method uses point clouds from the Velodyne laser scanner; v) Additional training data: use of additional data sources for training.
The datasets and performance of SOTA detectors can be downloaded at http://www.cvlibs.net/datasets/kitti/, and the leader board is at http://www.cvlibs.net/datasets/kitti/eval_object.php?obj_benchmark=3d. One of the leading methods is [342], which is at an mAP of 67.72/64.00/63.01 (Easy/Mod./Hard) for the car category, at 50 fps. Slower (10 fps) but more accurate, [182] has a performance of 81.94/71.88/66.38 on cars. Chen et al. [47], Zhou and Tuzel [478] and Qi et al. [289] also gave very good results.
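To make the evaluation protocol more concrete, the sketch below (plain Python with NumPy; the function and its simplifications are ours, not the official KITTI devkit) shows the core of a PASCAL-style AP computation: detections are greedily matched to ground truth at a fixed overlap threshold (e.g. 0.7 for cars, 0.5 for pedestrians and cyclists) and the precision-recall curve is summarized by its average precision. The real devkit additionally handles difficulty levels, "don't care" regions and the 3D / bird's-eye-view overlap computation, all omitted here:

```python
import numpy as np

def average_precision(detections, ground_truths, overlap_fn, iou_thr=0.7):
    """PASCAL-style AP for one class (simplified KITTI-like protocol).

    detections:    list of (image_id, score, box) tuples.
    ground_truths: dict image_id -> list of ground-truth boxes.
    overlap_fn:    function(box_a, box_b) -> overlap in [0, 1]
                   (2D IoU, 3D IoU or bird's-eye-view IoU).
    """
    matched = {img: np.zeros(len(boxes), dtype=bool)
               for img, boxes in ground_truths.items()}
    n_gt = sum(len(boxes) for boxes in ground_truths.values())

    # Process detections from most to least confident.
    detections = sorted(detections, key=lambda d: -d[1])
    tp, fp = [], []
    for img, _, box in detections:
        gts = ground_truths.get(img, [])
        overlaps = [overlap_fn(box, g) for g in gts]
        best = int(np.argmax(overlaps)) if overlaps else -1
        if best >= 0 and overlaps[best] >= iou_thr and not matched[img][best]:
            matched[img][best] = True   # greedy one-to-one assignment
            tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)

    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)

    # Area under the interpolated (monotone) precision-recall curve.
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recall, interp):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```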
Active Vision Dataset (AVD) [5] contains 30,000+ RGBD images, 30+ frequently occurring instances, 15 scenes, and 70,000+ 2D bounding boxes. This dataset focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset can be downloaded at http://cs.unc.edu/~ammirato/active_vision_dataset_website/.

SceneNet RGB-D [239] is a synthetic dataset designed for scene understanding problems such as semantic segmentation, instance segmentation, and object detection. It provides camera poses and depth data and permits creating any scene configuration. 5M rendered RGB-D images from 16K randomly generated 3D trajectories in synthetic layouts are also provided. The dataset can be downloaded at http://robotvault.bitbucket.io/scenenet-rgbd.html.

Falling Things [384] introduced a novel synthetic dataset for 3D object detection and pose estimation, the Falling Things dataset. The dataset contains 60k annotated photos of 21 household objects taken from the YCB dataset. For each image, the 3D poses, per-pixel class segmentation, and 2D/3D bounding box coordinates for all objects are given. To facilitate testing different input modalities, mono and stereo RGB images are provided, along with registered dense depth images. The dataset can be downloaded at http://research.nvidia.com/publication/2018-06_Falling-Things.

A.4 Video Datasets

The two most popular datasets for video object detection are YouTube-BoundingBoxes [303] and the ImageNet VID challenge [319]. Both are reviewed in this section.

YouTube-BoundingBoxes [303] is a dataset of video URLs with single object bounding box annotations. All video sequences are annotated with classifications and bounding boxes, at 1 frame per second. There is a total of about 380,000 video segments of 15-20 seconds, from 240,000 publicly available YouTube videos, featuring objects in natural settings, without editing or post-processing. Real et al. [303] reported a mAP of 59 on this dataset. This dataset can be downloaded at https://research.google.com/youtube-bb/.

ImageNet VID challenge [319] was a part of the ILSVRC 2015 challenge. It has a training set of 3,862 fully annotated video sequences, having a length from 6 frames to 5,492 frames per video. The validation set contains 555 fully annotated videos, ranging from 11 frames to 2898 frames per video. Finally, the test set contains 937 video sequences, and the ground-truth annotations are not publicly available. One of the best performing methods on ImageNet VID is [89], with a mAP of 79.8, obtained by combining detection and tracking. Zhu et al. [484] reached 76.3 points with a flow-based approach. This dataset can be downloaded at http://image-net.org/challenges/LSVRC.

VisDrone [481] contains video clips acquired by drones. This dataset is presented in Section 5.2.1.

A.5 Concluding Remarks

This appendix gave a large overview of the datasets introduced by the community for developing and evaluating object detectors in images, videos or 3D point clouds. Each object detection dataset presents a very biased view of the world, as shown in [169, 380, 381], representative of the users' needs when they built it. The bias is not only in the images they chose (specific views of objects, object imbalance [264], object categories) but also in the metric they created and the evaluation protocol they devised. The community is trying its best to build more and more datasets with less and less bias, and as a result it has become quite hard to find one's way in this jungle of datasets, especially
when one needs older datasets that have fallen out of fashion, or exhaustive lists of state-of-the-art algorithm performances on modern ones. Through this survey we have partially addressed this need for a common source of information on datasets.