

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image
convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional
network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to
generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN
into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with
’attention’ mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3],
our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection
accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO
2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been
made publicly available.

Index Terms—Object Detection, Region Proposal, Convolutional Neural Network.

• S. Ren is with University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research. E-mail: [email protected]
• K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com
• R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research. E-mail: [email protected]

1 INTRODUCTION

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network's computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use



Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps
are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on
the feature map. (c) We use pyramids of reference boxes in the regression functions.

pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel "anchor" boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.¹

¹ Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks leading to less training time.

We comprehensively evaluate our method on the PASCAL VOC detection benchmarks [11] where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of Selective Search at test-time—the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [3], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy. We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/shaoqingren/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built in commercial systems such as at Pinterests [17], with user engagement improvements reported.

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions.² These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

² http://image-net.org/challenges/LSVRC/2015/results

2 RELATED WORK

Object Proposals. There is a large literature on object proposal methods. Comprehensive surveys and comparisons of object proposal methods can be found in [19], [20], [21]. Widely used object proposal methods include those based on grouping super-pixels (e.g., Selective Search [4], CPMC [22], MCG [23]) and those based on sliding windows (e.g., objectness in windows [24], EdgeBoxes [6]). Object proposal methods were adopted as external modules independent of the detectors (e.g., Selective Search [4] object detectors, R-CNN [5], and Fast R-CNN [2]).

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods [26], [27] generate region proposals from a network whose last

fully-connected layer simultaneously predicts multiple class-agnostic boxes, generalizing the "single-box" fashion of OverFeat. These class-agnostic boxes are used as proposals for R-CNN [5]. The MultiBox proposal network is applied on a single image crop or multiple large image crops (e.g., 224×224), in contrast to our fully convolutional scheme. MultiBox does not share features between the proposal and detection networks. We discuss OverFeat and MultiBox in more depth later in context with our method. Concurrent with our work, the DeepMask method [28] is developed for learning segmentation proposals.

Shared computation of convolutions [9], [1], [29], [7], [2] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [9] computes convolutional features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [1] on shared convolutional feature maps is developed for efficient region-based object detection [1], [30] and semantic segmentation [29]. Fast R-CNN [2] enables end-to-end detector training on shared convolutional features and shows compelling accuracy and speed.

3 FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the recently popular terminology of neural networks with 'attention' [31] mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with features shared.

[Figure 2 schematic: image → conv layers → feature maps; the feature maps feed the Region Proposal Network, whose proposals are passed together with the feature maps through RoI pooling to the classifier.]
Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the 'attention' of this unified network.

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score.³ We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers. In our experiments, we investigate the Zeiler and Fergus model [32] (ZF), which has 5 shareable convolutional layers, and the Simonyan and Zisserman model [3] (VGG-16), which has 13 shareable convolutional layers.

³ "Region" is a generic term and in this paper we only consider rectangular regions, as is common for many methods (e.g., [27], [4], [6]). "Objectness" measures membership to a set of object classes vs. background.
⁴ For simplicity we implement the cls layer as a two-class softmax layer. Alternatively, one may use logistic regression to produce k scores.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. Each sliding window is mapped to a lower-dimensional feature (256-d for ZF and 512-d for VGG, with ReLU [33] following). This feature is fed into two sibling fully-connected layers—a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Figure 3 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n convolutional layer followed by two sibling 1 × 1 convolutional layers (for reg and cls, respectively).

3.1.1 Anchors

At each sliding-window location, we simultaneously predict multiple region proposals, where the number of maximum possible proposals for each location is denoted as k. So the reg layer has 4k outputs encoding the coordinates of k boxes, and the cls layer outputs 2k scores that estimate probability of object or not object for each proposal.⁴ The k proposals are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position.
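For concreteness, the head just described (an n × n convolution followed by two sibling 1 × 1 convolutions producing 2k scores and 4k coordinates) can be sketched as below. This is a minimal PyTorch-style sketch written for this summary, not the authors' released implementation (which is in MATLAB/Python with Caffe); the class and variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RPNHead(nn.Module):
    """Sliding 3x3 window over the shared feature map, followed by two
    sibling 1x1 convolutions: 2k objectness scores and 4k box coordinates."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)  # object vs. not object
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)  # box regression

    def forward(self, feature_map):
        x = F.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)

# A VGG-16 conv5 feature map for a rescaled ~600x1000 input is roughly 40x60,
# and every spatial position carries k = 9 anchors.
scores, deltas = RPNHead()(torch.zeros(1, 512, 40, 60))
print(scores.shape, deltas.shape)  # (1, 18, 40, 60) and (1, 36, 40, 60)
```

The 256-d intermediate feature of the ZF net would correspond to in_channels = mid_channels = 256 in this sketch.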

[Figure 3, left panel: at each sliding-window position on the conv feature map, a 256-d intermediate layer feeds a cls layer (2k scores) and a reg layer (4k coordinates) over k anchor boxes. Right panel: example detections with class scores such as person, dog, horse, car, cat, bus, and boat.]

Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL
VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.
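The anchor set itself is easy to enumerate. The NumPy sketch below (written for this summary, with illustrative function names) builds the 9 reference boxes from 3 scales and 3 aspect ratios and tiles them over a feature-map grid at the network's total stride; the count it prints matches the WHk total discussed next.

```python
import numpy as np

def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """k = len(scales) * len(ratios) reference boxes centered at the origin,
    returned as (x1, y1, x2, y2). A ratio r here means height/width = r."""
    anchors = []
    for s in scales:              # the box area is s * s
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

def shift_anchors(anchors, feat_w, feat_h, stride=16):
    """Place every reference box at every feature-map position."""
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (anchors[None, :, :] + shifts).reshape(-1, 4)

A = base_anchors()                        # 9 anchors: 3 scales x 3 aspect ratios
all_anchors = shift_anchors(A, feat_w=60, feat_h=40)
print(len(A), all_anchors.shape)          # 9 (21600, 4), i.e. W*H*k = 60*40*9
```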

For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.

Translation-Invariant Anchors

An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. This translation-invariant property is guaranteed by our method.⁵ As a comparison, the MultiBox method [27] uses k-means to generate 800 anchors, which are not translation invariant. So MultiBox does not guarantee that the same proposal is generated if an object is translated.

The translation-invariant property also reduces the model size. MultiBox has a (4 + 1) × 800-dimensional fully-connected output layer, whereas our method has a (4 + 2) × 9-dimensional convolutional output layer in the case of k = 9 anchors.⁶ As a result, our output layer has 2.8 × 10⁴ parameters (512 × (4 + 2) × 9 for VGG-16), two orders of magnitude fewer than MultiBox's output layer that has 6.1 × 10⁶ parameters (1536 × (4 + 1) × 800 for GoogleNet [34] in MultiBox [27]). If considering the feature projection layers, our proposal layers still have an order of magnitude fewer parameters than MultiBox.⁷ We expect our method to have less risk of overfitting on small datasets, like PASCAL VOC.

⁵ As is the case of FCNs [7], our network is translation invariant up to the network's total stride.
⁶ 4 is the dimension of the reg term for each category, and 1 or 2 is the dimension of the cls term of sigmoid or softmax for each category.
⁷ Considering the feature projection layers, our proposal layers' parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10⁶; MultiBox's proposal layers' parameter count is 7 × 7 × (64 + 96 + 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10⁶.

Multi-Scale Anchors as Regression References

Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps (HOG [8] or deep convolutional features [9], [1], [2]) are computed for each scale (Figure 1(a)). This way is often useful but is time-consuming. The second way is to use sliding windows of multiple scales (and/or aspect ratios) on the feature maps. For example, in DPM [8], models of different aspect ratios are trained separately using different filter sizes (such as 5×7 and 7×5). If this way is used to address multiple scales, it can be thought of as a "pyramid of filters" (Figure 1(b)). The second way is usually adopted jointly with the first way [8].

As a comparison, our anchor-based method is built on a pyramid of anchors, which is more cost-efficient. Our method classifies and regresses bounding boxes with reference to anchor boxes of multiple scales and aspect ratios. It only relies on images and feature maps of a single scale, and uses filters (sliding windows on the feature map) of a single size. We show by experiments the effects of this scheme for addressing multiple scales and sizes (Table 8).

Because of this multi-scale design based on anchors, we can simply use the convolutional features computed on a single-scale image, as is also done by the Fast R-CNN detector [2]. The design of multi-scale anchors is a key component for sharing features without extra cost for addressing scales.

3.1.2 Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
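This labeling rule can be written down directly. The following NumPy sketch (illustrative names, written for this summary rather than taken from the released code) assigns +1/0/−1 labels to anchors given ground-truth boxes, using the 0.7/0.3 thresholds and the per-ground-truth argmax condition.

```python
import numpy as np

def iou_matrix(anchors, gt_boxes):
    """Pairwise IoU between anchors (N,4) and ground-truth boxes (M,4), both (x1,y1,x2,y2)."""
    ix1 = np.maximum(anchors[:, None, 0], gt_boxes[None, :, 0])
    iy1 = np.maximum(anchors[:, None, 1], gt_boxes[None, :, 1])
    ix2 = np.minimum(anchors[:, None, 2], gt_boxes[None, :, 2])
    iy2 = np.minimum(anchors[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_a[:, None] + area_g[None, :] - inter)

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return labels: 1 = positive, 0 = negative, -1 = ignored (does not contribute to the loss)."""
    iou = iou_matrix(anchors, gt_boxes)      # (N, M)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)
    labels[max_iou < neg_thresh] = 0         # negative: low overlap with every ground truth
    labels[max_iou >= pos_thresh] = 1        # condition (ii): IoU above 0.7 with some ground truth
    labels[iou.argmax(axis=0)] = 1           # condition (i): the best anchor for each ground truth
    return labels
```

Anchors left at −1 are simply excluded from the objective; during training, 256 of the labeled anchors per image are then sampled with a ratio of up to 1:1 as described in Section 3.1.3 below.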


With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:

L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*).    (1)

Here, i is the index of an anchor in a mini-batch and p_i is the predicted probability of anchor i being an object. The ground-truth label p_i^* is 1 if the anchor is positive, and is 0 if the anchor is negative. t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i^* is that of the ground-truth box associated with a positive anchor. The classification loss L_cls is log loss over two classes (object vs. not object). For the regression loss, we use L_reg(t_i, t_i^*) = R(t_i − t_i^*) where R is the robust loss function (smooth L1) defined in [2]. The term p_i^* L_reg means the regression loss is activated only for positive anchors (p_i^* = 1) and is disabled otherwise (p_i^* = 0). The outputs of the cls and reg layers consist of {p_i} and {t_i} respectively.

The two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ. In our current implementation (as in the released code), the cls term in Eqn. (1) is normalized by the mini-batch size (i.e., N_cls = 256) and the reg term is normalized by the number of anchor locations (i.e., N_reg ∼ 2,400). By default we set λ = 10, and thus both cls and reg terms are roughly equally weighted. We show by experiments that the results are insensitive to the values of λ in a wide range (Table 9). We also note that the normalization as above is not required and could be simplified.

For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:

t_x = (x - x_a)/w_a,    t_y = (y - y_a)/h_a,    t_w = \log(w/w_a),    t_h = \log(h/h_a),
t_x^* = (x^* - x_a)/w_a,    t_y^* = (y^* - y_a)/h_a,    t_w^* = \log(w^*/w_a),    t_h^* = \log(h^*/h_a),    (2)

where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, x_a, and x^* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

Nevertheless, our method achieves bounding-box regression by a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale, thanks to the design of anchors.

3.1.3 Training RPNs

The RPN can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [35]. We follow the "image-centric" sampling strategy from [2] to train this network. Each mini-batch arises from a single image that contains many positive and negative example anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared convolutional layers) are initialized by pre-training a model for ImageNet classification [36], as is standard practice [5]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [2]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL VOC dataset. We use a momentum of 0.9 and a weight decay of 0.0005 [37]. Our implementation uses Caffe [38].

3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [2]. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).

Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:

(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.


Table 1: The learned average proposal size for each anchor using the ZF net (numbers for s = 600).
anchor:   128², 2:1 | 128², 1:1 | 128², 1:2 | 256², 2:1 | 256², 1:1 | 256², 1:2 | 512², 2:1 | 512², 1:1 | 512², 1:2
proposal: 188×111   | 113×114   | 70×92     | 416×229   | 261×284   | 174×332   | 768×437   | 499×501   | 355×715

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes' coordinates that are also network responses, so is approximate. In our experiments, we have empirically found this solver produces close results (mAP 70.0% compared with 69.9% of alternating training reported in Table 3), yet reduces the training time by about 25-50% comparing with alternating training. This solver is included in our released Python code.

(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an "RoI warping" layer as developed in [15], which is beyond the scope of this paper.

4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.

Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals.
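Putting the pieces of this subsection together, test-time proposal generation amounts to decoding the reg outputs against the anchors (inverting the parameterization of Equation (2)), clipping boxes that cross the image boundary, and running NMS. The NumPy sketch below is written for this summary under those assumptions, with illustrative names; the released MATLAB/Python code is the reference implementation.

```python
import numpy as np

def decode(anchors, deltas):
    """Invert Eq. (2): map predicted (tx, ty, tw, th) for each anchor to (x1, y1, x2, y2)."""
    wa = anchors[:, 2] - anchors[:, 0]
    ha = anchors[:, 3] - anchors[:, 1]
    xa = anchors[:, 0] + 0.5 * wa
    ya = anchors[:, 1] + 0.5 * ha
    x = deltas[:, 0] * wa + xa
    y = deltas[:, 1] * ha + ya
    w = np.exp(deltas[:, 2]) * wa
    h = np.exp(deltas[:, 3]) * ha
    return np.stack([x - 0.5 * w, y - 0.5 * h, x + 0.5 * w, y + 0.5 * h], axis=1)

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression; returns indices of the kept boxes."""
    order = scores.argsort()[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        ix1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        iy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        ix2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        iy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]
    return np.array(keep)

def proposals(anchors, deltas, scores, im_w, im_h, top_n=300):
    """Decode, clip cross-boundary boxes to the image, apply NMS at 0.7, keep the top-N."""
    boxes = decode(anchors, deltas)
    boxes[:, 0::2] = boxes[:, 0::2].clip(0, im_w - 1)   # clip x coordinates
    boxes[:, 1::2] = boxes[:, 1::2].clip(0, im_h - 1)   # clip y coordinates
    keep = nms(boxes, scores, iou_thresh=0.7)[:top_n]
    return boxes[keep], scores[keep]
```

The ablation experiments below also evaluate variants of this pipeline, for example dropping NMS or the cls ranking at test-time.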


Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are
Fast R-CNN with ZF, but using various proposal methods for training and testing.
train-time proposals: method, # boxes | test-time proposals: method, # proposals | mAP (%)
SS 2000 SS 2000 58.7
EB 2000 EB 2000 58.6
RPN+ZF, shared 2000 RPN+ZF, shared 300 59.9
ablation experiments follow below
RPN+ZF, unshared 2000 RPN+ZF, unshared 300 58.7
SS 2000 RPN+ZF 100 55.1
SS 2000 RPN+ZF 300 56.8
SS 2000 RPN+ZF 1000 56.3
SS 2000 RPN+ZF (no NMS) 6000 55.2
SS 2000 RPN+ZF (no cls) 100 44.6
SS 2000 RPN+ZF (no cls) 300 51.4
SS 2000 RPN+ZF (no cls) 1000 55.8
SS 2000 RPN+ZF (no reg) 300 52.1
SS 2000 RPN+ZF (no reg) 1000 51.3
SS 2000 RPN+VGG 300 59.2

After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.

4 EXPERIMENTS

4.1 Experiments on PASCAL VOC

We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [11]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results on the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the "fast" version of ZF net [32] that has 5 convolutional layers and 3 fully-connected layers, and the public VGG-16 model⁸ [3] that has 13 convolutional layers and 3 fully-connected layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).

⁸ www.robots.ox.ac.uk/~vgg/research/very_deep/

Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the "fast" mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals.⁹ Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers' cost (Table 5).

⁹ For RPN, the number of proposals (e.g., 300) is the maximum number for an image. RPN may produce fewer proposals after NMS, and thus the average number of proposals is smaller.

Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 2). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.

Next, we disentangle the RPN's influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2000 SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.

Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.

Next, we separately investigate the roles of RPN's cls and reg outputs by turning off either of them at test-time.


Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07+12”: union set of VOC 2007 trainval and VOC 2012 trainval. For RPN,
the train-time proposals for Fast R-CNN are 2000. † : this number was reported in [2]; using the repository
provided by this paper, this result is higher (68.1).
method # proposals data mAP (%)
SS 2000 07 66.9†
SS 2000 07+12 70.0
RPN+VGG, unshared 300 07 68.5
RPN+VGG, shared 300 07 69.9
RPN+VGG, shared 300 07+12 73.2
RPN+VGG, shared 300 COCO+07+12 78.8

Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07++12”: union set of VOC 2007 trainval+test and VOC 2012 trainval. For
RPN, the train-time proposals for Fast R-CNN are 2000. †: http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ‡:
http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. §: http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html.
method # proposals data mAP (%)
SS 2000 12 65.7
SS 2000 07++12 68.4
RPN+VGG, shared† 300 12 67.0
RPN+VGG, shared‡ 300 07++12 70.4
RPN+VGG, shared§ 300 COCO+07++12 75.9

Table 5: Timing (ms) on a K40 GPU, except SS proposal is evaluated in a CPU. “Region-wise” includes NMS,
pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.
model system conv proposal region-wise total rate
VGG SS + Fast R-CNN 146 1510 174 1830 0.5 fps
VGG RPN + Fast R-CNN 141 10 47 198 5 fps
ZF RPN + Fast R-CNN 31 3 25 59 17 fps

When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1000 (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.

On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.

We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%—better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.

In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals (300 per image). Our system has a frame-rate of 17 fps with the ZF net.


Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000. RPN∗ denotes the unsharing feature version.
method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

SS 2000 07 66.9 74.5 78.3 69.2 53.2 36.6 77.3 78.2 82.0 40.7 72.7 67.9 79.6 79.2 73.0 69.0 30.1 65.4 70.2 75.8 65.8
SS 2000 07+12 70.0 77.0 78.1 69.3 59.4 38.3 81.6 78.6 86.7 42.8 78.8 68.9 84.7 82.0 76.6 69.9 31.8 70.1 74.8 80.4 70.4
RPN∗ 300 07 68.5 74.1 77.2 67.7 53.9 51.0 75.1 79.2 78.9 50.7 78.0 61.1 79.1 81.9 72.2 75.9 37.2 71.4 62.5 77.4 66.4
RPN 300 07 69.9 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6
RPN 300 07+12 73.2 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
RPN 300 COCO+07+12 78.8 84.3 82.0 77.7 68.9 65.7 88.1 88.4 88.9 63.6 86.3 70.8 85.9 87.6 80.1 82.3 53.6 80.4 75.8 86.6 78.9

Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000.
method # box data mAP areo bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv

SS 2000 12 65.7 80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8 79.8 69.8 32.1 65.5 63.8 76.4 61.7
SS 2000 07++12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
RPN 300 12 67.0 82.3 76.4 71.0 48.4 45.2 72.1 72.3 87.3 42.2 73.7 50.0 86.8 78.7 78.4 77.4 34.5 70.1 57.1 77.1 58.9
RPN 300 07++12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
RPN 300 COCO+07++12 75.9 87.4 83.6 76.8 62.9 59.6 81.9 82.0 91.3 54.9 82.6 59.0 89.0 85.5 84.7 84.1 52.2 78.9 65.5 85.4 70.2

Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.
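As a companion to Figure 4, proposal recall at a given IoU threshold can be computed as the fraction of ground-truth boxes covered by at least one of the top-N proposals. The NumPy sketch below is written for this summary with illustrative names; it is not the evaluation script used by the authors.

```python
import numpy as np

def recall_at_iou(proposals_per_image, gts_per_image, iou_thresh):
    """Fraction of ground-truth boxes matched by at least one proposal with IoU >= threshold.
    Both inputs are lists of (K, 4) arrays in (x1, y1, x2, y2) form, one entry per image."""
    matched, total = 0, 0
    for props, gts in zip(proposals_per_image, gts_per_image):
        total += len(gts)
        if len(props) == 0 or len(gts) == 0:
            continue
        ix1 = np.maximum(props[:, None, 0], gts[None, :, 0])
        iy1 = np.maximum(props[:, None, 1], gts[None, :, 1])
        ix2 = np.minimum(props[:, None, 2], gts[None, :, 2])
        iy2 = np.minimum(props[:, None, 3], gts[None, :, 3])
        inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
        area_p = (props[:, 2] - props[:, 0]) * (props[:, 3] - props[:, 1])
        area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
        iou = inter / (area_p[:, None] + area_g[None, :] - inter)
        matched += int((iou.max(axis=0) >= iou_thresh).sum())
    return matched / max(total, 1)

# Sweeping iou_thresh from 0.5 to 1.0 for the top-300/1000/2000 proposals of each
# method reproduces the kind of curves plotted in Figure 4.
```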

Table 11: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL
VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.
system | proposals | # proposals | detector | mAP (%)
Two-Stage RPN + ZF, unshared 300 Fast R-CNN + ZF, 1 scale 58.7
One-Stage dense, 3 scales, 3 aspect ratios 20000 Fast R-CNN + ZF, 1 scale 53.8
One-Stage dense, 3 scales, 3 aspect ratios 20000 Fast R-CNN + ZF, 5 scales 53.9

Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios (69.9%) is the same as that in Table 3.
settings            | anchor scales        | aspect ratios     | mAP (%)
1 scale, 1 ratio    | 128²                 | 1:1               | 65.8
1 scale, 1 ratio    | 256²                 | 1:1               | 66.7
1 scale, 3 ratios   | 128²                 | {2:1, 1:1, 1:2}   | 68.8
1 scale, 3 ratios   | 256²                 | {2:1, 1:1, 1:2}   | 67.9
3 scales, 1 ratio   | {128², 256², 512²}   | 1:1               | 69.8
3 scales, 3 ratios  | {128², 256², 512²}   | {2:1, 1:1, 1:2}   | 69.9

Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of λ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using λ = 10 (69.9%) is the same as that in Table 3.
λ        | 0.1  | 1    | 10   | 100
mAP (%)  | 67.2 | 68.9 | 69.9 | 69.1

Table 10: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different numbers of proposals in testing. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 300 proposals is the same as that in Table 3.
# proposals | 50   | 100  | 150  | 200  | 300  | 500  | 1000
mAP (%)     | 66.3 | 68.9 | 69.5 | 69.8 | 69.9 | 69.8 | 69.8

Sensitivities to Hyper-parameters. In Table 8 we investigate the settings of anchors. By default we use 3 scales and 3 aspect ratios (69.9% mAP in Table 8). If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.


650 two terms in Equation (1) roughly equally weighted evaluate using convolutional features extracted from 707

651 after normalization. Table 9 shows that our result is 5 scales. We use those 5 scales as in [1], [2]. 708

652 impacted just marginally (by ∼ 1%) when λ is within Table 11 compares the two-stage system and two 709

653 a scale of about two orders of magnitude (1 to 100). variants of the one-stage system. Using the ZF model, 710

654 This demonstrates that the result is insensitive to λ in the one-stage system has an mAP of 53.9%. This is 711

655 a wide range. lower than the two-stage system (58.7%) by 4.8%. 712

656 In Table 10 we investigate the numbers of proposals This experiment justifies the effectiveness of cascaded 713

657 in testing. region proposals and object detection. Similar obser- 714

vations are reported in [2], [39], where replacing SS 715


658 Analysis of Recall-to-IoU. Next we compute the
region proposals with sliding windows leads to ∼6% 716
659 recall of proposals at different IoU ratios with ground-
degradation in both papers. We also note that the one- 717
660 truth boxes. It is noteworthy that the Recall-to-IoU
stage system is slower as it has considerably more 718
661 metric is just loosely [19], [20], [21] related to the
proposals to process. 719
662 ultimate detection accuracy. It is more appropriate to
663 use this metric to diagnose the proposal method than
664 to evaluate it. 4.2 Experiments on MS COCO 720

665 In Figure 4, we show the results of using 300, 1000, We present more results on the Microsoft COCO 721
666 and 2000 proposals. We compare with SS, EB and object detection dataset [12]. This dataset involves 80 722
667 MCG, and the N proposals are the top-N ranked ones object categories. We experiment with the 80k images 723
668 based on the confidence generated by these meth- on the training set, 40k images on the validation set, 724
669 ods. The plots show that the RPN method behaves and 20k images on the test-dev set. We evaluate the 725
670 gracefully when the number of proposals drops from mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO’s 726
671 2000 to 300. This explains why the RPN has a good standard metric, simply denoted as mAP@[.5, .95]) 727
672 ultimate detection mAP when using as few as 300 and [email protected] (PASCAL VOC’s metric). 728
673 proposals. As we analyzed before, this property is
There are a few minor changes of our system made 729
674 mainly attributed to the cls term of the RPN. The recall
for this dataset. We train our models on an 8-GPU 730
675 of SS, EB and MCG drops more quickly than RPN
implementation, and the effective mini-batch size be- 731
676 when the proposals are fewer.
comes 8 for RPN (1 per GPU) and 16 for Fast R-CNN 732

One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN; the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.

To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the “proposals” are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from five scales. The resulting one-stage mAP is lower than that of the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [2], [39], where replacing SS region proposals with sliding windows leads to ∼6% degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.
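For illustration, the sketch below enumerates such dense sliding-window “proposals”; the feature stride of 16 and the function name are assumptions made for this example rather than details taken from the OverFeat or Fast R-CNN implementations:

import numpy as np

def dense_windows(image_height, image_width, stride=16,
                  scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    # One window per scale/aspect-ratio pair, centered at every
    # stride-spaced location; boxes are returned as (x1, y1, x2, y2).
    boxes = []
    for cy in np.arange(stride // 2, image_height, stride):
        for cx in np.arange(stride // 2, image_width, stride):
            for s in scales:
                for r in ratios:                     # r = width / height
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    boxes.append((cx - w / 2, cy - h / 2,
                                  cx + w / 2, cy + h / 2))
    return np.array(boxes)

# A single 600x1000 image already yields roughly 20k windows at one pyramid
# level, which is why the one-stage system has far more "proposals" to process.
print(dense_windows(600, 1000).shape)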

4.2 Experiments on MS COCO

We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO’s standard metric, simply denoted as mAP@[.5, .95]) and [email protected] (PASCAL VOC’s metric).
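For reference, the relation between the two metrics can be written as a one-line average; the sketch below is our own illustration and assumes the per-threshold AP values have already been computed by a PASCAL-style AP routine:

import numpy as np

def coco_map(ap_per_iou):
    # mAP@[.5, .95]: the mean of the (class-averaged) AP measured at the ten
    # IoU thresholds 0.50, 0.55, ..., 0.95; `ap_per_iou` maps each threshold
    # to the AP at that threshold (the input format is our own choice).
    thresholds = [round(float(t), 2) for t in np.arange(0.50, 1.00, 0.05)]
    return float(np.mean([ap_per_iou[t] for t in thresholds]))

# [email protected], the PASCAL VOC metric reported alongside it, is simply ap_per_iou[0.5].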

There are a few minor changes to our system for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and the Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding 64²), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0, 0.5), instead of [0.1, 0.5) as used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. The Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0, 0.1) are never visited. Including these [0, 0.1) samples improves [email protected] on the COCO dataset for both the Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).
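For readability, the COCO-specific settings listed above are collected in the configuration sketch below; the key names are our own and do not correspond to any particular codebase:

# A summary of the COCO-specific settings described in the text; the key
# names are illustrative and do not mirror any released configuration file.
coco_config = {
    "num_gpus": 8,
    "rpn": {
        "images_per_gpu": 1,                 # effective mini-batch size 8
        "lr_schedule": [(240000, 0.003), (80000, 0.0003)],
    },
    "fast_rcnn": {
        "images_per_gpu": 2,                 # effective mini-batch size 16
        "lr_schedule": [(240000, 0.003), (80000, 0.0003)],
        "bg_iou_range": (0.0, 0.5),          # negatives: max IoU in [0, 0.5)
    },
    "anchors": {
        "scales": [64, 128, 256, 512],       # 64^2 added for small objects
        "aspect_ratios": [0.5, 1.0, 2.0],
    },
    "test": {
        "num_proposals": 300,
        "scale": 600,                        # single-scale testing, s = 600
    },
}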

The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale (s = 600) testing. The testing time is still about 200ms per image on the COCO dataset.

In Table 12 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% [email protected] on the test-dev set, higher than that reported in [2].


Table 12: Object detection results (%) on the MS COCO dataset. The model is VGG-16.

                                                                COCO val                  COCO test-dev
method                            proposals   training data   [email protected]   mAP@[.5,.95]    [email protected]   mAP@[.5,.95]
Fast R-CNN [2]                    SS, 2000    COCO train         -           -            35.9        19.7
Fast R-CNN [impl. in this paper]  SS, 2000    COCO train        38.6        18.9          39.3        19.3
Faster R-CNN                      RPN, 300    COCO train        41.5        21.2          42.1        21.5
Faster R-CNN                      RPN, 300    COCO trainval      -           -            42.7        21.9

We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.

Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% [email protected] and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for [email protected] and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 12). This indicates that the RPN is excellent for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% [email protected] and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.

Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. By simply replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from 41.5%/21.2% (VGG-16) to 48.4%/27.2% (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of 55.7%/34.9% and an ensemble result of 59.0%/37.4% on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by an absolute 8.5%. RPN is also a building block of the 1st-place winning entries in the ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.

4.3 From MS COCO to PASCAL VOC

Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.

As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive to COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is 76.1% on the PASCAL VOC 2007 test set (Table 13). This result is better than that trained on VOC07+12 (73.2%) by a good margin, even though the PASCAL VOC data are not exploited.
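A minimal sketch of the test-time category handling for this baseline is given below; the COCO-to-VOC name mapping and the assumed layout of the classifier outputs (background logit first, then the 80 COCO categories) are our own illustration rather than details of the released code:

import numpy as np

# The 20 PASCAL VOC categories and their MS COCO counterparts; a few names
# differ between the two vocabularies.
VOC_TO_COCO = {
    "aeroplane": "airplane", "bicycle": "bicycle", "bird": "bird",
    "boat": "boat", "bottle": "bottle", "bus": "bus", "car": "car",
    "cat": "cat", "chair": "chair", "cow": "cow",
    "diningtable": "dining table", "dog": "dog", "horse": "horse",
    "motorbike": "motorcycle", "person": "person",
    "pottedplant": "potted plant", "sheep": "sheep", "sofa": "couch",
    "train": "train", "tvmonitor": "tv",
}

def restrict_to_voc(logits, coco_class_names):
    # Re-run the softmax over background plus the 20 VOC categories only.
    # `logits` has shape (num_rois, 1 + 80): the background logit first,
    # then one logit per COCO category in the order of `coco_class_names`.
    keep = [0] + [1 + coco_class_names.index(VOC_TO_COCO[v])
                  for v in sorted(VOC_TO_COCO)]
    sub = logits[:, keep]
    sub = sub - sub.max(axis=1, keepdims=True)           # numerical stability
    probs = np.exp(sub) / np.exp(sub).sum(axis=1, keepdims=True)
    # Columns of `probs`: background, then the VOC categories in alphabetical order.
    return probs, keep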

Then we fine-tune the COCO detection model on the VOC dataset. In this experiment, the COCO model is used in place of the ImageNet-pre-trained model (which is used to initialize the network weights), and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to 78.8% mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by 5.6%. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. This improvement mainly results from fewer false alarms on background (Figure 7). Similar improvements are observed on the PASCAL VOC 2012 test set (Table 13 and Table 7). We note that the test-time speed of obtaining these state-of-the-art results is still about 200ms per image.

Table 13: Detection mAP (%) of Faster R-CNN on the PASCAL VOC 2007 test set and 2012 test set using different training data. The model is VGG-16. “COCO” denotes that the COCO trainval set is used for training. See also Table 6 and Table 7.

training data      2007 test   2012 test
VOC07                69.9        67.0
VOC07+12             73.2         -
VOC07++12             -          70.4
COCO (no VOC)        76.1        73.0
COCO+VOC07+12        78.8         -
COCO+VOC07++12        -          75.9

Figure 7: Error analyses on models trained with and without MS COCO data. The test set is PASCAL VOC 2007 test. The distribution of top-ranked Cor (correct), Loc (false due to poor localization), Sim (confusion with a similar category), Oth (confusion with a dissimilar category), and BG (fired on background) detections is shown, as generated by the published diagnosis code of [40]. VOC07+12: Cor 77.1%, Loc 8.1%, Sim 2.0%, Oth 1.3%, BG 11.6%. COCO+VOC07+12: Cor 83.3%, Loc 7.1%, Sim 1.7%, Oth 1.3%, BG 6.7%.



Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster
R-CNN system. The model is VGG-16 and the training data is 07+12 trainval (73.2% mAP on the 2007 test
set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated
with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images.
The running time for obtaining these results is 198ms per image, including all steps.

5 CONCLUSION

We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at 5-17 fps. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.



Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN
system. The model is VGG-16 and the training data is COCO trainval (42.7% [email protected] on the test-dev set).
Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is
used to display these images. For each image, one color represents one object category in that image.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision (ECCV), 2014.
[2] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
[4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Neural Information Processing Systems (NIPS), 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), 2014.


[13] S. Song and J. Xiao, “Deep sliding shapes for amodal 3d object detection in rgb-d images,” arXiv:1511.02300, 2015.
[14] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based model for object detection and semantic part localization,” arXiv:1511.07131, 2015.
[15] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.
[16] J. Johnson, A. Karpathy, and L. Fei-Fei, “Densecap: Fully convolutional localization networks for dense captioning,” arXiv:1511.07571, 2015.
[17] D. Kislyuk, Y. Liu, D. Liu, E. Tzeng, and Y. Jing, “Human curation and convnets: Powering item-to-item recommendations on pinterest,” arXiv:1511.04003, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
[19] J. Hosang, R. Benenson, and B. Schiele, “How good are detection proposals, really?” in British Machine Vision Conference (BMVC), 2014.
[20] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes for effective detection proposals?” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2015.
[21] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, “Object-Proposal Evaluation Protocol is ‘Gameable’,” arXiv:1505.05836, 2015.
[22] J. Carreira and C. Sminchisescu, “CPMC: Automatic object segmentation using constrained parametric min-cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[23] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems (NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[27] C. Szegedy, S. Reed, D. Erhan, and D. Anguelov, “Scalable, high-quality object detection,” arXiv:1412.1441 (v1), 2015.
[28] P. O. Pinheiro, R. Collobert, and P. Dollar, “Learning to segment object candidates,” in Neural Information Processing Systems (NIPS), 2015.
[29] J. Dai, K. He, and J. Sun, “Convolutional feature masking for joint object and stuff segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[30] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” arXiv:1504.06066, 2015.
[31] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” in Neural Information Processing Systems (NIPS), 2015.
[32] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional neural networks,” in European Conference on Computer Vision (ECCV), 2014.
[33] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in International Conference on Machine Learning (ICML), 2010.
[34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[35] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, 1989.
[36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), 2015.
[37] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[39] K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine Vision Conference (BMVC), 2015.
[40] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in European Conference on Computer Vision (ECCV), 2012.

Shaoqing Ren received the BS degree from the University of Science and Technology of China in 2011. He is currently a PhD student in a joint PhD program between the University of Science and Technology of China and Microsoft Research Asia. His research interests are in computer vision, especially in detection and localization of general objects and faces.

Kaiming He is a lead researcher at Microsoft Research Asia. He received the BS degree from Tsinghua University in 2007, and the PhD degree from the Chinese University of Hong Kong in 2011. He joined Microsoft Research Asia in 2011. His current research interests are deep learning for visual recognition, including image classification, object detection, and semantic segmentation. He has won the Best Paper Award at CVPR 2009.

Ross Girshick is a Research Scientist at Facebook AI Research. He holds a PhD and MS in computer science, both from the University of Chicago, where he studied under the supervision of Pedro Felzenszwalb. Prior to joining Facebook AI Research, Ross was a Researcher at Microsoft Research and a Postdoctoral Fellow at the University of California, Berkeley, where he collaborated with Jitendra Malik and Trevor Darrell. During the course of the PASCAL VOC object detection challenge, Ross participated in multiple winning object detection entries and was awarded a “lifetime achievement” prize for his work on the widely used Deformable Part Models.

Jian Sun is a principal researcher at Microsoft Research Asia. He got the BS degree, MS degree and PhD degree from Xi’an Jiaotong University in 1997, 2000 and 2003. He joined Microsoft Research Asia in July, 2003. His current major research interests are computer vision, computational photography, and deep learning. He has won the Best Paper Award at CVPR 2009.