Abstract—State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations.
Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region
proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image
convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional
network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to
generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN
into a single network by sharing their convolutional features—using the recently popular terminology of neural networks with
‘attention’ mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3],
our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection
accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO
2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been
made publicly available.
1 INTRODUCTION

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

In this paper, we show that an algorithmic change—computing proposals with a deep convolutional neural network—leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network's computation.
To this end, we introduce Region Proposal Networks (RPNs) that share convolutional layers with the detection network. An RPN is a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios.

• S. Ren is with the University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research. Email: [email protected]
• K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com
• R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research. E-mail: [email protected]
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps
are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on
the feature map. (c) We use pyramids of reference boxes in the regression functions.
Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with convolutional features that are shared between both tasks.1

We also report results on the MS COCO dataset [12] and investigate the improvements on PASCAL VOC using the COCO data. Code has been made publicly available at https://github.com/ShaoqingRen/faster_rcnn (in MATLAB) and https://github.com/rbgirshick/py-faster-rcnn (in Python).

A preliminary version of this manuscript was published previously [10]. Since then, the frameworks of RPN and Faster R-CNN have been adopted and generalized to other methods, such as 3D object detection [13], part-based detection [14], instance segmentation [15], and image captioning [16]. Our fast and effective object detection system has also been built into commercial systems such as at Pinterest [17], with user engagement improvements reported.

In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the basis of several 1st-place entries [18] in the tracks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. RPNs completely learn to propose regions from data, and thus can easily benefit from deeper and more expressive features (such as the 101-layer residual nets adopted in [18]). Faster R-CNN and RPN are also used by several other leading entries in these competitions.2 These results suggest that our method is not only a cost-efficient solution for practical usage, but also an effective way of improving object detection accuracy.

2 RELATED WORK

Deep Networks for Object Detection. The R-CNN method [5] trains CNNs end-to-end to classify the proposal regions into object categories or background. R-CNN mainly plays as a classifier, and it does not predict object bounds (except for refining by bounding box regression). Its accuracy depends on the performance of the region proposal module (see comparisons in [20]). Several papers have proposed ways of using deep networks for predicting object bounding boxes [25], [9], [26], [27]. In the OverFeat method [9], a fully-connected layer is trained to predict the box coordinates for the localization task that assumes a single object. The fully-connected layer is then turned into a convolutional layer for detecting multiple class-specific objects. The MultiBox methods generate region proposals from a network whose last fully-connected layer simultaneously predicts multiple class-agnostic boxes.

1. Since the publication of the conference version of this paper [10], we have also found that RPNs can be trained jointly with Fast R-CNN networks, leading to less training time.
2. http://image-net.org/challenges/LSVRC/2015/results
Figure 2: Faster R-CNN is a single, unified network for object detection. The RPN module serves as the ‘attention’ of this unified network.

3 FASTER R-CNN

Our object detection system, called Faster R-CNN, is composed of two modules. The first module is a deep fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [2] that uses the proposed regions. The entire system is a single, unified network for object detection (Figure 2). Using the terminology of neural networks with ‘attention’ mechanisms, the RPN module tells the Fast R-CNN module where to look. In Section 3.1 we introduce the designs and properties of the network for region proposal. In Section 3.2 we develop algorithms for training both modules with shared features.

3.1 Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully convolutional network [7], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [2], we assume that both nets share a common set of convolutional layers.

To generate region proposals, we slide a small network over the convolutional feature map output by the last shared convolutional layer. This small network takes as input an n × n spatial window of the input convolutional feature map. At each sliding-window location we simultaneously predict multiple region proposals, which are parameterized relative to k reference boxes, which we call anchors. An anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio (Figure 3, left). By default we use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a convolutional feature map of a size W × H (typically ∼2,400), there are WHk anchors in total.
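For concreteness, the small sliding-window head described above can be sketched as follows. This is a minimal PyTorch-style illustration (the released code is Caffe-based, so the module below is only an assumption about how the n × n window, the intermediate 256-d/512-d layer, and the 2k-score / 4k-coordinate sibling outputs fit together; it is not the authors' implementation).

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: an n x n (here 3 x 3) conv slides over the shared
    feature map, followed by two sibling 1 x 1 convs that emit 2k objectness
    scores and 4k box-regression outputs at every spatial position."""

    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        # 3x3 "intermediate" layer (256-d for the ZF net, 512-d for VGG-16).
        self.conv = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        # Sibling 1x1 layers: cls (2 scores per anchor) and reg (4 offsets per anchor).
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)

    def forward(self, feat):                  # feat: (N, C, H, W) shared feature map
        x = self.relu(self.conv(feat))
        return self.cls(x), self.reg(x)       # (N, 2k, H, W), (N, 4k, H, W)
```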
Figure 3: Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL
VOC 2007 test. Our method detects objects in a wide range of scales and aspect ratios.
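The anchor grid itself is straightforward to enumerate. The NumPy sketch below is illustrative rather than the released implementation; it assumes the default 3 scales (box areas 128², 256², 512²) and 3 aspect ratios, and the total stride of 16 pixels used by the ZF and VGG nets (Section 3.3), which for a typical 1000 × 600 image gives roughly 60 × 40 × 9 ≈ 20,000 anchors.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16,
                     areas=(128**2, 256**2, 512**2),
                     ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * k, 4) anchors as (x1, y1, x2, y2) boxes."""
    # k = 9 base anchors centered at the origin, one per (area, ratio) pair.
    base = []
    for area in areas:
        for ratio in ratios:                 # ratio = height / width
            w = np.sqrt(area / ratio)
            h = w * ratio
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                    # (k, 4)

    # Shift the base anchors to every sliding-window position on the feature map.
    shift_x = (np.arange(feat_w) + 0.5) * stride
    shift_y = (np.arange(feat_h) + 0.5) * stride
    sx, sy = np.meshgrid(shift_x, shift_y)
    shifts = np.stack([sx, sy, sx, sy], axis=-1).reshape(-1, 1, 4)   # (H*W, 1, 4)
    return (base[None, :, :] + shifts).reshape(-1, 4)

anchors = generate_anchors(600 // 16, 1000 // 16)    # ~1000x600 image, stride 16
print(anchors.shape)                                  # (20646, 4), i.e. roughly 20000
```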
Multi-Scale Anchors as Regression References. Our design of anchors presents a novel scheme for addressing multiple scales (and aspect ratios). As shown in Figure 1, there have been two popular ways for multi-scale predictions. The first way is based on image/feature pyramids, e.g., in DPM [8] and CNN-based methods [9], [1], [2]. The images are resized at multiple scales, and feature maps are computed for each scale (Figure 1, a).

3.1.2 Loss Function

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. Usually the second condition is sufficient to determine the positive samples; but we still adopt the first condition for the reason that in some rare cases the second condition may find no positive sample.
We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.

With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [2]. Our loss function for an image is defined as:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*).     (1)

Here, i is the index of an anchor in a mini-batch, p_i is the predicted probability of anchor i being an object, and p_i* is the ground-truth label (1 if the anchor is positive, 0 if it is negative). t_i is a vector representing the 4 parameterized coordinates of the predicted bounding box, and t_i* is that of the ground-truth box associated with a positive anchor. The regression term is activated only for positive anchors (p_i* = 1), and the two terms are normalized by N_cls and N_reg and weighted by a balancing parameter λ.

For bounding box regression, we adopt the parameterizations of the 4 coordinates following [5]:

t_x = (x − x_a)/w_a,    t_y = (y − y_a)/h_a,
t_w = log(w/w_a),       t_h = log(h/h_a),
t_x* = (x* − x_a)/w_a,  t_y* = (y* − y_a)/h_a,
t_w* = log(w*/w_a),     t_h* = log(h*/h_a),     (2)

where x, y, w, and h denote the box's center coordinates and its width and height. Variables x, x_a, and x* are for the predicted box, anchor box, and ground-truth box respectively (likewise for y, w, h). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.

5. As is the case of FCNs [7], our network is translation invariant up to the network's total stride.
6. 4 is the dimension of the reg term for each category, and 1 or 2 is the dimension of the cls term (sigmoid or softmax) for each category.
7. Considering the feature projection layers, our proposal layers' parameter count is 3 × 3 × 512 × 512 + 512 × 6 × 9 = 2.4 × 10^6; MultiBox's proposal layers' parameter count is 7 × 7 × (64 + 96 + 64 + 64) × 1536 + 1536 × 5 × 800 = 27 × 10^6.
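To make the labeling rule and the parameterization above concrete, the following NumPy sketch (an illustration under the stated thresholds, not the released implementation) assigns labels to anchors and computes the regression targets t*; at training time a sampled mini-batch of labeled anchors would then feed the two terms of Equation (1). Note that `regression_targets` expects the ground-truth boxes already gathered per anchor, e.g. `gt[matched]` using the index returned by `label_anchors`.

```python
import numpy as np

def iou_matrix(anchors, gt):
    """Pairwise IoU between (A, 4) anchors and (G, 4) ground-truth boxes."""
    ax1, ay1, ax2, ay2 = np.split(anchors, 4, axis=1)          # each (A, 1)
    gx1, gy1, gx2, gy2 = gt[:, 0], gt[:, 1], gt[:, 2], gt[:, 3]
    iw = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    ih = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = iw * ih                                            # (A, G)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    return inter / (area_a + area_g - inter)

def label_anchors(anchors, gt, pos_thresh=0.7, neg_thresh=0.3):
    """Labels: 1 = positive, 0 = negative, -1 = ignored (excluded from the loss)."""
    iou = iou_matrix(anchors, gt)
    max_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -1)
    labels[max_iou < neg_thresh] = 0          # negatives: IoU < 0.3 for all gt boxes
    labels[max_iou >= pos_thresh] = 1         # condition (ii): IoU > 0.7 with some gt box
    labels[iou.argmax(axis=0)] = 1            # condition (i): best anchor per gt box
    return labels, iou.argmax(axis=1)         # labels and matched gt index per anchor

def regression_targets(anchors, matched_gt):
    """t* of Equation (2), one row per anchor, w.r.t. its matched ground-truth box."""
    aw, ah = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
    ax, ay = anchors[:, 0] + 0.5 * aw, anchors[:, 1] + 0.5 * ah
    gw, gh = matched_gt[:, 2] - matched_gt[:, 0], matched_gt[:, 3] - matched_gt[:, 1]
    gx, gy = matched_gt[:, 0] + 0.5 * gw, matched_gt[:, 1] + 0.5 * gh
    return np.stack([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)], axis=1)
```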
Nevertheless, our method achieves bounding-box regression in a different manner from previous RoI-based (Region of Interest) methods [1], [2]. In [1], [2], bounding-box regression is performed on features pooled from arbitrarily sized RoIs, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (3 × 3) on the feature maps. To account for varying sizes, a set of k bounding-box regressors are learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights.

3.2 Sharing Features for RPN and Fast R-CNN

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. Next we describe algorithms that learn a unified network composed of RPN and Fast R-CNN with shared convolutional layers (Figure 2).

Both RPN and Fast R-CNN, trained independently, will modify their convolutional layers in different ways. We therefore need to develop a technique that allows for sharing convolutional layers between the two networks, rather than learning two separate networks. We discuss three ways for training networks with features shared:

(i) Alternating training. In this solution, we first train RPN, and use the proposals to train Fast R-CNN. The network tuned by Fast R-CNN is then used to
Table 1: The learned average proposal size for each anchor using the ZF net (numbers for s = 600).
anchor    128², 2:1   128², 1:1   128², 1:2   256², 2:1   256², 1:1   256², 1:2   512², 2:1   512², 1:1   512², 1:2
proposal  188×111     113×114     70×92       416×229     261×284     174×332     768×437     499×501     355×715
initialize RPN, and this process is iterated. This is the solution that is used in all experiments in this paper.

(ii) Approximate joint training. In this solution, the RPN and Fast R-CNN networks are merged into one network during training as in Figure 2. In each SGD iteration, the forward pass generates region proposals which are treated just like fixed, pre-computed proposals when training a Fast R-CNN detector. The backward propagation takes place as usual, where for the shared layers the backward propagated signals from both the RPN loss and the Fast R-CNN loss are combined. This solution is easy to implement. But this solution ignores the derivative w.r.t. the proposal boxes' coordinates, which are also network responses, so it is approximate. In our experiments, we have empirically found this solver produces close results (mAP 70.0% compared with 69.9% of alternating training reported in Table 3), yet reduces the training time by about 25-50% compared with alternating training. This solver is included in our released Python code.

(iii) Non-approximate joint training. As discussed above, the bounding boxes predicted by RPN are also functions of the input. The RoI pooling layer [2] in Fast R-CNN accepts the convolutional features and also the predicted bounding boxes as input, so a theoretically valid backpropagation solver should also involve gradients w.r.t. the box coordinates. These gradients are ignored in the above approximate joint training. In a non-approximate joint training solution, we need an RoI pooling layer that is differentiable w.r.t. the box coordinates. This is a nontrivial problem and a solution can be given by an “RoI warping” layer as developed in [15], which is beyond the scope of this paper.

4-Step Alternating Training. In this paper, we adopt a pragmatic 4-step training algorithm to learn shared features via alternating optimization. In the first step, we train the RPN as described in Section 3.1.3. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share convolutional layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared convolutional layers and only fine-tune the layers unique to RPN. Now the two networks share convolutional layers. Finally, keeping the shared convolutional layers fixed, we fine-tune the unique layers of Fast R-CNN. As such, both networks share the same convolutional layers and form a unified network. A similar alternating training can be run for more iterations, but we have observed negligible improvements.

3.3 Implementation Details

We train and test both region proposal and object detection networks on images of a single scale [1], [2]. We re-scale the images such that their shorter side is s = 600 pixels [2]. Multi-scale feature extraction (using an image pyramid) may improve accuracy but does not exhibit a good speed-accuracy trade-off [2]. On the re-scaled images, the total stride for both ZF and VGG nets on the last convolutional layer is 16 pixels, and thus is ∼10 pixels on a typical PASCAL image before resizing (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. These hyper-parameters are not carefully chosen for a particular dataset, and we provide ablation experiments on their effects in the next section. As discussed, our solution does not need an image pyramid or filter pyramid to predict regions of multiple scales, saving considerable running time. Figure 3 (right) shows the capability of our method for a wide range of scales and aspect ratios. Table 1 shows the learned average proposal size for each anchor using the ZF net. We note that our algorithm allows predictions that are larger than the underlying receptive field. Such predictions are not impossible—one may still roughly infer the extent of an object if only the middle of the object is visible.

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20000 (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6000 anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult-to-correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.

Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt non-maximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2000 proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2000 RPN proposals, but evaluate different numbers of proposals at test-time.
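The 4-step alternating schedule above can be summarized in a few lines of pseudo-Python. All of the helpers below (`train_rpn`, `train_fast_rcnn`, `generate_proposals`, the `freeze_shared_conv` flag) are hypothetical stand-ins, not functions of the released code; the sketch only mirrors the initialization and freezing pattern described in this section.

```python
def four_step_training(train_rpn, train_fast_rcnn, generate_proposals, images):
    """Sketch of the 4-step alternating optimization with hypothetical helpers."""
    # Step 1: RPN initialized from an ImageNet-pre-trained model and
    # fine-tuned end-to-end for the region proposal task.
    rpn = train_rpn(init="imagenet", freeze_shared_conv=False)

    # Step 2: a separate Fast R-CNN detector, also ImageNet-initialized,
    # trained on the step-1 proposals (no shared conv layers yet).
    detector = train_fast_rcnn(init="imagenet",
                               proposals=generate_proposals(rpn, images))

    # Step 3: re-initialize the RPN from the detector, freeze the shared conv
    # layers, and fine-tune only the RPN-specific layers (convs are now shared).
    rpn = train_rpn(init=detector, freeze_shared_conv=True)

    # Step 4: keep the shared conv layers fixed and fine-tune only the
    # Fast R-CNN-specific layers on the step-3 proposals.
    detector = train_fast_rcnn(init=detector, freeze_shared_conv=True,
                               proposals=generate_proposals(rpn, images))
    return rpn, detector
```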
Table 2: Detection results on PASCAL VOC 2007 test set (trained on VOC 2007 trainval). The detectors are
Fast R-CNN with ZF, but using various proposal methods for training and testing.
train-time region proposals test-time region proposals
method # boxes method # proposals mAP (%)
SS 2000 SS 2000 58.7
EB 2000 EB 2000 58.6
RPN+ZF, shared 2000 RPN+ZF, shared 300 59.9
ablation experiments follow below
RPN+ZF, unshared 2000 RPN+ZF, unshared 300 58.7
SS 2000 RPN+ZF 100 55.1
SS 2000 RPN+ZF 300 56.8
SS 2000 RPN+ZF 1000 56.3
SS 2000 RPN+ZF (no NMS) 6000 55.2
SS 2000 RPN+ZF (no cls) 100 44.6
SS 2000 RPN+ZF (no cls) 300 51.4
SS 2000 RPN+ZF (no cls) 1000 55.8
SS 2000 RPN+ZF (no reg) 300 52.1
SS 2000 RPN+ZF (no reg) 1000 51.3
SS 2000 RPN+VGG 300 59.2
4 EXPERIMENTS

Table 2 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [4], we generate about 2000 proposals by the “fast” mode. For EdgeBoxes (EB) [6], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6% under the Fast R-CNN framework. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals.9 Using RPN yields a much faster detection system than using either SS or EB because of shared convolutional computations; the fewer proposals also reduce the region-wise fully-connected layers' cost (Table 5).

Ablation Experiments on RPN. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing convolutional layers between the RPN and Fast R-CNN detection network. In one ablation, we train a Fast R-CNN detector on SS proposals and replace SS with 300 unshared RPN+ZF proposals only at test-time (Table 2); this leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6000 RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms. Next, we separately investigate the roles of RPN's cls and reg outputs by turning off either of them at test-time.

9. For RPN, the number of proposals (e.g., 300) is the maximum for an image. RPN may produce fewer proposals after NMS, and thus the average number of proposals is smaller.
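The proposal post-processing used throughout these experiments (clip cross-boundary boxes to the image, run NMS at an IoU threshold of 0.7 on the cls scores, then keep the top-N proposals) can be written in plain NumPy as below; this is an illustrative sketch, not the released implementation.

```python
import numpy as np

def clip_boxes(boxes, img_w, img_h):
    """Clip (x1, y1, x2, y2) proposal boxes to the image boundary."""
    boxes[:, 0::2] = np.clip(boxes[:, 0::2], 0, img_w - 1)
    boxes[:, 1::2] = np.clip(boxes[:, 1::2], 0, img_h - 1)
    return boxes

def nms(boxes, scores, iou_thresh=0.7, top_n=300):
    """Greedy non-maximum suppression on proposals ranked by cls score."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]               # highest score first
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]          # drop highly overlapping proposals
    return keep                                   # indices of the kept top-N proposals
```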
Table 3: Detection results on PASCAL VOC 2007 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07+12”: union set of VOC 2007 trainval and VOC 2012 trainval. For RPN,
the train-time proposals for Fast R-CNN are 2000. † : this number was reported in [2]; using the repository
provided by this paper, this result is higher (68.1).
method # proposals data mAP (%)
SS 2000 07 66.9†
SS 2000 07+12 70.0
RPN+VGG, unshared 300 07 68.5
RPN+VGG, shared 300 07 69.9
RPN+VGG, shared 300 07+12 73.2
RPN+VGG, shared 300 COCO+07+12 78.8
Table 4: Detection results on PASCAL VOC 2012 test set. The detector is Fast R-CNN and VGG-16. Training
data: “07”: VOC 2007 trainval, “07++12”: union set of VOC 2007 trainval+test and VOC 2012 trainval. For
RPN, the train-time proposals for Fast R-CNN are 2000. † : http://host.robots.ox.ac.uk:8080/anonymous/HZJTQA.html. ‡ :
http://host.robots.ox.ac.uk:8080/anonymous/YNPLXB.html. § : http://host.robots.ox.ac.uk:8080/anonymous/XEDH10.html.
method # proposals data mAP (%)
SS 2000 12 65.7
SS 2000 07++12 68.4
RPN+VGG, shared† 300 12 67.0
RPN+VGG, shared‡ 300 07++12 70.4
RPN+VGG, shared§ 300 COCO+07++12 75.9
Table 5: Timing (ms) on a K40 GPU, except that SS proposals are computed on a CPU. “Region-wise” includes NMS,
pooling, fully-connected, and softmax layers. See our released code for the profiling of running time.
model system conv proposal region-wise total rate
VGG SS + Fast R-CNN 146 1510 174 1830 0.5 fps
VGG RPN + Fast R-CNN 141 10 47 198 5 fps
ZF RPN + Fast R-CNN 31 3 25 59 17 fps
When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1000 (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.

On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to the regressed box bounds. The anchor boxes, though having multiple scales and aspect ratios, are not sufficient for accurate detection.

We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

Performance of VGG-16. Table 3 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%—better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval. The mAP is 73.2%. Figure 5 shows some results on the PASCAL VOC 2007 test set. On the PASCAL VOC 2012 test set (Table 4), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval. Table 6 and Table 7 show the detailed numbers.

In Table 5 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average about 1.5s), and Fast R-CNN with VGG-16 takes 320ms on 2000 SS proposals (or 223ms if using SVD on fully-connected layers [2]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the convolutional features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also lower, thanks to fewer proposals.
Table 6: Results on PASCAL VOC 2007 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000. RPN∗ denotes the version without feature sharing.
method # box data mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
SS 2000 07 66.9 74.5 78.3 69.2 53.2 36.6 77.3 78.2 82.0 40.7 72.7 67.9 79.6 79.2 73.0 69.0 30.1 65.4 70.2 75.8 65.8
SS 2000 07+12 70.0 77.0 78.1 69.3 59.4 38.3 81.6 78.6 86.7 42.8 78.8 68.9 84.7 82.0 76.6 69.9 31.8 70.1 74.8 80.4 70.4
RPN∗ 300 07 68.5 74.1 77.2 67.7 53.9 51.0 75.1 79.2 78.9 50.7 78.0 61.1 79.1 81.9 72.2 75.9 37.2 71.4 62.5 77.4 66.4
RPN 300 07 69.9 70.0 80.6 70.1 57.3 49.9 78.2 80.4 82.0 52.2 75.3 67.2 80.3 79.8 75.0 76.3 39.1 68.3 67.3 81.1 67.6
RPN 300 07+12 73.2 76.5 79.0 70.9 65.5 52.1 83.1 84.7 86.4 52.0 81.9 65.7 84.8 84.6 77.5 76.7 38.8 73.6 73.9 83.0 72.6
RPN 300 COCO+07+12 78.8 84.3 82.0 77.7 68.9 65.7 88.1 88.4 88.9 63.6 86.3 70.8 85.9 87.6 80.1 82.3 53.6 80.4 75.8 86.6 78.9
Table 7: Results on PASCAL VOC 2012 test set with Fast R-CNN detectors and VGG-16. For RPN, the train-time
proposals for Fast R-CNN are 2000.
method # box data mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
SS 2000 12 65.7 80.3 74.7 66.9 46.9 37.7 73.9 68.6 87.7 41.7 71.1 51.1 86.0 77.8 79.8 69.8 32.1 65.5 63.8 76.4 61.7
SS 2000 07++12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
RPN 300 12 67.0 82.3 76.4 71.0 48.4 45.2 72.1 72.3 87.3 42.2 73.7 50.0 86.8 78.7 78.4 77.4 34.5 70.1 57.1 77.1 58.9
RPN 300 07++12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
RPN 300 COCO+07++12 75.9 87.4 83.6 76.8 62.9 59.6 81.9 82.0 91.3 54.9 82.6 59.0 89.0 85.5 84.7 84.1 52.2 78.9 65.5 85.4 70.2
Figure 4: Recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set.
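Proposal recall at a given IoU threshold, as plotted in Figure 4, can be computed along the lines of the following NumPy sketch (an illustration only; `iou_matrix` is assumed to be a pairwise-IoU routine such as the one shown earlier, and the proposal lists are the top-N boxes of each method).

```python
import numpy as np

def recall_at_iou(proposals_per_image, gt_per_image, iou_thresh, iou_matrix):
    """Fraction of ground-truth boxes covered by at least one proposal with
    IoU >= iou_thresh, pooled over all images."""
    covered, total = 0, 0
    for props, gt in zip(proposals_per_image, gt_per_image):
        if len(gt) == 0:
            continue
        total += len(gt)
        if len(props) == 0:
            continue
        best_iou = iou_matrix(np.asarray(props), np.asarray(gt)).max(axis=0)
        covered += int((best_iou >= iou_thresh).sum())
    return covered / total

# Sweeping iou_thresh from 0.5 to 1.0 for the top-300/1000/2000 proposals of each
# method yields recall-vs-IoU curves in the spirit of Figure 4.
```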
Table 11: One-Stage Detection vs. Two-Stage Proposal + Detection. Detection results are on the PASCAL
VOC 2007 test set using the ZF model and Fast R-CNN. RPN uses unshared features.
system | proposals | # proposals | detector | mAP (%)
Two-Stage RPN + ZF, unshared 300 Fast R-CNN + ZF, 1 scale 58.7
One-Stage dense, 3 scales, 3 aspect ratios 20000 Fast R-CNN + ZF, 1 scale 53.8
One-Stage dense, 3 scales, 3 aspect ratios 20000 Fast R-CNN + ZF, 5 scales 53.9
Table 8: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different settings of anchors. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 3 scales and 3 aspect ratios (69.9%) is the same as that in Table 3.
settings | anchor scales | aspect ratios | mAP (%)
1 scale, 1 ratio | 128² | 1:1 | 65.8
1 scale, 1 ratio | 256² | 1:1 | 66.7
1 scale, 3 ratios | 128² | {2:1, 1:1, 1:2} | 68.8
1 scale, 3 ratios | 256² | {2:1, 1:1, 1:2} | 67.9
3 scales, 1 ratio | {128², 256², 512²} | 1:1 | 69.8
3 scales, 3 ratios | {128², 256², 512²} | {2:1, 1:1, 1:2} | 69.9

Table 9: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different values of λ in Equation (1). The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using λ = 10 (69.9%) is the same as that in Table 3.
λ | 0.1 | 1 | 10 | 100
mAP (%) | 67.2 | 68.9 | 69.9 | 69.1

Table 10: Detection results of Faster R-CNN on PASCAL VOC 2007 test set using different numbers of proposals in testing. The network is VGG-16. The training data is VOC 2007 trainval. The default setting of using 300 proposals is the same as that in Table 3.
# proposals | 50 | 100 | 150 | 200 | 300 | 500 | 1000
mAP (%) | 66.3 | 68.9 | 69.5 | 69.8 | 69.9 | 69.8 | 69.8
In Table 8 we investigate the settings of anchors. If using just one anchor at each position, the mAP drops by a considerable margin of 3-4%. The mAP is higher if using 3 scales (with 1 aspect ratio) or 3 aspect ratios (with 1 scale), demonstrating that using anchors of multiple sizes as the regression references is an effective solution. Using just 3 scales with 1 aspect ratio (69.8%) is as good as using 3 scales with 3 aspect ratios on this dataset, suggesting that scales and aspect ratios are not disentangled dimensions for the detection accuracy. But we still adopt these two dimensions in our designs to keep our system flexible.

In Table 9 we compare different values of λ in Equation (1). By default we use λ = 10, which makes the
two terms in Equation (1) roughly equally weighted after normalization. Table 9 shows that our result is impacted just marginally (by ∼1%) when λ is within a scale of about two orders of magnitude (1 to 100). This demonstrates that the result is insensitive to λ in a wide range.

In Table 10 we investigate the numbers of proposals in testing.

In Figure 4, we show the results of using 300, 1000, and 2000 proposals. We compare with SS, EB and MCG, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2000 to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS, EB and MCG drops more quickly than RPN when the proposals are fewer.

One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [9] proposes a detection method that uses regressors and classifiers on sliding windows over convolutional feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of Faster R-CNN—the downstream Fast R-CNN detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [1], [2] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.

To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the “proposals” are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system adopts an image pyramid, we also evaluate using convolutional features extracted from 5 scales. We use those 5 scales as in [1], [2].

Table 11 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations have been reported in prior work.

We present more results on the Microsoft COCO object detection dataset [12]. This dataset involves 80 object categories. We experiment with the 80k images on the training set, 40k images on the validation set, and 20k images on the test-dev set. We evaluate the mAP averaged for IoU ∈ [0.5 : 0.05 : 0.95] (COCO's standard metric, simply denoted as mAP@[.5, .95]) and [email protected] (PASCAL VOC's metric).

There are a few minor changes of our system made for this dataset. We train our models on an 8-GPU implementation, and the effective mini-batch size becomes 8 for RPN (1 per GPU) and 16 for Fast R-CNN (2 per GPU). The RPN step and Fast R-CNN step are both trained for 240k iterations with a learning rate of 0.003 and then for 80k iterations with 0.0003. We modify the learning rates (starting with 0.003 instead of 0.001) because the mini-batch size is changed. For the anchors, we use 3 aspect ratios and 4 scales (adding 64²), mainly motivated by handling small objects on this dataset. In addition, in our Fast R-CNN step, the negative samples are defined as those with a maximum IoU with ground truth in the interval of [0, 0.5), instead of [0.1, 0.5) used in [1], [2]. We note that in the SPPnet system [1], the negative samples in [0.1, 0.5) are used for network fine-tuning, but the negative samples in [0, 0.5) are still visited in the SVM step with hard-negative mining. But the Fast R-CNN system [2] abandons the SVM step, so the negative samples in [0, 0.1) are never visited. Including these [0, 0.1) samples improves [email protected] on the COCO dataset for both Fast R-CNN and Faster R-CNN systems (but the impact is negligible on PASCAL VOC).

The rest of the implementation details are the same as on PASCAL VOC. In particular, we keep using 300 proposals and single-scale (s = 600) testing. The testing time is still about 200ms per image on the COCO dataset.

In Table 12 we first report the results of the Fast R-CNN system [2] using the implementation in this paper. Our Fast R-CNN baseline has 39.3% [email protected] on the test-dev set, higher than that reported in [2]. We conjecture that the reason for this gap is mainly due to the definition of the negative samples and also the changes of the mini-batch sizes. We also note that the mAP@[.5, .95] is just comparable.
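COCO's standard metric averages AP over the IoU thresholds 0.50, 0.55, ..., 0.95. A minimal sketch of that averaging is given below; it assumes an `average_precision(detections, ground_truth, iou_thresh)` routine is already available, and that helper is an assumption of this illustration, not part of the released code.

```python
import numpy as np

def coco_style_map(detections, ground_truth, average_precision):
    """mAP@[.5, .95]: mean AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = np.arange(0.5, 1.0, 0.05)       # 10 thresholds
    aps = [average_precision(detections, ground_truth, t) for t in thresholds]
    return float(np.mean(aps))

# [email protected] (the PASCAL VOC-style metric) is simply average_precision(..., 0.5).
```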
Table 12: Object detection results (%) on the MS COCO dataset. The model is VGG-16.
COCO val COCO test-dev
method proposals training data [email protected] mAP@[.5, .95] [email protected] mAP@[.5, .95]
Fast R-CNN [2] SS, 2000 COCO train - - 35.9 19.7
Fast R-CNN [impl. in this paper] SS, 2000 COCO train 38.6 18.9 39.3 19.3
Faster R-CNN RPN, 300 COCO train 41.5 21.2 42.1 21.5
Faster R-CNN RPN, 300 COCO trainval - - 42.7 21.9
Table 13: Detection mAP (%) of Faster R-CNN on PASCAL VOC 2007 test set and 2012 test set using different training data. The model is VGG-16. “COCO” denotes that the COCO trainval set is used for training. See also Table 6 and Table 7.
training data | 2007 test | 2012 test
VOC07 | 69.9 | 67.0
VOC07+12 | 73.2 | -
VOC07++12 | - | 70.4
COCO (no VOC) | 76.1 | 73.0
COCO+VOC07+12 | 78.8 | -
COCO+VOC07++12 | - | 75.9

Figure 7: Error analyses on models trained with and without MS COCO data. The test set is PASCAL VOC 2007 test. Distribution of top-ranked Cor (correct), Loc (false due to poor localization), Sim (confusion with a similar category), Oth (confusion with a dissimilar category), BG (fired on background) is shown, which is generated by the published diagnosis code of [40]. VOC07+12: Cor 77.1%, Loc 8.1%, Sim 2.0%, Oth 1.3%, BG 11.6%. COCO+VOC07+12: Cor 83.3%, Loc 7.1%, Sim 1.7%, Oth 1.3%, BG 6.7%.

Next we evaluate our Faster R-CNN system. Using the COCO training set to train, Faster R-CNN has 42.1% [email protected] and 21.5% mAP@[.5, .95] on the COCO test-dev set. This is 2.8% higher for [email protected] and 2.2% higher for mAP@[.5, .95] than the Fast R-CNN counterpart under the same protocol (Table 12). This indicates that RPN performs excellently for improving the localization accuracy at higher IoU thresholds. Using the COCO trainval set to train, Faster R-CNN has 42.7% [email protected] and 21.9% mAP@[.5, .95] on the COCO test-dev set. Figure 6 shows some results on the MS COCO test-dev set.

Faster R-CNN in ILSVRC & COCO 2015 competitions. We have demonstrated that Faster R-CNN benefits more from better features, thanks to the fact that the RPN completely learns to propose regions by neural networks. This observation is still valid even when one increases the depth substantially to over 100 layers [18]. Only by replacing VGG-16 with a 101-layer residual net (ResNet-101) [18], the Faster R-CNN system increases the mAP from 41.5%/21.2% (VGG-16) to 48.4%/27.2% (ResNet-101) on the COCO val set. With other improvements orthogonal to Faster R-CNN, He et al. [18] obtained a single-model result of 55.7%/34.9% and an ensemble result of 59.0%/37.4% on the COCO test-dev set, which won the 1st place in the COCO 2015 object detection competition. The same system [18] also won the 1st place in the ILSVRC 2015 object detection competition, surpassing the second place by absolute 8.5%. RPN is also a building block of the 1st-place winning entries in the ILSVRC 2015 localization and COCO 2015 segmentation competitions, for which the details are available in [18] and [15] respectively.

Large-scale data is of crucial importance for improving deep neural networks. Next, we investigate how the MS COCO dataset can help with the detection performance on PASCAL VOC.

As a simple baseline, we directly evaluate the COCO detection model on the PASCAL VOC dataset, without fine-tuning on any PASCAL VOC data. This evaluation is possible because the categories on COCO are a superset of those on PASCAL VOC. The categories that are exclusive on COCO are ignored in this experiment, and the softmax layer is performed only on the 20 categories plus background. The mAP under this setting is 76.1% on the PASCAL VOC 2007 test set (Table 13). This result is better than that trained on VOC07+12 (73.2%) by a good margin, even though the PASCAL VOC data are not exploited.

Then we fine-tune the COCO detection model on the VOC dataset, using the COCO model as the initialization of the network weights, and the Faster R-CNN system is fine-tuned as described in Section 3.2. Doing so leads to 78.8% mAP on the PASCAL VOC 2007 test set. The extra data from the COCO set increases the mAP by 5.6%. Table 6 shows that the model trained on COCO+VOC has the best AP for every individual category on PASCAL VOC 2007. This improvement mainly results from fewer false alarms on background (Figure 7). Similar improvements are observed on the PASCAL VOC 2012 test set (Table 13 and Table 7). We note that the test-time speed of obtaining these state-of-the-art results is still about 200ms per image.
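The direct-evaluation baseline above only requires restricting the classifier outputs to VOC's categories. A small NumPy sketch of that step follows; the index list is hypothetical and the real category mapping comes from the COCO and VOC label sets, so this is an illustration rather than the authors' code.

```python
import numpy as np

def voc_scores_from_coco_logits(coco_logits, voc_category_indices, background_index=0):
    """Softmax over background plus the 20 VOC categories only, ignoring the
    COCO-exclusive categories, as in the direct-evaluation baseline."""
    keep = [background_index] + list(voc_category_indices)   # 21 classes in total
    logits = coco_logits[..., keep]
    logits = logits - logits.max(axis=-1, keepdims=True)     # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)              # (..., 21) class scores
```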
Figure 5: Selected examples of object detection results on the PASCAL VOC 2007 test set using the Faster
R-CNN system. The model is VGG-16 and the training data is 07+12 trainval (73.2% mAP on the 2007 test
set). Our method detects objects of a wide range of scales and aspect ratios. Each output box is associated
with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is used to display these images.
The running time for obtaining these results is 198ms per image, including all steps.
5 CONCLUSION

We have presented RPNs for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at 5-17 fps. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

REFERENCES

[1] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision (ECCV), 2014.
[2] R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision (ICCV), 2015.
[3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
Figure 6: Selected examples of object detection results on the MS COCO test-dev set using the Faster R-CNN
system. The model is VGG-16 and the training data is COCO trainval (42.7% [email protected] on the test-dev set).
Each output box is associated with a category label and a softmax score in [0, 1]. A score threshold of 0.6 is
used to display these images. For each image, one color represents one object category in that image.
[4] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision (IJCV), 2013.
[5] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[6] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in European Conference on Computer Vision (ECCV), 2014.
[7] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2010.
[9] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in International Conference on Learning Representations (ICLR), 2014.
[10] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Neural Information Processing Systems (NIPS), 2015.
[11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results,” 2007.
[12] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” in European Conference on Computer Vision (ECCV), 2014.
[13] S. Song and J. Xiao, “Deep sliding shapes for amodal 3D object detection in RGB-D images,” arXiv:1511.02300, 2015.
[14] J. Zhu, X. Chen, and A. L. Yuille, “DeePM: A deep part-based model for object detection and semantic part localization,” arXiv:1511.07131, 2015.
[15] J. Dai, K. He, and J. Sun, “Instance-aware semantic segmentation via multi-task network cascades,” arXiv:1512.04412, 2015.
P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and J. Malik, “Multiscale combinatorial grouping,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[24] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2012.
[25] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural networks for object detection,” in Neural Information Processing Systems (NIPS), 2013.
[26] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scalable object detection using deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[38] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.
[39] K. Lenc and A. Vedaldi, “R-CNN minus R,” in British Machine Vision Conference (BMVC), 2015.
[40] D. Hoiem, Y. Chodpathumwan, and Q. Dai, “Diagnosing error in object detectors,” in European Conference on Computer Vision (ECCV), 2012.

Kaiming He is a researcher at Microsoft Research Asia. He received the BS degree from Tsinghua University in 2007, and the PhD degree from the Chinese University of Hong Kong in 2011. He joined Microsoft Research Asia in 2011. His current research interests are deep learning for visual recognition, including image classification and object detection. He has won the Best Paper Award at CVPR 2009.