Light-Weight RetinaNet For Object Detection
Abstract
Object detection has made great progress driven by the development of deep learning. Compared with the widely studied classification task, object detection generally needs one to two orders of magnitude more FLOPs (floating-point operations) at inference time. To enable practical applications, it is essential to explore an effective runtime-accuracy trade-off scheme. Recently, a growing number of studies target object detection on resource-constrained devices, such as YOLOv1, YOLOv2, SSD and MobileNetv2-SSDLite [11, 14, 16], whose detection accuracy on COCO test-dev [8] falls around 22-25% mAP (mAP-20-tier). In contrast, very few studies discuss the computation-accuracy trade-off for mAP-30-tier detection networks. In this paper, we explain why RetinaNet gives an effective computation-accuracy trade-off for object detection and how to build a light-weight RetinaNet. We propose to reduce FLOPs only in the computationally intensive layers and keep the other layers unchanged. Compared with the most common approach for the FLOPs-accuracy trade-off, input image scaling, the proposed solution shows a consistently better FLOPs-mAP trade-off line. Quantitatively, the proposed method yields a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction.
1 Introduction
Object detection plays an important role in computer vision-based tasks [10, 16, 17]. It is the key module in face detection, object tracking, video surveillance, pedestrian detection, etc. [13, 19]. The recent development of deep learning has boosted the performance of object detection tasks. However, regarding computational complexity (in terms of FLOPs), a detection network can consume up to three orders of magnitude more FLOPs than a classification network, which makes it much more difficult to move towards low-latency inference.
Recently, a growing number of studies have investigated detection on resource-constrained devices, such as mobile platforms. As the main concern of resource-constrained devices is memory consumption, existing solutions such as YOLOv1, YOLOv2, SSD and MobileNetv2-SSDLite [11, 14, 16] have pushed hard to reduce memory consumption by trading off accuracy. Their detection accuracy on the large-scale COCO test-dev 2017 dataset [8] is around 22-25% mAP. Here, we use mAP as the indicator to categorize these solutions as mAP-20-tier. On the other side, in the mAP-30-tier, popular solutions include Faster R-CNN, RetinaNet, YOLOv3 [10, 15, 17] and their variants. As these solutions are commonly deployed on mid- or high-end GPUs or FPGAs, the memory resource is usually sufficient for preloading the weights. In addition, [7] also verifies the linear relation between the number of FLOPs and the inference runtime for the same kind of network. When Faster R-CNN, RetinaNet and YOLOv3 [10, 15, 17] are applied to the COCO detection task with input images of around 600x600 to 800x800, the mAP falls in the range of 33%-36%. However, the FLOPs count of Faster R-CNN [17] is around 850 GFLOPs (gigaFLOPs), which is at least 5x more than that of RetinaNet and YOLOv3 [15]. Clearly, Faster R-CNN is not competitive in computational efficiency. Interestingly, from YOLOv2 [14] to YOLOv3 [15], the authors aggressively increased the FLOPs count from 30 to 140 GFLOPs to gain an mAP improvement from 21% to 33%. Even so, its mAP is still 2.5% lower than that of RetinaNet at 150 GFLOPs. This observation inspires us to take RetinaNet as the baseline and explore a more light-weight version. (The source code is available at https://github.com/PSCLab-ASU/LW-RetinaNet.)
There are two common methods to reduce the FLOPs of a detection network. One is to switch to another backbone; the other is to reduce the input image size. The first results in a noticeable accuracy drop when switching from one of the ResNet backbones [5] to another, so it is typically not a good accuracy-FLOPs trade-off scheme for small adjustments. Reducing the input image size is an intuitive way to cut FLOPs, but the accuracy-FLOPs trade-off line degrades in an exponential trend [7]. There is an opportunity to find a more linear degradation trend for a better accuracy-FLOPs trade-off. We propose to replace only certain branches/layers of the detection network with light-weight architectures and keep the rest of the network unchanged. For RetinaNet, the heaviest branch is the set of layers succeeding the finest FPN level (P3 in Fig. 2), which takes up 48% of the total FLOPs. We propose different light-weight architecture variants for it. More importantly, the proposed method can also be applied to other blockwise-FLOPs-imbalanced detection networks.
The contributions of this paper can be summarized as follows: (1) We propose to reduce only the heaviest bottleneck layer for a light-weight RetinaNet with a better mAP-FLOPs trade-off. (2) The proposed solution shows a consistently better mAP-FLOPs trade-off line with a linear degradation trend, while the input image scaling method degrades in a more exponential trend. (3) Quantitatively, the proposed method yields a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction.
2 Related Works
2.1 High-end object detection networks (mAP-30-tier)
Faster R-CNN [17] is an advanced architecture that improves both the accuracy and runtime performance over R-CNN and Fast R-CNN [1, 2].
The main body of Faster R-CNN [17] is composed of three parts: the Feature Network, the Region Proposal Network (RPN) and the Detection Network. As Faster R-CNN [17] replaces the selective search used by Fast R-CNN with the RPN, it significantly reduces the runtime of generating region proposals. However, in the Faster R-CNN inference stage, around 256-1000 boxes are still fed into the detection network, and processing this large batch of data is expensive. For a Faster R-CNN processing the COCO detection task with an Inception-ResNetV2 [18] backbone, the total FLOPs count can come up to around 850 GFLOPs.
[Figure 1: mAP versus GFLOPs for representative detection networks on COCO test-dev (RetinaNet-400/500/600/700/800, YOLOv3-320/416/608, Faster R-CNN with InceptionResNetV2, SSD, SSDLite-MobileNet, SSDLite-MobileNetv2, YOLOv2-416), grouped into the mAP-20-tier and the mAP-30-tier.]
Compared with Faster R-CNN [17], RetinaNet [10] targets a simpler design for gaining speedup. A feature pyramid network (FPN) [9] is attached to its backbone to generate multi-scale pyramid features. Then, the pyramid features go into classification and regression branches, whose weights can be shared across different levels of the FPN. The focal loss is applied to compensate for the accuracy drop, which makes its accuracy comparable with that of Faster R-CNN.
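For concreteness, the sketch below shows the binary focal loss used in RetinaNet's classification branch. The default values alpha = 0.25 and gamma = 2 are the commonly reported setting and are an assumption here, not a detail stated in this paper.

    # Minimal sketch of the focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
        """logits/targets share the same shape; targets are 1.0 for positive anchors, 0.0 otherwise."""
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")  # -log(p_t)
        p_t = p * targets + (1.0 - p) * (1.0 - targets)          # probability of the true class
        alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
        # RetinaNet typically normalizes this sum by the number of positive anchors (not shown here).
        return (alpha_t * (1.0 - p_t) ** gamma * ce).sum()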
[Figure 2: RetinaNet architecture overview: the input image is processed by the ResNet backbone blocks (Res1 at 1/2 scale, Res2 at 1/4 scale, ...), which feed the FPN and the detection backends.]
Therefore, we take RetinaNet as the baseline design to explore a better accuracy-FLOPs trade-off scheme for high-end detection tasks.
3 Light-weight RetinaNet
In Section 2, we have explained why the RetinaNet architecture has the potential to be tailored for a better accuracy-FLOPs trade-off. In this section, we first analyze the RetinaNet network with a focus on the distribution of floating-point operations (FLOPs) across different layers in Section 3.1. Then, Section 3.2 illustrates the scheme that helps RetinaNet lose weight.
Figure 3: The FLOPs and memory (parameter) distribution of RetinaNet across different blocks.
The classification and bounding box branches do not share weights with each other, while the weights of each branch are shared across the pyramid features (P3-P7).
The FLOPs distribution of RetinaNet across different blocks is shown in Fig. 3. Here, each block corresponds to the same block in Fig. 2. The detection backends D3-D7 refer to the layers succeeding P3-P7, respectively. As D3-D7 share the same weight parameters in the original design, we only show the average memory cost of D3-D7 in Fig. 3. The FLOPs count of the D3 block clearly dominates the total FLOPs count. This unbalanced FLOPs distribution is quite different from that of the ResNet architecture, which shows only small variation across blocks. The unbalanced FLOPs distribution gives us the chance to obtain a good overall FLOPs reduction by only reducing the cost of the heaviest layer. Quantitatively, for example, if we can reduce the FLOPs of D3 by half, the total FLOPs can be reduced by 24%. In the following subsections, we discuss the main insights of how to obtain a tiny back-end.
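As a quick sanity check of the 24% figure, the following back-of-the-envelope calculation uses the D3 share read off Fig. 3 (about 48% of the total FLOPs); the values are simply the rounded numbers quoted above.

    # Overall FLOPs reduction when only the D3 backend is lightened.
    d3_share = 0.48     # D3's share of the total FLOPs (from Fig. 3)
    d3_kept = 0.5       # fraction of D3's FLOPs that remain after halving it

    overall_reduction = d3_share * (1.0 - d3_kept)              # 0.48 * 0.5 = 0.24
    print(f"overall FLOPs reduction: {overall_reduction:.0%}")  # -> 24%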
Intuitively, we can reduce the filter size to obtain a FLOPs reduction. As shown in Fig. 4, we propose different block designs for the detection branches of RetinaNet. D-block-v1 applies the MobileNet [6] building block: a 3x3 depth-wise (dw) convolution followed by a 1x1 convolution substitutes each original layer. D-block-v2 alternately places 1x1 and 3x3 kernels; this design is inspired by YOLOv1 [16], which interleaves 1x1 and 3x3 kernels without introducing residual blocks. In our design, we make it even simpler by keeping the number of filters fixed across layers. D-block-v3 is more aggressive and replaces all the 3x3 convolutions with 1x1 convolutions.
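The following PyTorch-style sketches illustrate the three variants, assuming the standard RetinaNet head layout of four 256-channel layers with ReLU activations; the exact depth, width, and activation are assumptions where the text does not fix them.

    # Minimal sketches of the three light-weight D-block variants.
    import torch.nn as nn

    def d_block_v1(ch=256, depth=4):
        """v1: MobileNet-style - each 3x3 conv becomes a 3x3 depth-wise conv + 1x1 point-wise conv."""
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depth-wise 3x3
                       nn.Conv2d(ch, ch, 1),                        # point-wise 1x1
                       nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    def d_block_v2(ch=256, depth=4):
        """v2: alternate 1x1 and 3x3 kernels, keeping the filter count fixed."""
        layers = []
        for i in range(depth):
            k = 1 if i % 2 == 0 else 3
            layers += [nn.Conv2d(ch, ch, k, padding=k // 2), nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    def d_block_v3(ch=256, depth=4):
        """v3: the most aggressive variant - every 3x3 conv is replaced by a 1x1 conv."""
        layers = []
        for _ in range(depth):
            layers += [nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)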
Naturally, the light-weight blocks cause a certain accuracy drop in exchange for lower computation cost. Therefore, we introduce a limited overhead to compensate for the accuracy drop, as stated in the next subsection.
[Figure 5: The detection backend. (a) The original fully shared scheme: a stack of 3x3 convolutions shared across D3-D7, producing W×H×KA classification outputs and W×H×4A bounding box outputs for each pyramid level P3-P7. (b) The proposed partially shared scheme: D4-D7 keep the original shared weights, while D3 uses the light-weight D-block-v1/2/3.]
As illustrated in Section 3.2.1, the light-weight detection blocks trade an accuracy drop for lower computational complexity. To compensate for the accuracy drop, we propose to replace the fully shared weight scheme of the original RetinaNet with a partially shared weight scheme. As shown in Fig. 5, P3-P7 are the multi-scale feature map outputs of the FPN, which feed into the detection backends D3-D7, respectively. Although D3-D7 share the weight parameters, they have unique input sizes (P3-P7) and are processed serially. Fig. 5(a) shows the original scheme, in which D3-D7 fully share the weights. In Fig. 5(b), only D4-D7 share the weights of the original configuration, while D3 is processed by the light-weight D-block-v1/v2/v3 proposed in Section 3.2.1.
The partially shared weight scheme has two main advantages. First, as D3 has its own independent weight parameters, it can learn features tailored to its branch, which compensates for the accuracy drop brought by the lower computational complexity. Second, it allows us to leave the rest of the network untouched and only modify the heaviest bottleneck block. Also, as the backbone (ResNet-50) dominates the memory consumption (as shown in Fig. 3), the memory overhead introduced here is negligible. Quantitatively, the weight parameter increment is less than 1% of the total weights.
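A minimal sketch of the partially shared classification head is given below, assuming a PyTorch-style interface: D4-D7 reuse one copy of the original 3x3 tower, while D3 gets its own light-weight tower (here, a D-block-v2-style stack). The names, the shared final K×A prediction layer, and the tower depth are illustrative choices rather than details fixed by the paper.

    import torch.nn as nn

    def conv_tower(ch=256, kernels=(3, 3, 3, 3)):
        """A stack of conv+ReLU layers with the given kernel sizes and a fixed channel width."""
        layers = []
        for k in kernels:
            layers += [nn.Conv2d(ch, ch, k, padding=k // 2), nn.ReLU(inplace=True)]
        return nn.Sequential(*layers)

    class PartiallySharedClsHead(nn.Module):
        def __init__(self, num_classes, num_anchors, ch=256):
            super().__init__()
            self.shared_tower = conv_tower(ch)                     # original head, reused for P4-P7
            self.d3_tower = conv_tower(ch, kernels=(1, 3, 1, 3))   # independent light-weight tower for P3
            self.predict = nn.Conv2d(ch, num_classes * num_anchors, 3, padding=1)

        def forward(self, pyramid):  # pyramid = [P3, P4, P5, P6, P7]
            outs = [self.predict(self.d3_tower(pyramid[0]))]
            outs += [self.predict(self.shared_tower(p)) for p in pyramid[1:]]
            return outs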
                     Light-weight block    Detection backend
                                           Classification    Bounding box
    LW-RetinaNet-v1  D-block-v2            √
    LW-RetinaNet-v2  D-block-v3            √
    LW-RetinaNet-v3  D-block-v3            √                 √
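Read as a configuration, the table can be summarized as follows; the dictionary and field names are purely illustrative.

    # The three LW-RetinaNet variants from the table above.
    LW_RETINANET_VARIANTS = {
        "LW-RetinaNet-v1": {"block": "D-block-v2", "branches": ["classification"]},
        "LW-RetinaNet-v2": {"block": "D-block-v3", "branches": ["classification"]},
        "LW-RetinaNet-v3": {"block": "D-block-v3", "branches": ["classification", "bounding box"]},
    }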
5 Conclusion
In this paper, We proposed only to reduce the FLOPs in the heaviest bottleneck layer for
a blockwise-FLOPs-imbalance RetinaNet to get its light-weight version. The proposed so-
lution shows a constantly better mAP-FLOPs trade-off line in a linear degradation trend,
while the input image scaling method degrades in a more exponentially trend. Quantita-
Figure 6: FLOPs and mAP trade-off for input image size scaling versus the proposed method.
Quantitatively, the proposed method yields a 0.1% mAP improvement at 1.15x FLOPs reduction and a 0.3% mAP improvement at 1.8x FLOPs reduction. The proposed method can potentially be applied to any FPN-based blockwise-FLOPs-imbalanced detection network.
References
[1] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on
computer vision, pages 1440–1448, 2015.
[2] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Region-based con-
volutional networks for accurate object detection and segmentation. IEEE transactions
on pattern analysis and machine intelligence, 38(1):142–158, 2016.
[3] Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He.
Detectron. https://github.com/facebookresearch/detectron, 2018.
[4] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo
Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch
sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 770–778, 2016.
[6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Effi-
cient convolutional neural networks for mobile vision applications. arXiv preprint
arXiv:1704.04861, 2017.
[7] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara,
Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al.
Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 7310–7311,
2017.
[8] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ra-
manan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in
context. In European conference on computer vision, pages 740–755. Springer, 2014.
[9] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge
Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[10] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss
for dense object detection. In Proceedings of the IEEE international conference on
computer vision, pages 2980–2988, 2017.
[11] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-
Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European
conference on computer vision, pages 21–37. Springer, 2016.
[12] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking
the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
[13] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. Deep face recognition. In
BMVC, volume 1, page 6, 2015.
[14] Joseph Redmon and Ali Farhadi. YOLO9000: Better, faster, stronger. In Proceedings
of the IEEE conference on computer vision and pattern recognition, pages 7263–7271,
2017.
[15] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint
arXiv:1804.02767, 2018.
[16] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once:
Unified, real-time object detection. In Proceedings of the IEEE conference on computer
vision and pattern recognition, pages 779–788, 2016.
[17] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-
time object detection with region proposal networks. In Advances in neural information
processing systems, pages 91–99, 2015.
[18] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna.
Rethinking the inception architecture for computer vision. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 2818–2826, 2016.
[19] Xiaogang Wang. Intelligent multi-camera video surveillance: A review. Pattern recog-
nition letters, 34(1):3–19, 2013.