[2020 CVPR Efficient DET paper review]

Presented by ChanHyuk Lee
2021/06/13
Computer Graphics @ Korea University
EfficientDet
MingxingTan et al.
CVPR 2020
517 citation
1/

CONTENTS
Introduction
01
Related work
02
Proposed method
03
Experiments
04
Ablation study
05
Conclusion
05
2

3
Background
Detection architecture
00
Backbone network FPN Prediction Network
Box prediction
(Regression)
Class prediction
(Classification)
Backbone network Feature Pyramid Network Prediction network

Introduction
• Recent detectors have the trade-off between accuracy and efficiency
• Most previous works only focus on a specific or a small range of resource requirements
• This points make hard to apply the recent detection models on industry field
• “Is it possible to build a scalable detection architecture with both higher
accuracy and better efficiency across a wide spectrum of resource constraints?”
Motivation
01
4

Introduction
Challenge 1. Efficient multi-scale feature fusion
01
5
• Feature fusion : The method for combining feature maps
→ Normal feature fusion methods don’t care about feature resolution.
Challenge 2 : Model scaling
• Model scaling : The method for up-scaling the model architecture
→ Limitation of up-scaling by considering one factor
Input-image up-scaling
Network up-scaling
02

Related work
Multi-scale feature representation
01
Conv
Conv
Conv
Conv
Up scaling
Up scaling
Up scaling
1x1 Conv
1x1 Conv
1x1 Conv
1x1 Conv
Prediction
Prediction
Prediction
Prediction
Backbone
Feature
pyramid
𝒑𝟒𝒐𝒖𝒕
𝒑𝟑𝒐𝒖𝒕
𝒑𝟐𝒐𝒖𝒕
𝒑𝟏𝒐𝒖𝒕
𝒑𝟒
𝒑𝟑
𝒑𝟐
𝒑𝟏
7
• For considering multi-scale object
Area
Prediction
layer

Related work
Model scaling
02
• EfficientNet (EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, Mingxing Tan et al, ICML 2019)
• Jointly Scale up the depth, width, resolution (Compound scaling)
8
𝑓
𝑓
𝑓
𝑓
𝐷𝑒𝑝𝑡ℎ
𝐼𝑛𝑝𝑢𝑡 𝑟𝑒𝑠𝑜𝑙𝑢𝑡𝑖𝑜𝑛

Proposed method
01 RetinaNet architecture
10
02 EfficientDet architecture

BiFPN : Efficient bidirectional cross-scale connections and weighted feature fusion
Problem formulation
01
11
• Delete two blocks (compared to PANet)
• Add skip connection
• Weighted feature fusion
• Repeat BiFPN Layers
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤
𝑤

BiFPN
Weighted Feature Fusion
02
• The difference of Resolution between Inputs → Different degrees of contribution to output
• Gave each input feature a weight to learn the contribution of the input feature.
𝑶𝒖𝒕𝒑𝒖𝒕
𝒇𝒆𝒂𝒕𝒖𝒓𝒆
𝑾𝒆𝒊𝒈𝒉𝒕𝒊 𝑰𝒏𝒑𝒖𝒕
𝒇𝒆𝒂𝒕𝒖𝒓𝒆𝒊
𝑺𝒐𝒇𝒕𝒎𝒂𝒙 − 𝒃𝒂𝒔𝒆𝒅 𝒇𝒖𝒔𝒊𝒐𝒏 𝑭𝒂𝒔𝒕 𝒏𝒐𝒓𝒎𝒂𝒍𝒊𝒛𝒆𝒅 𝒇𝒖𝒔𝒊𝒐𝒏
(30% Speed Gain in GPU)
12

EfficientDet
EfficientDet Architecture
01
• Using the efficientNet trained by ImageNet Data as backbone
• The Prediction layer network’s weights is shared for all Level features
13

EfficientDet
Compound scaling
02
• Previous works mostly scale up baseline network or using larger image inputs, stacking
more FPN layers
• New compound scaling method jointly scale up all dimensions of backbone network, BiFPN
network, prediction network and resolution of input.
Backbone network
02-1
• Reuse the same width/depth scaling coefficients of EfficientNet-B0 to B6
BiFPN network
02-2
• Perform grid search for finding best factor value on a list of values {1.2, 1.25, 1.3, 1.35, 1.4, 1.45}
𝑇ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑐ℎ𝑎𝑛𝑛𝑒𝑙 𝑇ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑙𝑎𝑦𝑒𝑟
14

EfficientDet
Prediction network
02-3
• The width of network is same as BiFPN network's width
𝑇ℎ𝑒 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑙𝑎𝑦𝑒𝑟
Input image resolution
02-4
Overall scaling output
02-5
15

Experiments
Experiment configuration
01
• Dataset : COCO 2017 datasets with 118K images
• Optimizer : SGD with momentum 0.9 and weight decay 4e-5
• Learning Rate : 0 to 0.16 (First epoch), annealed down using cosine decay rule (0~0.16 𝑟𝑒𝑝𝑒𝑎𝑡)
• Batch normalization is used after every convolution layer
• Every convolution layer is depth-wise conv layer
• Activation function : Swish (𝑥 ∗ 𝑆𝑖𝑔𝑚𝑜𝑖𝑑(𝛽𝑥))
• Augmentation : Multi-resolution cropping / scaling / flipping
17

Experiments
Loss function
02
• Using Focal-loss for detection
• Class imbalanced problem is most effected by easy negative samples
• Training by focusing on hard samples
• If 𝑝𝑡 is almost 1 → − 1 − 0.999 𝑟
𝑙𝑜𝑔 𝑝𝑡 ≈ 0
• Else → − 1 − 0.001 𝑟 𝑙𝑜𝑔(𝑝𝑡) ≈ ∞
𝑃𝑟𝑜𝑏𝑎𝑏𝑖𝑙𝑖𝑡𝑦 𝑜𝑓
𝑐𝑙𝑎𝑠𝑠𝑖𝑓𝑖𝑐𝑎𝑡𝑖𝑜𝑛
18

Experiments
Performance on COCO
03
• Latency is inference latency with batch size 1
• AA denotes Auto-Augmentation
19

Experiments
Model size and inference latency comparison
04
• The comparison result of using GPU (Titan-V), CPU (Xeon)
20

Experiments
EfficientDet for Semantic Segmentation
05
• Use P2 Layer in BiFPN for semantic segmentation in EfficientDet-D4 model
DeepLabv3
21

Ablation study
Disentangling Backbone and BiFPN
01
• The Backbone network and multi-feature network of EfficientDet achieves higher AP and
Efficiency than prior networks
22

Ablation study
BiFPN Cross Scale Connection
02
• For the fair comparison, FPN and PANet are repeated multiple times and change the conv.
• BiFPN achieves the best accuracy with fewer parameters and FLOPs
23

Ablation study
Softmax vs Fast Normalized fusion
03
• Fast normalized fusion approach achieves similar accuracy as the softmax-based method
• Figure 5 illustrates the learned weights for three feature fusion nodes
24

Ablation study
Compound Scaling
04
• EfficientDet jointly scale up the network’s backbone, BiFPN, prediction net, input resolution
• The proposed method achieves the best accuracy than other scaling method
25

Conclusion
Propose the weight bidirectional feature network and customized compound scaling
method, in order to improve accuracy and efficiency
01
EfficientDet achieves better accuracy and efficiency than the prior art across a wide
spectrum of resource constrains
02
EfficientDet achieves SOTA accuracy with much fewer parameters and FLOPs in object
detection and semantic segmentation
03
26

[2020 CVPR Efficient DET paper review]

More Related Content

What's hot (19)

Similar to [2020 CVPR Efficient DET paper review] (20)

More from taeseon ryu (20)

Recently uploaded (20)

[2020 CVPR Efficient DET paper review]