1-Resnet Slides
[Figure: part of a very deep network: repeated 1x1 conv, 64 and 3x3 conv, 64 layers]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNets @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
• ImageNet Classification: “Ultra-deep” 152-layer nets
• ImageNet Detection: 16% better than 2nd
• ImageNet Localization: 27% better than 2nd
• COCO Detection: 11% better than 2nd
• COCO Segmentation: 12% better than 2nd
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: ILSVRC classification top-5 error vs. depth, from 28.2 and 25.8 (shallow models) through 16.4 and 11.7 (8-layer nets), 7.3 (19 layers) and 6.7 (22 layers), to the 152-layer ResNet]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: GoogLeNet (22 layers): stacked Conv and MaxPool layers with DepthConcat (Inception) modules, ending in AveragePool, FC, and softmax]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
[Figure: architecture comparison: AlexNet/VGG-style stacks of conv and pooling layers vs. the 152-layer ResNet built from repeated 1x1 and 3x3 conv blocks]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth: engines of visual recognition
[Figure: visual recognition accuracy vs. network depth: 34 (shallow), 58 (8 layers), 66 (16 layers), 86 (101 layers)]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Very simple, easy to follow
• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
• Facebook AI Research’s Torch ResNet:
• Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
• Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
• Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
• Torch, MNIST, 100 layers: blog, code
• A winning entry in Kaggle's right whale recognition challenge: blog, code
• Neon, Place2 (mini), 40 layers: blog, code
• …
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Background
From shallow to deep
Traditional recognition (shallower to deeper):
• pixels → SIFT/HOG → classifier → “bus”?
• pixels → edges → histogram → classifier → “bus”?
• pixels → edges → histogram → K-means / sparse code → classifier → “bus”?
Specialized components, domain knowledge required. But what’s next?
Deep Learning:
• End-to-end learning
• Richer solution space
Spectrum of Depth
5 layers: easy
>10 layers: initialization, Batch Normalization
>30 layers: skip connections
>100 layers: identity skip connections
>1000 layers: ?
Initialization
If:
• linear activation
• $x$, $y$, $w$: independent
Then, for one layer $Y = WX$ (input $X$, weight matrix $W$, output $Y$):
$$\mathrm{Var}[y] = \big(n_{in}\,\mathrm{Var}[w]\big)\,\mathrm{Var}[x]$$
Multi-layer:
$$\mathrm{Var}[y] = \Big(\prod_{l} n_{in}^{l}\,\mathrm{Var}[w_l]\Big)\,\mathrm{Var}[x]$$
where $n_{in}^{l}$, $n_{out}^{l}$ are the fan-in/fan-out of layer $l$.
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Initialization
Both the forward (response) and backward (gradient) signal can vanish or explode.
Forward:
$$\mathrm{Var}[y] = \Big(\prod_{l} n_{in}^{l}\,\mathrm{Var}[w_l]\Big)\,\mathrm{Var}[x]$$
Backward:
$$\mathrm{Var}\!\left[\frac{\partial E}{\partial x}\right] = \Big(\prod_{l} n_{out}^{l}\,\mathrm{Var}[w_l]\Big)\,\mathrm{Var}\!\left[\frac{\partial E}{\partial y}\right]$$
[Figure: signal magnitude vs. depth; curves explode above the ideal and vanish below it]
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
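To make the product formula concrete, here is a small numerical sketch (my own illustration, not from the slides): a random vector is pushed through a purely linear stack, and the output variance is printed for per-layer n·Var[w] below, at, and above 1.

```python
import numpy as np

# Minimal sketch (illustration only): forward variance in a linear stack
# scales like the product of n * Var[w] over the layers.
rng = np.random.default_rng(0)
n, depth = 256, 15
x = rng.standard_normal(n)

for scale in (0.5, 1.0, 2.0):                      # per-layer n * Var[w]
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(scale / n)  # Var[w] = scale / n
        h = W @ h                                   # linear layer, no activation
    print(f"n*Var[w] = {scale}: Var[h] after {depth} layers = {h.var():.3e}")
# scale < 1: variance vanishes; scale > 1: it explodes; scale = 1: roughly preserved.
```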
Initialization
• Initialization under the linear assumption:
$$\prod_l n_{in}^{l}\,\mathrm{Var}[w_l] = \text{const}_{fwd} \quad \text{(healthy forward)}$$
and
$$\prod_l n_{out}^{l}\,\mathrm{Var}[w_l] = \text{const}_{bwd} \quad \text{(healthy backward)}$$
• A sufficient condition is
$$n_{in}^{l}\,\mathrm{Var}[w_l] = 1 \quad \text{or*} \quad n_{out}^{l}\,\mathrm{Var}[w_l] = 1$$
*Since $n_{out}^{l} = n_{in}^{l+1}$, the two products differ only by the depth-independent ratio $\frac{\text{const}_{fwd}}{\text{const}_{bwd}} = \frac{n_{in}^{1}}{n_{out}^{L}} < \infty$, so it is sufficient to use either form.
This is the “Xavier” init in Caffe.
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
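A minimal sketch of the resulting recipe for a fully connected layer (my own code; the function name and the Gaussian choice are assumptions, the variance target n_in·Var[w] = 1 is from the slide):

```python
import numpy as np

def xavier_init(n_in, n_out, rng=None):
    """Gaussian "Xavier"-style init: n_in * Var[w] = 1, i.e. Var[w] = 1 / n_in.
    (Glorot & Bengio also discuss the averaged form 2 / (n_in + n_out).)"""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

W = xavier_init(512, 256)
print(W.var(), 1.0 / 512)   # empirical variance is close to the 1/n_in target
```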
Initialization
• Initialization under ReLU activation:
$$\prod_l \tfrac{1}{2}\, n_{in}^{l}\,\mathrm{Var}[w_l] = \text{const}_{fwd} \quad \text{(healthy forward)}$$
and
$$\prod_l \tfrac{1}{2}\, n_{out}^{l}\,\mathrm{Var}[w_l] = \text{const}_{bwd} \quad \text{(healthy backward)}$$
• A sufficient condition is
$$\tfrac{1}{2}\, n_{in}^{l}\,\mathrm{Var}[w_l] = 1 \quad \text{or} \quad \tfrac{1}{2}\, n_{out}^{l}\,\mathrm{Var}[w_l] = 1$$
• With $D$ layers, a factor of 2 per layer has an exponential impact of $2^D$
This is the “MSRA” init in Caffe.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
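A matching sketch for the ReLU case (again my own code; names are illustrative). The only change from the Xavier sketch above is the extra factor of 2, i.e. Var[w] = 2 / fan_in:

```python
import numpy as np

def msra_init(fan_in, fan_out, rng=None):
    """ "MSRA" / He init for ReLU layers: (1/2) * fan_in * Var[w] = 1,
    i.e. Var[w] = 2 / fan_in."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# For a conv layer, fan_in = k * k * in_channels (e.g. 3 * 3 * 64).
W = msra_init(3 * 3 * 64, 64)
print(W.var(), 2.0 / (3 * 3 * 64))
```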
Initialization
[Figure: training curves for a 22-layer ReLU net (good init converges faster) and a 30-layer ReLU net (good init is able to converge at all), comparing the proposed init $\tfrac{1}{2} n\,\mathrm{Var}[w] = 1$ (“ours”) with Xavier init $n\,\mathrm{Var}[w] = 1$]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
Batch Normalization (BN)
• Normalizing input (LeCun et al 1998 “Efficient Backprop”), extended to normalizing each layer, for each mini-batch
• Improves regularization
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Batch Normalization (BN)
For a layer output $x$:
$$\hat{x} = \frac{x - \mu}{\sigma}, \qquad y = \gamma\hat{x} + \beta$$
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Batch Normalization (BN)
$$\hat{x} = \frac{x - \mu}{\sigma}, \qquad y = \gamma\hat{x} + \beta$$
2 modes of BN:
• Train mode: $\mu$, $\sigma$ are functions of $x$ (the current mini-batch); backprop gradients through them
• Test mode: $\mu$, $\sigma$ are pre-computed* on the training set
Caution: make sure your BN is in the correct mode
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
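A minimal sketch of the two modes (my own illustration; the momentum-style running averages stand in for the pre-computed statistics and are an assumption, not the paper's exact procedure):

```python
import numpy as np

class BatchNorm1D:
    """Minimal sketch of BN's two modes (per-feature; no backprop shown)."""
    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training):
        if training:
            # Train mode: mu, sigma come from the current mini-batch x.
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            # Test mode: mu, sigma are pre-computed (running) statistics.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(8)
x = np.random.randn(32, 8)
y_train = bn.forward(x, training=True)
y_test = bn.forward(x, training=False)   # make sure BN is in the correct mode
```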
Batch Normalization (BN)
[Figure: accuracy vs. training iterations, with and without BN; the BN models reach the best accuracy of the non-BN baseline in far fewer iterations (figure taken from S. Ioffe & C. Szegedy)]
S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Deep Residual Networks
From 10 layers to 100 layers
Going Deeper
• Initialization algorithms ✓
• Batch Normalization ✓
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?
[Figure: CIFAR-10 train and test error (%) vs. iterations (1e4): the 56-layer plain net has higher training and test error than the 20-layer plain net]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?
[Figure: error (%) vs. iterations (1e4) for plain nets; solid: test/val, dashed: train. CIFAR-10: plain-20/32/44/56, where deeper plain nets have higher error. ImageNet-1000: plain-18 vs. plain-34, where the 34-layer plain net has higher error than the 18-layer one]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
A deeper counterpart should not have higher training error.
• A solution by construction:
  • copy the layers from a learned shallower model
  • set the extra layers as identity
  • the deeper model is then at least as good as the shallower one
[Figure: a shallower net and its deeper counterpart sharing the same learned layers]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Plain net: $H(x)$ is any desired mapping; hope the 2 weight layers fit $H(x)$
[Figure: any two stacked layers: $x$ → weight layer → relu → weight layer → relu → $H(x)$]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Residual net: $H(x)$ is any desired mapping; instead of hoping the 2 weight layers fit $H(x)$, let them fit the residual $F(x)$, with
$$H(x) = F(x) + x$$
[Figure: residual block: $x$ → weight layer → relu → weight layer; the output $F(x)$ is added to the identity shortcut $x$, then relu]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• $F(x)$ is a residual mapping w.r.t. identity
• If identity were optimal, it is easy to set the weights to 0
• If the optimal mapping is closer to identity, it is easier to find the small fluctuations
$$H(x) = F(x) + x$$
[Figure: residual block: the $F(x)$ branch (weight layer → relu → weight layer) plus the identity shortcut $x$, followed by relu]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
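A sketch of such a 2-layer residual block in PyTorch (my own code, assuming the shortcut is identity, i.e. the input and output shapes match; projection shortcuts and strides are omitted):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of a 2-layer residual block: H(x) = F(x) + x, then ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.relu(self.bn1(self.conv1(x)))   # weight layer + relu
        f = self.bn2(self.conv2(f))              # weight layer: F(x)
        return self.relu(f + x)                  # H(x) = F(x) + x, then relu

y = BasicBlock(64)(torch.randn(1, 64, 32, 32))   # shape preserved: (1, 64, 32, 32)
```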
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
• Encoding residual vectors; powerful shallower representations.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Network “Design”
• no hidden fc
• no dropout
[Figure: the plain and residual 34-layer nets: 7x7 conv, 64, /2; pool, /2; stacks of 3x3 conv layers from 64 up to 512 channels; fc 1000]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Training
• All plain/residual nets are trained from scratch
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
CIFAR-10 experiments
[Figure: error (%) vs. iterations (1e4); solid: test, dashed: train. CIFAR-10 plain nets (plain-20/32/44/56): deeper plain nets have higher error. CIFAR-10 ResNets (ResNet-20/32/44/56/110): deeper ResNets have lower error]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
[Figure: error (%) vs. iterations (1e4); solid: test, dashed: train. ImageNet plain nets: plain-34 has higher error than plain-18. ImageNet ResNets: ResNet-34 has lower error than ResNet-18]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
• A practical design for going deeper
[Figure: two blocks with similar complexity: an all-3x3 block (64-d input: 3x3, 64 → relu → 3x3, 64) vs. a bottleneck block (256-d input: 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256), the latter used for ResNet-50/101/152]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
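A sketch of the bottleneck block in PyTorch (my own code; the identity shortcut again assumes matching shapes, and the 256/64 channel widths follow the figure):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of the 1x1-3x3-1x1 bottleneck block used in ResNet-50/101/152:
    the first 1x1 conv reduces the width, the 3x3 conv works on fewer channels,
    and the last 1x1 conv restores the width before the identity shortcut."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, 1, bias=False)       # 1x1, 64
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv = nn.Conv2d(mid, mid, 3, padding=1, bias=False)   # 3x3, 64
        self.bn2 = nn.BatchNorm2d(mid)
        self.expand = nn.Conv2d(mid, channels, 1, bias=False)       # 1x1, 256
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.relu(self.bn1(self.reduce(x)))
        f = self.relu(self.bn2(self.conv(f)))
        f = self.bn3(self.expand(f))
        return self.relu(f + x)          # add the identity shortcut, then relu

y = Bottleneck()(torch.randn(1, 256, 14, 14))    # output shape: (1, 256, 14, 14)
```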
ImageNet experiments
• Deeper ResNets have lower error
[Figure: ImageNet validation error (%) vs. depth: 7.4 (ResNet-34), 6.7 (ResNet-50), 6.1 (ResNet-101), 5.7 (ResNet-152); the 152-layer model has lower time complexity than VGG-16/19]
On identity mappings for optimization
$$x_{l+1} = f\big(h(x_l) + F(x_l)\big)$$
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU
• What if $f$ = identity?
[Figure: residual unit: the shortcut $h(x_l)$ is added to the residual branch $F(x_l)$ (two layers), then $f$ is applied]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation
$$x_{l+1} = x_l + F(x_l)$$
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation
$$x_{l+1} = x_l + F(x_l)$$
$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation
$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$
• Any $x_L$ is directly the sum of any shallower $x_l$, plus residuals.
• In contrast, a plain net is multiplicative: $x_L = \big(\prod_{i=l}^{L-1} W_i\big)\, x_l$ (ignoring BN/ReLU).
[Figure: plain vs. residual nets]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth backward propagation
$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$$
$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L}\Big(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\Big)$$
[Figure: plain vs. residual nets]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth backward propagation
$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\Big(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\Big)$$
• Any $\frac{\partial E}{\partial x_L}$ is directly back-propagated to any $\frac{\partial E}{\partial x_l}$, plus residual terms.
• Any $\frac{\partial E}{\partial x_l}$ is additive; unlikely to vanish.
• In contrast to the multiplicative case of a plain net: $\frac{\partial E}{\partial x_l} = \big(\prod_{i=l}^{L-1} W_i\big)\frac{\partial E}{\partial x_L}$
[Figure: plain vs. residual nets]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Residual for every layer
forward: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\Big(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i)\Big)$
Enabled by:
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = identity
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
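The additive-vs-multiplicative contrast can also be seen numerically. The sketch below (my own illustration, a toy fully connected chain rather than a real ResNet) compares the gradient norm at the input of a deep plain chain with that of the same chain wrapped in identity skips:

```python
import torch

torch.manual_seed(0)
depth, n = 50, 64
weights = [torch.randn(n, n) * 0.05 for _ in range(depth)]   # deliberately small weights

def input_grad_norm(residual):
    x = torch.randn(n, requires_grad=True)
    h = x
    for W in weights:
        f = torch.relu(h @ W)
        h = h + f if residual else f     # x_{l+1} = x_l + F(x_l)  vs. plain x_{l+1} = F(x_l)
    h.sum().backward()                   # E = sum of the last layer's outputs
    return x.grad.norm().item()

print("plain chain   :", input_grad_norm(residual=False))  # product of Jacobians, tends to vanish
print("residual chain:", input_grad_norm(residual=True))   # contains the direct (1 + ...) term
```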
Experiments
• Set 1: what if the shortcut mapping $h$ is not identity?
With a constant scaling shortcut $h(x_l) = \lambda x_l$, the backward pass becomes*
$$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L}\Big(\lambda^{L-l} + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} \hat{F}(x_i)\Big)$$
and the $\lambda^{L-l}$ factor can vanish or explode exponentially with depth.
*assuming $f$ = identity
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: training/test curves (solid: test, dashed: train) comparing shortcut types: $h$ as a gate (1x1 conv → sigmoid, scaling the shortcut by $1-g$) vs. $h$ as identity; the identity shortcut gives lower error]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: two unit designs and their curves (solid: test, dashed: train): original ($x_l$ → weight → BN → ReLU → weight → BN → addition → ReLU → $x_{l+1}$) vs. BN after addition ($x_l$ → weight → BN → ReLU → weight → addition → BN → ReLU → $x_{l+1}$)]
$f$ = ReLU
• BN after the addition could block propagation
• Keep the shortest path as smooth as possible
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
1001-layer ResNets on CIFAR-10
[Figure: training/test curves (solid: test, dashed: train) for two unit designs: original, $f$ = ReLU ($x_l$ → weight → BN → ReLU → weight → BN → addition → ReLU) vs. pre-activation, $f$ = identity ($x_l$ → BN → ReLU → weight → BN → ReLU → weight → addition)]
• ReLU as the after-add mapping could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
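A sketch of the pre-activation unit in PyTorch (my own code; identity shortcut assumed). Note that nothing is applied after the addition, so the after-add mapping $f$ is identity:

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Sketch of a pre-activation residual unit (f = identity):
    x_{l+1} = x_l + F(x_l), with F = BN-ReLU-weight-BN-ReLU-weight."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f = self.conv1(self.relu(self.bn1(x)))   # BN -> ReLU -> weight
        f = self.conv2(self.relu(self.bn2(f)))   # BN -> ReLU -> weight
        return x + f                             # nothing after the addition

y = PreActBlock(64)(torch.randn(2, 64, 16, 16))   # (2, 64, 16, 16)
```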
Comparisons on CIFAR-10/100
CIFAR-10: NIN, 8.81% error
CIFAR-100: NIN, 35.68% error
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Summary of observations
• Keep the shortest path as smooth as possible
  • by making $h$ and $f$ identity
  • forward/backward signals directly flow through this path
• Features of any layers are additive outcomes
[Figure: pre-activation residual unit: $x_l$ → BN → ReLU → weight → BN → ReLU → weight → addition → $x_{l+1}$]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Future Works
• Representation
  • skipping 1 layer vs. multiple layers?
  • Flat vs. Bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
• Generalization
  • DropOut, MaxOut, DropConnect, …
  • Drop Layer (Stochastic Depth) [Huang et al 2016]
• Optimization
  • Without residual/shortcut?
[Figure: pre-activation residual unit]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Applications
“Features matter.” (quote from [Girshick et al. 2014], the R-CNN paper)
[Table: competition task, the ResNet-based winner, and its relative margin over the 2nd place]
[Figure: pre-train and fine-tune: a backbone structure is pre-trained as a classification network on ImageNet data; its features are then fine-tuned on target data for a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]
Example: Object Detection
[Figure: R-CNN pipeline: input image → ~2,000 region proposals → 1 CNN for each region → classify regions (e.g. ✓ boat, ✓ person)]
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014
Object Detection: R-CNN
• R-CNN
[Figure: pre-computed Regions-of-Interest (RoIs) on the image; one CNN is run per RoI, producing one feature per region; end-to-end training]
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014
Object Detection: Fast R-CNN
• Fast R-CNN
[Figure: the image goes through one CNN to produce a shared feature map; RoI pooling extracts one feature per pre-computed Region-of-Interest (RoI); end-to-end training]
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
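A minimal sketch of the RoI pooling step using torchvision's roi_pool (my choice of library, not the slides'; the shapes, stride, and boxes below are made-up examples):

```python
import torch
from torchvision.ops import roi_pool

# Fast R-CNN's key idea: run the CNN once, then pool a fixed-size feature
# per RoI from the shared feature map.
feature_map = torch.randn(1, 256, 50, 50)          # one image, stride-16 features
rois = torch.tensor([[0, 100., 100., 300., 260.],  # (batch_idx, x1, y1, x2, y2)
                     [0,  50.,  40., 180., 220.]]) # in input-image coordinates
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)   # (2, 256, 7, 7): one fixed-size feature per RoI
```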
Object Detection
“Plug-in” features, pre-trained on ImageNet data:
• AlexNet
• VGG-16
• GoogleNet
• ResNet-101
• …
Independently developed detectors, trained on detection data:
• R-CNN
• Fast R-CNN
• Faster R-CNN
• MultiBox
• SSD
• …
Object Detection
[Figure: detection pipeline with RoI pooling]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection
• RPN learns proposals by extremely deep nets
• We use only 300 proposals (no hand-designed proposals)
• Add components:
• Iterative localization
• Context modeling
• Multi-scale testing
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
ResNet’s object detection result on COCO
*the original image is from the COCO dataset
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
this video is available online: https://youtu.be/WZmSMkK9VuA
Results on real video. Models trained on MS COCO (80 categories).
(frame-by-frame; no temporal processing)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
More Visual Recognition Tasks
ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation
ResNets have shown outstanding or promising results on:
• Image Generation (Pixel RNN, Neural Art, etc.)
• Natural Language Processing (very deep CNN)
• Speech Recognition (preliminary results)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Resources
• Models and Code
• Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks