Deep Residual Learning
MSRA @ ILSVRC & COCO 2015 competitions
Kaiming He
with Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, & Jian Sun
Microsoft Research Asia (MSRA)
MSRA @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
• ImageNet Classification: “Ultra-deep” (quote Yann) 152-layer nets
• ImageNet Detection: 16% better than 2nd
• ImageNet Localization: 27% better than 2nd
• COCO Detection: 11% better than 2nd
• COCO Segmentation: 12% better than 2nd
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
*improvements are relative numbers
Revolution of Depth
ImageNet Classification top-5 error (%):
ILSVRC'15  ResNet     (152 layers)  3.57
ILSVRC'14  GoogleNet  (22 layers)   6.7
ILSVRC'14  VGG        (19 layers)   7.3
ILSVRC'13             (8 layers)    11.7
ILSVRC'12  AlexNet    (8 layers)    16.4
ILSVRC'11             (shallow)     25.8
ILSVRC'10             (shallow)     28.2
Revolution of Depth
PASCAL VOC 2007 Object Detection mAP (%):
HOG, DPM               (shallow)     34
AlexNet (RCNN)         (8 layers)    58
VGG (RCNN)             (16 layers)   66
ResNet (Faster RCNN)*  (101 layers)  86
*w/ other improvements & more data
Engines of visual recognition
Revolution of Depth
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000
AlexNet, 8 layers
(ILSVRC 2012)
Revolution of Depth
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000
AlexNet, 8 layers
(ILSVRC 2012)
3x3 conv, 64
3x3 conv, 64, pool/2
3x3 conv, 128
3x3 conv, 128, pool/2
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
fc, 4096
fc, 4096
fc, 1000
VGG, 19 layers
(ILSVRC 2014)
[GoogleNet architecture diagram: input -> 7x7 conv /2 -> max pool -> LocalRespNorm -> 1x1 and 3x3 convs -> nine Inception modules (parallel 1x1, 3x3, 5x5 convs and 3x3 max pooling, depth-concatenated), with two auxiliary softmax classifiers along the way, then average pool -> FC -> softmax]
GoogleNet, 22 layers
(ILSVRC 2014)
AlexNet, 8 layers
(ILSVRC 2012)
Revolution of Depth
ResNet, 152 layers
(ILSVRC 2015)
(AlexNet, 8 layers, and VGG, 19 layers, shown to scale as on the previous slides.)
ResNet-152 layer schedule:
7x7 conv, 64, /2, pool/2
[1x1 conv, 64; 3x3 conv, 64; 1x1 conv, 256] x 3
[1x1 conv, 128; 3x3 conv, 128; 1x1 conv, 512] x 8 (first block /2)
[1x1 conv, 256; 3x3 conv, 256; 1x1 conv, 1024] x 36 (first block /2)
[1x1 conv, 512; 3x3 conv, 512; 1x1 conv, 2048] x 3 (first block /2)
ave pool, fc 1000
Revolution of Depth
ResNet, 152 layers
[(there was an animation here: four slides scrolled through the full 152-layer architecture listed above)]
Is learning better networks
as simple as stacking more layers?
Simply stacking layers?
[Plots: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4) for 20-layer and 56-layer plain nets; the 56-layer net is worse on both]
• Plain nets: stacking 3x3 conv layers…
• 56-layer net has higher training error and test error than 20-layer net
Simply stacking layers?
[Plots: error (%) vs. iter. (1e4); solid: test/val, dashed: train. Left, CIFAR-10 plain nets (plain-20/32/44/56): deeper is worse. Right, ImageNet-1000 plain nets (plain-18/34): the 34-layer net is worse than the 18-layer]
• "Overly deep" plain nets have higher training error
• A general phenomenon, observed in many datasets
a shallower model (18 layers):
7x7 conv, 64, /2
[3x3 conv, 64] x 4
[3x3 conv, 128] x 4 (first /2)
[3x3 conv, 256] x 4 (first /2)
[3x3 conv, 512] x 4 (first /2)
fc 1000

a deeper counterpart (34 layers), with "extra" layers:
7x7 conv, 64, /2
[3x3 conv, 64] x 6
[3x3 conv, 128] x 8 (first /2)
[3x3 conv, 256] x 12 (first /2)
[3x3 conv, 512] x 6 (first /2)
fc 1000
• A deeper model should not have higher training error
• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity (see the sketch below)
  • at least the same training error
• Optimization difficulties: solvers cannot find the solution when going deeper…
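To ground the construction argument, here is a small sketch (my PyTorch illustration, not from the slides): a 3x3 conv whose kernel is initialized as a Dirac delta is an exact identity map, so the deeper model starts out computing the same function as the shallower one.

```python
import torch
import torch.nn as nn

# An "extra" 3x3 conv layer initialized as an identity (Dirac delta kernel):
# appending such layers to a learned shallower model changes nothing, so the
# deeper model begins at the shallower model's training error.
extra = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(extra.weight)       # center tap = 1 per channel, all else 0

x = torch.relu(torch.randn(2, 64, 8, 8))   # post-ReLU (non-negative) activations
assert torch.allclose(torch.relu(extra(x)), x, atol=1e-6)  # still the identity
```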
Deep Residual Learning
• Plain net
[Diagram: any two stacked layers; 𝑥 -> weight layer -> relu -> weight layer -> relu -> 𝐻(𝑥)]
𝐻(𝑥) is any desired mapping;
hope the 2 weight layers fit 𝐻(𝑥)
Deep Residual Learning
• Residual net
𝐻(𝑥) is any desired mapping;
instead of hoping the 2 weight layers fit 𝐻(𝑥),
hope they fit 𝐹(𝑥), and let 𝐻(𝑥) = 𝐹(𝑥) + 𝑥
[Diagram: 𝑥 -> weight layer -> relu -> weight layer -> 𝐹(𝑥); an identity shortcut carries 𝑥 around both layers; output 𝐻(𝑥) = 𝐹(𝑥) + 𝑥, followed by relu; see the sketch below]
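As a concrete sketch of this block (my PyTorch illustration, not the authors' original code; placing BatchNorm after each conv follows the paper):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers learn the residual F(x); the identity shortcut
    adds x back, so the block outputs H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(f + x)                                       # H(x) = F(x) + x
```

If the weights drive 𝐹(𝑥) to zero, the block degenerates to the identity, which is exactly the easy case discussed next.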
Deep Residual Learning
• 𝐹(𝑥) is a residual mapping w.r.t. identity
• If identity were optimal, it is easy to set the weights to 0
• If the optimal mapping is closer to identity, it is easier to find the small fluctuations
[Same block diagram as above: identity shortcut with 𝐻(𝑥) = 𝐹(𝑥) + 𝑥]
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
• Encoding residual vectors; powerful shallower representations.
• Product Quantization (IVF-ADC) [Jegou et al 2011]
• Quantizing residual vectors; efficient nearest-neighbor search.
• MultiGrid & Hierarchical Precondition [Briggs, et al 2000], [Szeliski 1990, 2006]
• Solving residual sub-problems; efficient PDE solvers.
plain net / ResNet, 34 layers each (same layer schedule; the ResNet adds an identity shortcut around every pair of 3x3 convs):
7x7 conv, 64, /2
pool, /2
[3x3 conv, 64] x 6
[3x3 conv, 128] x 8 (first /2)
[3x3 conv, 256] x 12 (first /2)
[3x3 conv, 512] x 6 (first /2)
avg pool, fc 1000
Network “Design”
• Keep it simple
• Our basic design (VGG-style)
• all 3x3 conv (almost)
• spatial size /2 => # filters x2 (see the sketch after this list)
• Simple design; just deep!
• Other remarks:
• no max pooling (almost)
• no hidden fc
• no dropout
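A hedged sketch (my illustration, not the authors' code) of how these rules compose in a downsampling block: when spatial size halves and filter count doubles, the shortcut no longer matches shapes, so a strided 1x1 projection stands in for the identity (the paper's option B).

```python
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Halve spatial size and double the channels; the shortcut becomes a
    strided 1x1 projection so it can be added to the block output."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch * 2                      # spatial size /2 => # filters x2
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)  # projection shortcut

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.proj(x))
```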
Training
• All plain/residual nets are trained from scratch
• All plain/residual nets use Batch Normalization
• Standard hyper-parameters & augmentation (sketched below)
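A rough sketch of what those standard settings look like (my illustration; the paper reports SGD with momentum 0.9, weight decay 1e-4, and a learning rate of 0.1 divided by 10 when the error plateaus; the milestones and epoch count below are hypothetical):

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = ResidualBlock(64)   # stand-in for a full plain/residual net (sketch above)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)  # lr /10 twice

for epoch in range(90):
    # ... one epoch over augmented data (e.g. random crop + horizontal flip) ...
    scheduler.step()
```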
CIFAR-10 experiments
[Plots: CIFAR-10 error (%) vs. iter. (1e4); solid: test, dashed: train. Left, plain nets (plain-20/32/44/56): deeper is worse. Right, ResNets (ResNet-20/32/44/56/110): deeper is better]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
ImageNet experiments
[Plots: ImageNet error (%) vs. iter. (1e4); solid: test, dashed: train. Left, plain nets (plain-18/34): the 34-layer net is worse than the 18-layer. Right, ResNets (ResNet-18/34): the 34-layer net is better]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
ImageNet experiments
• A practical design of going deeper
[Diagram: two residual block designs of similar complexity.
all-3x3, on a 64-d input: 3x3, 64 -> relu -> 3x3, 64
bottleneck, on a 256-d input (for ResNet-50/101/152): 1x1, 64 -> relu -> 3x3, 64 -> relu -> 1x1, 256
(sketched below)]
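A minimal sketch of the bottleneck block (my PyTorch illustration; channel sizes follow the 256-d example above): the 1x1 layers reduce and restore the width so the 3x3 conv runs on only 64 channels.

```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore; the 3x3 conv runs on 64 channels,
    keeping cost similar to the all-3x3 block on a 64-d input."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, 1, bias=False)            # 1x1, 64
        self.bn1 = nn.BatchNorm2d(reduced)
        self.conv2 = nn.Conv2d(reduced, reduced, 3, padding=1, bias=False)  # 3x3, 64
        self.bn2 = nn.BatchNorm2d(reduced)
        self.conv3 = nn.Conv2d(reduced, channels, 1, bias=False)            # 1x1, 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)   # identity shortcut on the 256-d path
```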
ImageNet experiments
10-crop testing, top-5 val error (%):
ResNet-34   7.4
ResNet-50   6.7
ResNet-101  6.1
ResNet-152  5.7
(ResNet-152 still has lower time complexity than VGG-16/19)
• Deeper ResNets have lower error
ImageNet experiments
[Same chart as the opening slide: ImageNet Classification top-5 error (%) falling from 28.2 (ILSVRC'10, shallow) to 3.57 (ILSVRC'15, 152-layer ResNet)]
Just classification?
A treasure from ImageNet is the learned features.
“Features matter.” (quote [Girshick et al. 2014], the R-CNN paper)
task                                 2nd-place winner   MSRA   margin (relative)
ImageNet Localization (top-5 error)  12.0               9.0    27%
ImageNet Detection (mAP@.5)          53.6               62.1   16%  (absolute 8.5% better!)
COCO Detection (mAP@.5:.95)          33.5               37.3   11%
COCO Segmentation (mAP@.5:.95)       25.1               28.2   12%
• Our results are all based on ResNet-101
• Our features transfer well
Object Detection (brief)
• Simply “Faster R-CNN + ResNet”
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
[Faster R-CNN pipeline: image -> CNN -> feature map -> Region Proposal Net -> proposals; RoI pooling over the shared feature map feeds the classifier (sketched below)]
COCO detection results:
baseline     mAP@.5   mAP@.5:.95
VGG-16       41.5     21.5
ResNet-101   48.4     27.2
(ResNet has 28% relative gain)
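To make the RoI pooling step concrete, a small sketch using torchvision's off-the-shelf op (my illustration, not the original implementation; the 1/16 scale assumes stride-16 features):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 1024, 50, 50)   # shared CNN features for one image
# one proposal per row: (batch_index, x1, y1, x2, y2) in input-image coordinates
proposals = torch.tensor([[0.0, 10.0, 20.0, 200.0, 180.0]])
# pool each proposal into a fixed 7x7 grid for the downstream classifier
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)   # torch.Size([1, 1024, 7, 7])
```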
Object Detection (brief)
• RPN learns proposals with extremely deep nets
• We use only 300 proposals (no SS/EB/MCG!)
• Add what is just missing in Faster R-CNN…
• Iterative localization
• Context modeling
• Multi-scale testing
• All are based on CNN features; all are end-to-end (train and/or inference)
• All benefit more from deeper features – cumulative gains!
Our results on COCO – too many objects, let’s check carefully!
[Three slides of detection result images followed; the original images are from the COCO dataset]
Jifeng Dai, Kaiming He, & Jian Sun. “Instance-aware Semantic Segmentation via Multi-task Network Cascades”. arXiv 2015.
Instance Segmentation (brief)
[Pipeline: image -> CONVs -> conv feature map; for each RoI: RoI warping + pooling -> FCs -> box instances (RoIs); for each RoI: masking -> FCs -> mask instances; CONVs -> categorized instances (person, person, person, horse)]
• Solely CNN-based (“features matter”)
• Differentiable RoI warping layer (w.r.t. box coordinates; see the sketch below)
• Multi-task cascades, exact end-to-end training
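A hedged sketch of what "differentiable w.r.t. box coordinates" can look like (my bilinear-sampling illustration, not the MNC implementation): the box parameterizes an affine sampling grid, so gradients flow back into the box itself.

```python
import torch
import torch.nn.functional as F

def roi_warp(feature, box, out_size=(14, 14)):
    """Differentiably crop-and-resize `feature` (1,C,H,W) to `box`,
    given as (x1, y1, x2, y2) in normalized [-1, 1] coordinates."""
    x1, y1, x2, y2 = box
    zero = torch.zeros(())
    # affine transform mapping the output grid onto the box region
    theta = torch.stack([
        torch.stack([(x2 - x1) / 2, zero, (x1 + x2) / 2]),
        torch.stack([zero, (y2 - y1) / 2, (y1 + y2) / 2]),
    ]).unsqueeze(0)
    grid = F.affine_grid(theta, size=(1, feature.shape[1], *out_size),
                         align_corners=False)
    return F.grid_sample(feature, grid, align_corners=False)

feature = torch.randn(1, 256, 32, 32)
box = torch.tensor([-0.5, -0.5, 0.5, 0.5], requires_grad=True)
roi_warp(feature, box).sum().backward()
print(box.grad is not None)   # True: gradients reach the box coordinates
```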
[Segmentation result image; the original image is from the COCO dataset]
Conclusions
• Deeper is still better
• “Features matter”!
• Faster R-CNN is just amazing
Kaiming He Xiangyu Zhang Shaoqing Ren
Jifeng Dai Jian Sun
MSRA team
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Jifeng Dai, Kaiming He, & Jian Sun. “Instance-aware Semantic Segmentation via Multi-task Network Cascades”. arXiv 2015.