
Deep Residual Networks
Deep Learning Gets Way Deeper

Kaiming He
Facebook AI Research*

ICML 2016 tutorial
8:30-10:30am, June 19

*as of July 2016. Formerly affiliated with Microsoft Research Asia

[Title-slide background figure: the full 152-layer ResNet architecture: 7x7 conv, 64, /2, pool /2, followed by stacked 1x1/3x3/1x1 bottleneck blocks (64/256, 128/512, 256/1024, and 512/2048 channels), ending with average pooling and a 1000-way fc layer]

Overview
• Introduction
• Background
• From shallow to deep
• Deep Residual Networks
• From 10 layers to 100 layers
• From 100 layers to 1000 layers
• Applications
• Q&A
Introduction
Deep Residual Networks (ResNets)
• “Deep Residual Learning for Image Recognition”. CVPR 2016 (next week)

• A simple and clean framework for training “very” deep nets

• State-of-the-art performance for


• Image classification
• Object detection
• Semantic segmentation
• and more…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNets @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
• ImageNet Classification: “Ultra-deep” 152-layer nets
• ImageNet Detection: 16% better than 2nd
• ImageNet Localization: 27% better than 2nd
• COCO Detection: 11% better than 2nd
• COCO Segmentation: 12% better than 2nd

*improvements are relative numbers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth

[Figure: ImageNet Classification top-5 error (%) across ILSVRC years]
  ILSVRC'10 (shallow): 28.2    ILSVRC'11 (shallow): 25.8
  ILSVRC'12 AlexNet, 8 layers: 16.4    ILSVRC'13, 8 layers: 11.7
  ILSVRC'14 VGG, 19 layers: 7.3    ILSVRC'14 GoogleNet, 22 layers: 6.7
  ILSVRC'15 ResNet, 152 layers: 3.57


Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012):
  11x11 conv, 96, /4, pool/2
  5x5 conv, 256, pool/2
  3x3 conv, 384
  3x3 conv, 384
  3x3 conv, 256, pool/2
  fc, 4096
  fc, 4096
  fc, 1000
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012)    VGG, 19 layers (ILSVRC 2014)    GoogleNet, 22 layers (ILSVRC 2014)

[Figure: side-by-side architecture diagrams of AlexNet, VGG-19, and GoogleNet (with its inception modules), illustrating the growth in depth]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012)    VGG, 19 layers (ILSVRC 2014)    ResNet, 152 layers (ILSVRC 2015)

[Figure: side-by-side architecture diagrams of AlexNet, VGG-19, and the 152-layer ResNet: 7x7 conv, 64, /2, pool /2, followed by stacked 1x1/3x3/1x1 bottleneck blocks, ending with average pooling and a 1000-way fc layer]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
Engines of visual recognition

[Figure: PASCAL VOC 2007 Object Detection mAP (%) vs. feature depth]
  HOG, DPM (shallow): 34    AlexNet (RCNN), 8 layers: 58
  VGG (RCNN), 16 layers: 66    ResNet (Faster RCNN)*, 101 layers: 86

*w/ other improvements & more data
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNet’s object detection result on COCO
*the original image is from the COCO dataset

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Very simple, easy to follow
• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
• Facebook AI Research’s Torch ResNet: https://github.com/facebook/fb.resnet.torch
• Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
• Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
• Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
• Torch, MNIST, 100 layers: blog, code
• A winning entry in Kaggle's right whale recognition challenge: blog, code
• Neon, Place2 (mini), 40 layers: blog, code
• …

• Easily reproduced results (e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)


• A series of extensions and follow-ups
• > 200 citations in the 6 months after being posted on arXiv (Dec. 2015)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Background
From shallow to deep

Traditional recognition (shallower to deeper):
  pixels → classifier → "bus"?
  pixels → edges → classifier → "bus"?
  pixels → edges → histogram (SIFT/HOG) → classifier → "bus"?
  pixels → edges → histogram → K-means / sparse code → classifier → "bus"?
  Specialized components, domain knowledge required

But what's next? Deep Learning:
  Generic components ("layers"), less domain knowledge
  Repeat elementary layers => going deeper
  • End-to-end learning
  • Richer solution space
Spectrum of Depth
5 layers: easy
>10 layers: initialization, Batch Normalization
>30 layers: skip connections
>100 layers: identity skip connections
>1000 layers: ?

(shallower → deeper)
Initialization

Setup: a weight layer W maps input X (fan-in n_in) to output Y = WX (fan-out n_out).

If:
• Linear activation
• x, y, w: independent
Then:
1-layer:
  Var[y] = (n_in · Var[w]) · Var[x]
Multi-layer:
  Var[y] = (∏_l n_l^in · Var[w_l]) · Var[x]
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Initialization
Both forward (response) and backward (gradient) signals can vanish/explode

Forward:
  Var[y] = (∏_l n_l^in · Var[w_l]) · Var[x]
Backward:
  Var[∂/∂x] = (∏_l n_l^out · Var[w_l]) · Var[∂/∂y]

[Figure: signal magnitude vs. depth (1 to 15 layers), showing exploding, ideal, and vanishing regimes]
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
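[Illustration, not from the slides] A tiny NumPy sketch of this behavior: each linear layer multiplies the signal variance by roughly n·Var[w], so the product over layers explodes or vanishes unless it is kept near 1. The layer sizes and depth below are arbitrary example values.

# Illustrative sketch: forward signal variance in a deep linear net for three
# choices of n*Var[w], matching Var[y] = (prod_l n_l Var[w_l]) Var[x].
import numpy as np

def signal_variance(depth=15, n=256, scale=1.0, seed=0):
    """Propagate a random input through `depth` linear layers whose weights
    have Var[w] = scale / n, and return the output variance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(scale / n)  # n * Var[w] = scale
        x = W @ x
    return x.var()

for scale in (0.5, 1.0, 2.0):  # n*Var[w] < 1: vanishing; = 1: ideal; > 1: exploding
    print(f"n*Var[w] = {scale}: output variance ~ {signal_variance(scale=scale):.3e}")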
Initialization
• Initialization under linear assumption

  ∏_l n_l^in · Var[w_l] = const_fwd   (healthy forward)
  and
  ∏_l n_l^out · Var[w_l] = const_bwd   (healthy backward)

A sufficient condition is either
  n_l^in · Var[w_l] = 1
or*
  n_l^out · Var[w_l] = 1
(the "Xavier" init in Caffe).

*: n_l^out = n_{l+1}^in, so const_fwd / const_bwd = n_1^in / n_L^out < ∞. It is sufficient to use either form.
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Initialization
• Initialization under ReLU activation

  ∏_l (1/2) n_l^in · Var[w_l] = const_fwd   (healthy forward)
  and
  ∏_l (1/2) n_l^out · Var[w_l] = const_bwd   (healthy backward)

A sufficient condition is either
  (1/2) n_l^in · Var[w_l] = 1
or
  (1/2) n_l^out · Var[w_l] = 1
(the "MSRA" init in Caffe).

With D layers, a factor of 2 per layer has an exponential impact of 2^D.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
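[Illustration, not from the slides] A minimal NumPy sketch of the two fan-in conditions above; the function names and the example layer shape are illustrative assumptions, not part of the tutorial.

# "Xavier" (linear) vs. "MSRA" (ReLU) initialization, fan-in form:
# Var[w] = 1 / n_in  and  Var[w] = 2 / n_in respectively.
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    # n_in * Var[w] = 1  (healthy forward prop under a linear activation)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def msra_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    # (1/2) * n_in * Var[w] = 1  (healthy forward prop under ReLU)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

# Example: a 3x3 conv with 64 input channels has fan-in n_in = 3*3*64
W = msra_init(n_in=3 * 3 * 64, n_out=64)
print(W.std())  # ~ sqrt(2 / 576) ~ 0.059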
Initialization

[Figure: training curves at the beginning of training, comparing (1/2) n Var[w] = 1 (ours, MSRA) with n Var[w] = 1 (Xavier)]
• 22-layer ReLU net: good init converges faster
• 30-layer ReLU net: good init is able to converge

*Figures show the beginning of training

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
Batch Normalization (BN)
• Normalizing input (LeCun et al 1998 “Efficient Backprop”)

• BN: normalizing each layer, for each mini-batch

• Greatly accelerate training

• Less sensitive to initialization

• Improve regularization

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Batch Normalization (BN)

  x̂ = (x − μ) / σ,    y = γ x̂ + β

• μ: mean of x in mini-batch
• σ: std of x in mini-batch
• γ: scale
• β: shift

• μ, σ: functions of x, analogous to responses
• γ, β: parameters to be learned, analogous to weights

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Batch Normalization (BN)

  x̂ = (x − μ) / σ,    y = γ x̂ + β

2 modes of BN:
• Train mode:
  • μ, σ are functions of x; backprop gradients
• Test mode:
  • μ, σ are pre-computed* on the training set

Caution: make sure your BN is in the correct mode

*: by running average, or by post-processing after training

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
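[Illustration, not from the slides] A minimal sketch of the two BN modes; the class name, the 1-D layout, and the momentum value are illustrative assumptions rather than details from the tutorial.

# Train mode uses mini-batch statistics; test mode uses statistics
# pre-computed on the training set (here, a running average).
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)    # scale, learned
        self.beta = np.zeros(num_features)    # shift, learned
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)        # mini-batch statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var  # pre-computed statistics
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
out = bn(np.random.randn(32, 4), training=True)   # make sure the mode is correct!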
Batch Normalization (BN)

[Figure: accuracy vs. iteration with and without BN; the BN curve rises faster and higher. Figure taken from [S. Ioffe & C. Szegedy]]

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Deep Residual Networks
From 10 layers to 100 layers
Going Deeper
• Initialization algorithms ✓
• Batch Normalization ✓

• Is learning better networks as simple as stacking more layers?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?

[Figure: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4) for plain 20-layer and 56-layer nets; the 56-layer curves sit above the 20-layer curves in both plots]

• Plain nets: stacking 3x3 conv layers…


• 56-layer net has higher training error and test error than 20-layer net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?

[Figure: error (%) vs. iter. (1e4); left: CIFAR-10 plain-20/32/44/56, right: ImageNet-1000 plain-18/34; solid: test/val, dashed: train. Deeper plain nets show higher error throughout]

• “Overly deep” plain nets have higher training error


• A general phenomenon, observed in many datasets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
[Figure: a shallower model (18 layers) and a deeper counterpart (34 layers); the deeper model contains the shallower model's layers plus "extra" layers]

• Richer solution space

• A deeper model should not have higher training error

• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity
  • at least the same training error

• Optimization difficulties: solvers cannot find the solution when going deeper…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Plain net

[Diagram: x → weight layer → relu → weight layer → relu → H(x); any two stacked layers]

H(x) is any desired mapping; hope the 2 weight layers fit H(x)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Residual net

[Diagram: x → weight layer → relu → weight layer, computing F(x); an identity shortcut carries x around them; the sum passes through relu]

H(x) is any desired mapping;
instead of hoping the 2 weight layers fit H(x), hope they fit F(x),
and let H(x) = F(x) + x

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• F(x) is a residual mapping w.r.t. identity

• If identity were optimal, it is easy to set the weights to 0

• If the optimal mapping is closer to identity, it is easier to find the small fluctuations

[Diagram: residual block computing H(x) = F(x) + x]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
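[Illustration, not from the slides] A minimal sketch of such a residual block, written with plain matrices instead of the conv layers in the slide. It also shows why identity is easy to represent: zeroing the residual branch's last weight layer makes the block (up to the after-add ReLU) an identity.

# Residual block: two weight layers fit F(x), output is relu(F(x) + x).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """x: (n,) input; W1, W2: (n, n) weight layers. Returns relu(F(x) + x)."""
    f = W2 @ relu(W1 @ x)   # F(x): weight -> relu -> weight
    return relu(f + x)      # H(x) = F(x) + x, then the after-add ReLU

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
W1 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # MSRA-style init
W2 = np.zeros((n, n))                                # F(x) = 0 => block acts as identity
print(np.allclose(residual_block(x, W1, W2), relu(x)))  # True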
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
• Encoding residual vectors; powerful shallower representations.

• Product Quantization (IVF-ADC) [Jegou et al 2011]


• Quantizing residual vectors; efficient nearest-neighbor search.

• MultiGrid & Hierarchical Precondition [Briggs, et al 2000], [Szeliski 1990, 2006]


• Solving residual sub-problems; efficient PDE solvers.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Network "Design"

[Figure: plain net vs. ResNet; the same 34-layer VGG-style stack of 3x3 conv layers, with identity shortcuts added in the ResNet]

• Keep it simple

• Our basic design (VGG-style)
  • all 3x3 conv (almost)
  • spatial size /2 => # filters x2 (~same complexity per layer)
  • Simple design; just deep!

• Other remarks:
  • no hidden fc
  • no dropout

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Training
• All plain/residual nets are trained from scratch

• All plain/residual nets use Batch Normalization

• Standard hyper-parameters & augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
CIFAR-10 experiments

[Figure: error (%) vs. iter. (1e4); left: CIFAR-10 plain nets (plain-20/32/44/56), where deeper is worse; right: CIFAR-10 ResNets (ResNet-20/32/44/56/110), where deeper is better. Solid: test, dashed: train]

• Deep ResNets can be trained without difficulties


• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments

[Figure: error (%) vs. iter. (1e4); left: ImageNet plain nets (plain-18/34), where the 34-layer net is worse; right: ImageNet ResNets (ResNet-18/34), where the 34-layer net is better. Solid: test, dashed: train]

• Deep ResNets can be trained without difficulties


• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
• A practical design for going deeper

[Diagram: two residual blocks of similar complexity]
  all-3x3 (on a 64-d feature):                            3x3, 64 → relu → 3x3, 64 → relu
  bottleneck (on a 256-d feature), for ResNet-50/101/152: 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256 → relu

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
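[Illustration, not from the slides] A quick back-of-the-envelope parameter count for the two blocks above, showing why the bottleneck keeps similar cost while exposing a 4x wider (256-d) feature; the helper name is an illustrative choice.

# Compare weight counts of the all-3x3 block and the bottleneck block.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out  # weights of a k x k conv (biases ignored)

# all-3x3 block on a 64-d feature: 3x3,64 -> 3x3,64
all_3x3 = conv_params(3, 64, 64) + conv_params(3, 64, 64)

# bottleneck block on a 256-d feature: 1x1,64 -> 3x3,64 -> 1x1,256
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)

print(all_3x3, bottleneck)  # 73728 vs 69632: roughly the same cost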
ImageNet experiments
• Deeper ResNets have lower error

[Figure: 10-crop testing, top-5 val error (%)]
  ResNet-34: 7.4    ResNet-50: 6.7    ResNet-101: 6.1    ResNet-152: 5.7
  (ResNet-152 has lower time complexity than VGG-16/19)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments

[Figure: ImageNet Classification top-5 error (%) across ILSVRC years]
  ILSVRC'10 (shallow): 28.2    ILSVRC'11 (shallow): 25.8
  ILSVRC'12 AlexNet, 8 layers: 16.4    ILSVRC'13, 8 layers: 11.7
  ILSVRC'14 VGG, 19 layers: 7.3    ILSVRC'14 GoogleNet, 22 layers: 6.7
  ILSVRC'15 ResNet, 152 layers: 3.57


Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Discussions
Representation, Optimization, Generalization
Issues on learning deep models
• Representation ability
  • Ability of the model to fit training data, if the optimum could be found
  • If model A's solution space is a superset of B's, A should be better
• Optimization ability
  • Feasibility of finding an optimum
  • Not all models are equally easy to optimize
• Generalization ability
  • Once training data is fit, how good is the test performance
How do ResNets address these issues?
• Representation ability
  • No explicit advantage on representation (only re-parameterization), but
  • Allow models to go deeper
• Optimization ability
  • Enable very smooth forward/backward prop
  • Greatly ease optimizing deeper models
• Generalization ability
  • Do not explicitly address generalization, but
  • Deeper + thinner is good for generalization
On the Importance of Identity Mapping
From 100 layers to 1000 layers

On identity mappings for optimization

[Diagram: x_l feeds two weight layers computing F(x_l), plus a shortcut h(x_l); the sum passes through f]
• shortcut mapping: h = identity
• after-add mapping: f = ReLU

  x_{l+1} = f(h(x_l) + F(x_l))

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
On identity mappings for optimization

[Diagram: x_l feeds two weight layers computing F(x_l), plus a shortcut h(x_l); the sum passes through f]
• shortcut mapping: h = identity
• after-add mapping: f = ReLU
• What if f = identity?

  x_{l+1} = f(h(x_l) + F(x_l))

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation

  x_{l+1} = x_l + F(x_l)
  x_{l+2} = x_{l+1} + F(x_{l+1})
          = x_l + F(x_l) + F(x_{l+1})
  …
  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation

  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

• Any x_l is directly forward-propagated to any x_L, plus residual.
• Any x_L is an additive outcome.
• in contrast to multiplicative: x_L = (∏_{i=l}^{L-1} W_i) x_l

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth backward propagation

  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

  ∂E/∂x_l = (∂E/∂x_L) · (∂x_L/∂x_l) = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth backward propagation

  ∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

• Any ∂E/∂x_L is directly back-propagated to any ∂E/∂x_l, plus residual.
• Any ∂E/∂x_l is additive; unlikely to vanish
• in contrast to multiplicative: ∂E/∂x_l = (∏_{i=l}^{L-1} W_i) · ∂E/∂x_L

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Residual for every layer

  forward:  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)
  backward: ∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

Enabled by:
• shortcut mapping: h = identity
• after-add mapping: f = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiments
• Set 1: what if shortcut mapping ℎ ≠ identity

• Set 2: what if after-add mapping 𝑓 is identity

• Experiments on ResNets with more than 100 layers


• deeper models suffer more from optimization difficulty
Experiment Set 1:
what if shortcut mapping h ≠ identity?
* ResNet-110 on CIFAR-10

  (a) original:              h(x) = x               error: 6.6%
  (b) constant scaling:      h(x) = 0.5x            error: 12.4%
  (c) exclusive gating*:     h(x) = gate · x        error: 8.7%
  (d) shortcut-only gating:  h(x) = gate · x        error: 12.9%
  (e) conv shortcut:         h(x) = conv(x)         error: 12.2%
  (f) dropout shortcut:      h(x) = dropout(x)      error: > 20%

*(c) is similar to the "Highway Network"
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
* ResNet-110 on CIFAR-10 (same six variants as above)

Shortcuts blocked by multiplications (scaling, gating, conv, dropout) all give much higher error than the identity shortcut.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
If h is multiplicative, e.g. h(x) = λx

  forward:  x_L = λ^(L−l) x_l + Σ_{i=l}^{L-1} F̂(x_i)
  backward: ∂E/∂x_l = (∂E/∂x_L) · (λ^(L−l) + ∂/∂x_l Σ_{i=l}^{L-1} F̂(x_i))

• if h is multiplicative, shortcuts are blocked
• direct propagation is decayed

*assuming f = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: training/test curves (solid: test, dashed: train) comparing a block whose shortcut h is gating (1x1 conv + sigmoid) with a block whose shortcut h is identity; the identity version trains better]

• gating should have better representation ability (identity is a special case), but
• optimization difficulty dominates the results
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiment Set 2:
what if after-add mapping f is identity?

Three block designs (x_l → … → x_{l+1}):
• f is ReLU (original ResNet):            weight → BN → ReLU → weight → BN → addition → ReLU
• f is BN + ReLU:                         weight → BN → ReLU → weight → addition → BN → ReLU
• f is identity (pre-activation ResNet):  BN → ReLU → weight → BN → ReLU → weight → addition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: training/test curves (solid: test, dashed: train) comparing f = ReLU with f = BN+ReLU; moving BN after the addition hurts both training and test error]

• BN could block propagation
• Keep the shortest path as smooth as possible
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
1001-layer ResNets on CIFAR-10

[Figure: training/test curves (solid: test, dashed: train) comparing f = ReLU (original) with f = identity (pre-activation)]

• ReLU could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
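[Illustration, not from the slides] A minimal sketch of the pre-activation unit ordering (BN → ReLU → weight, twice, then addition with f = identity), using plain matrices and a simplified train-mode BN; the function names and sizes are illustrative assumptions.

# Pre-activation residual unit: no after-add nonlinearity, so the shortcut
# path from x_l to x_{l+1} is a pure identity.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn(x, gamma=1.0, beta=0.0, eps=1e-5):
    # simplified per-feature batch norm over the batch dimension (train mode)
    return gamma * (x - x.mean(0)) / np.sqrt(x.var(0) + eps) + beta

def preact_block(x, W1, W2):
    """x: (batch, n). Returns x + F(x), where F is BN-ReLU-weight-BN-ReLU-weight."""
    f = relu(bn(x)) @ W1.T
    f = relu(bn(f)) @ W2.T
    return x + f            # f = identity: the shortcut path stays untouched

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((16, n))
W1 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
W2 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
y = preact_block(x, W1, W2)   # stacking many such blocks keeps the identity path clear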
Comparisons on CIFAR-10/100

CIFAR-10:
  method                                   error (%)
  NIN                                      8.81
  DSN                                      8.22
  FitNet                                   8.39
  Highway                                  7.72
  ResNet-110 (1.7M)                        6.61
  ResNet-1202 (19.4M)                      7.93
  ResNet-164, pre-activation (1.7M)        5.46
  ResNet-1001, pre-activation (10.2M)      4.92 (4.89 ±0.14)

CIFAR-100:
  method                                   error (%)
  NIN                                      35.68
  DSN                                      34.57
  FitNet                                   35.04
  Highway                                  32.39
  ResNet-164 (1.7M)                        25.16
  ResNet-1001 (10.2M)                      27.82
  ResNet-164, pre-activation (1.7M)        24.33
  ResNet-1001, pre-activation (10.2M)      22.71 (22.68 ±0.22)

*all based on moderate augmentation


Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
ImageNet Experiments
ImageNet single-crop (320x320) val error

  method                       data augmentation       top-1 error (%)   top-5 error (%)
  ResNet-152, original         scale                   21.3              5.5
  ResNet-152, pre-activation   scale                   21.1              5.5
  ResNet-200, original         scale                   21.8              6.0
  ResNet-200, pre-activation   scale                   20.7              5.3
  ResNet-200, pre-activation   scale + aspect ratio    20.1*             4.8*
*independently reproduced by:
https://github.com/facebook/fb.resnet.torch/tree/master/pretrained#notes
training code and models available.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Summary of observations
• Keep the shortest path as smooth as possible
  • by making h and f identity
  • forward/backward signals directly flow through this path
• Features of any layers are additive outcomes
• 1000-layer ResNets can be easily trained and have better accuracy

[Diagram: pre-activation block, x_l → BN → ReLU → weight → BN → ReLU → weight → addition → x_{l+1}]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Future Works
• Representation
  • skipping 1 layer vs. multiple layers?
  • Flat vs. Bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
  • Width vs. Depth [Zagoruyko & Komodakis 2016]
• Generalization
  • DropOut, MaxOut, DropConnect, …
  • Drop Layer (Stochastic Depth) [Huang et al 2016]
• Optimization
  • Without residual/shortcut?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Applications
"Features matter." (quote from [Girshick et al. 2014], the R-CNN paper)

  task                                   2nd-place winner   ResNets   margin (relative)
  ImageNet Localization (top-5 error)    12.0               9.0       27%
  ImageNet Detection (mAP@.5)            53.6               62.1      16% (absolute 8.5% better!)
  COCO Detection (mAP@.5:.95)            33.5               37.3      11%
  COCO Segmentation (mAP@.5:.95)         25.1               28.2      12%

• Our results are all based on ResNet-101
• Deeper features are well transferrable
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
Engines of visual recognition

[Figure: PASCAL VOC 2007 Object Detection mAP (%) vs. feature depth]
  HOG, DPM (shallow): 34    AlexNet (RCNN), 8 layers: 58
  VGG (RCNN), 16 layers: 66    ResNet (Faster RCNN)*, 101 layers: 86

*w/ other improvements & more data
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Learning for Computer Vision

[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data; its features are then fine-tuned on target data for downstream networks such as a detection network (e.g. R-CNN), a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]
Example: Object Detection

[Figure: detections marking "boat" and "person" in an image]

Image Classification (what?)    vs.    Object Detection (what + where?)
Object Detection: R-CNN
figure credit: R. Girshick et al.

[Figure: Region-based CNN pipeline, input image → ~2,000 region proposals → warped regions → 1 CNN for each region → classify regions (aeroplane? no. person? yes. tvmonitor? no.)]

Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014
Object Detection: R-CNN
• R-CNN

[Diagram: pre-computed Regions-of-Interest (RoIs) on the image; one CNN is run per RoI to produce per-region features; the per-region CNN is trained end-to-end, while the RoIs are pre-computed]
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014
Object Detection: Fast R-CNN
• Fast R-CNN

[Diagram: the image passes once through shared conv layers (one CNN); pre-computed Regions-of-Interest (RoIs) are projected onto the feature map and RoI pooling extracts per-region features; trained end-to-end]

Girshick. Fast R-CNN. ICCV 2015


Object Detection: Faster R-CNN
• Faster R-CNN
  • Solely based on CNN
  • No external modules
  • Each step is end-to-end

[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → features; trained end-to-end]

Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection

[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data, then fine-tuned as a detection network on detection data]

"plug-in" features:        independently developed detectors:
• AlexNet                  • R-CNN
• VGG-16                   • Fast R-CNN
• GoogleNet                • Faster R-CNN
• ResNet-101               • MultiBox
• …                        • SSD
                           • …
Object Detection

• Simply "Faster R-CNN + ResNet"

[Diagram: Faster R-CNN, image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → classifier]

COCO detection results:
  baseline       mAP@.5    mAP@.5:.95
  VGG-16         41.5      21.5
  ResNet-101     48.4      27.2

ResNet-101 has a 28% relative gain vs. VGG-16

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection
• RPN learns proposals by extremely deep nets
• We use only 300 proposals (no hand-designed proposals)

• Add components:
• Iterative localization
• Context modeling
• Multi-scale testing

• All are based on CNN features; all are end-to-end

• All benefit more from deeper features – cumulative gains!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
ResNet’s object detection result on COCO
*the original image is from the COCO dataset

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
this video is available online: https://youtu.be/WZmSMkK9VuA
Results on real video. Models trained on MS COCO (80 categories).
(frame-by-frame; no temporal processing)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
More Visual Recognition Tasks
ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation
• Human pose estimation [Newell et al 2016]
• Depth estimation [Laina et al 2016]
• Segment proposal [Pinheiro et al 2016]
• …

[Screenshots: ResNet-101 entries atop the PASCAL segmentation and PASCAL detection leaderboards]


Potential Applications
ResNets have shown outstanding or promising results on:
• Visual Recognition
• Image Generation (Pixel RNN, Neural Art, etc.)
• Natural Language Processing (Very deep CNN)
• Speech Recognition (preliminary results)
• Advertising, user prediction (preliminary results)
Conclusions of the Tutorial
• Deep Residual Learning:
• Ultra deep networks can be easy to train
• Ultra deep networks can gain accuracy from depth
• Ultra deep representations are well transferrable
• Now 200 layers on ImageNet and 1000 layers on CIFAR!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Resources
• Models and Code
• Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks

• Many available implementations


(list in https://github.com/KaimingHe/deep-residual-networks)
• Facebook AI Research’s Torch ResNet:
https://github.com/facebook/fb.resnet.torch
• Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
• Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
• Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
• Torch, MNIST, 100 layers: blog, code
• A winning entry in Kaggle's right whale recognition challenge: blog, code
• Neon, Place2 (mini), 40 layers: blog, code
• …
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
