
Deep Residual Networks
Deep Learning Gets Way Deeper

Kaiming He
Facebook AI Research*

ICML 2016 tutorial
8:30-10:30am, June 19

*as of July 2016. Formerly affiliated with Microsoft Research Asia

[Title-slide background figure: the full 152-layer ResNet architecture: 7x7 conv, 64, /2, pool /2, followed by stacked 1x1/3x3/1x1 bottleneck blocks (64/256, 128/512, 256/1024, and 512/2048 channels), ending with average pooling and a 1000-way fc layer]

Overview
• Introduction
• Background
• From shallow to deep
• Deep Residual Networks
• From 10 layers to 100 layers
• From 100 layers to 1000 layers
• Applications
• Q&A
Introduction
Deep Residual Networks (ResNets)
• “Deep Residual Learning for Image Recognition”. CVPR 2016 (next week)

• A simple and clean framework for training “very” deep nets

• State-of-the-art performance for


• Image classification
• Object detection
• Semantic segmentation
• and more…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNets @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
• ImageNet Classification: “Ultra-deep” 152-layer nets
• ImageNet Detection: 16% better than 2nd
• ImageNet Localization: 27% better than 2nd
• COCO Detection: 11% better than 2nd
• COCO Segmentation: 12% better than 2nd

*improvements are relative numbers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth

[Figure: ImageNet Classification top-5 error (%) across ILSVRC years]
  ILSVRC'10 (shallow): 28.2    ILSVRC'11 (shallow): 25.8
  ILSVRC'12 AlexNet, 8 layers: 16.4    ILSVRC'13, 8 layers: 11.7
  ILSVRC'14 VGG, 19 layers: 7.3    ILSVRC'14 GoogleNet, 22 layers: 6.7
  ILSVRC'15 ResNet, 152 layers: 3.57


Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012):
  11x11 conv, 96, /4, pool/2
  5x5 conv, 256, pool/2
  3x3 conv, 384
  3x3 conv, 384
  3x3 conv, 256, pool/2
  fc, 4096
  fc, 4096
  fc, 1000
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012)    VGG, 19 layers (ILSVRC 2014)    GoogleNet, 22 layers (ILSVRC 2014)

[Figure: side-by-side architecture diagrams of AlexNet, VGG-19, and GoogleNet (with its inception modules), illustrating the growth in depth]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
AlexNet, 8 layers (ILSVRC 2012)    VGG, 19 layers (ILSVRC 2014)    ResNet, 152 layers (ILSVRC 2015)

[Figure: side-by-side architecture diagrams of AlexNet, VGG-19, and the 152-layer ResNet: 7x7 conv, 64, /2, pool /2, followed by stacked 1x1/3x3/1x1 bottleneck blocks, ending with average pooling and a 1000-way fc layer]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
Engines of visual recognition

[Figure: PASCAL VOC 2007 Object Detection mAP (%) vs. feature depth]
  HOG, DPM (shallow): 34    AlexNet (RCNN), 8 layers: 58
  VGG (RCNN), 16 layers: 66    ResNet (Faster RCNN)*, 101 layers: 86

*w/ other improvements & more data
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ResNet’s object detection result on COCO
*the original image is from the COCO dataset

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Very simple, easy to follow
• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
• Facebook AI Research’s Torch ResNet: https://github.com/facebook/fb.resnet.torch
• Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
• Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
• Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
• Torch, MNIST, 100 layers: blog, code
• A winning entry in Kaggle's right whale recognition challenge: blog, code
• Neon, Place2 (mini), 40 layers: blog, code
• …

• Easily reproduced results (e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)


• A series of extensions and follow-ups
• > 200 citations in the 6 months after being posted on arXiv (Dec. 2015)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Background
From shallow to deep

Traditional recognition (shallower to deeper):
  pixels → classifier → "bus"?
  pixels → edges → classifier → "bus"?
  pixels → edges → histogram (SIFT/HOG) → classifier → "bus"?
  pixels → edges → histogram → K-means / sparse code → classifier → "bus"?
  Specialized components, domain knowledge required

But what's next? Deep Learning:
  Generic components ("layers"), less domain knowledge
  Repeat elementary layers => going deeper
  • End-to-end learning
  • Richer solution space
Spectrum of Depth
5 layers: easy
>10 layers: initialization, Batch Normalization
>30 layers: skip connections
>100 layers: identity skip connections
>1000 layers: ?

(shallower → deeper)
Initialization

Setup: a weight layer W maps input X (fan-in n_in) to output Y = WX (fan-out n_out).

If:
• Linear activation
• x, y, w: independent
Then:
1-layer:
  Var[y] = (n_in · Var[w]) · Var[x]
Multi-layer:
  Var[y] = (∏_l n_l^in · Var[w_l]) · Var[x]
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Initialization
Both forward (response) and backward (gradient) signals can vanish/explode

Forward:
  Var[y] = (∏_l n_l^in · Var[w_l]) · Var[x]
Backward:
  Var[∂/∂x] = (∏_l n_l^out · Var[w_l]) · Var[∂/∂y]

[Figure: signal magnitude vs. depth (1 to 15 layers), showing exploding, ideal, and vanishing regimes]
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
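[Illustration, not from the slides] A tiny NumPy sketch of this behavior: each linear layer multiplies the signal variance by roughly n·Var[w], so the product over layers explodes or vanishes unless it is kept near 1. The layer sizes and depth below are arbitrary example values.

# Illustrative sketch: forward signal variance in a deep linear net for three
# choices of n*Var[w], matching Var[y] = (prod_l n_l Var[w_l]) Var[x].
import numpy as np

def signal_variance(depth=15, n=256, scale=1.0, seed=0):
    """Propagate a random input through `depth` linear layers whose weights
    have Var[w] = scale / n, and return the output variance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * np.sqrt(scale / n)  # n * Var[w] = scale
        x = W @ x
    return x.var()

for scale in (0.5, 1.0, 2.0):  # n*Var[w] < 1: vanishing; = 1: ideal; > 1: exploding
    print(f"n*Var[w] = {scale}: output variance ~ {signal_variance(scale=scale):.3e}")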
Initialization
• Initialization under linear assumption

  ∏_l n_l^in · Var[w_l] = const_fwd   (healthy forward)
  and
  ∏_l n_l^out · Var[w_l] = const_bwd   (healthy backward)

A sufficient condition is either
  n_l^in · Var[w_l] = 1
or*
  n_l^out · Var[w_l] = 1
(the "Xavier" init in Caffe).

*: n_l^out = n_{l+1}^in, so const_fwd / const_bwd = n_1^in / n_L^out < ∞. It is sufficient to use either form.
LeCun et al 1998 “Efficient Backprop”
Glorot & Bengio 2010 “Understanding the difficulty of training deep feedforward neural networks”
Initialization
• Initialization under ReLU activation

  ∏_l (1/2) n_l^in · Var[w_l] = const_fwd   (healthy forward)
  and
  ∏_l (1/2) n_l^out · Var[w_l] = const_bwd   (healthy backward)

A sufficient condition is either
  (1/2) n_l^in · Var[w_l] = 1
or
  (1/2) n_l^out · Var[w_l] = 1
(the "MSRA" init in Caffe).

With D layers, a factor of 2 per layer has an exponential impact of 2^D.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
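[Illustration, not from the slides] A minimal NumPy sketch of the two fan-in conditions above; the function names and the example layer shape are illustrative assumptions, not part of the tutorial.

# "Xavier" (linear) vs. "MSRA" (ReLU) initialization, fan-in form:
# Var[w] = 1 / n_in  and  Var[w] = 2 / n_in respectively.
import numpy as np

def xavier_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    # n_in * Var[w] = 1  (healthy forward prop under a linear activation)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)

def msra_init(n_in, n_out, seed=0):
    rng = np.random.default_rng(seed)
    # (1/2) * n_in * Var[w] = 1  (healthy forward prop under ReLU)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

# Example: a 3x3 conv with 64 input channels has fan-in n_in = 3*3*64
W = msra_init(n_in=3 * 3 * 64, n_out=64)
print(W.std())  # ~ sqrt(2 / 576) ~ 0.059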
Initialization

[Figure: training curves at the beginning of training, comparing (1/2) n Var[w] = 1 (ours, MSRA) with n Var[w] = 1 (Xavier)]
• 22-layer ReLU net: good init converges faster
• 30-layer ReLU net: good init is able to converge

*Figures show the beginning of training

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. ICCV 2015.
Batch Normalization (BN)
• Normalizing input (LeCun et al 1998 “Efficient Backprop”)

• BN: normalizing each layer, for each mini-batch

• Greatly accelerate training

• Less sensitive to initialization

• Improve regularization

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Batch Normalization (BN)

  x̂ = (x − μ) / σ,    y = γ x̂ + β

• μ: mean of x in mini-batch
• σ: std of x in mini-batch
• γ: scale
• β: shift

• μ, σ: functions of x, analogous to responses
• γ, β: parameters to be learned, analogous to weights

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Batch Normalization (BN)

  x̂ = (x − μ) / σ,    y = γ x̂ + β

2 modes of BN:
• Train mode:
  • μ, σ are functions of x; backprop gradients
• Test mode:
  • μ, σ are pre-computed* on the training set

Caution: make sure your BN is in the correct mode

*: by running average, or by post-processing after training

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
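[Illustration, not from the slides] A minimal sketch of the two BN modes; the class name, the 1-D layout, and the momentum value are illustrative assumptions rather than details from the tutorial.

# Train mode uses mini-batch statistics; test mode uses statistics
# pre-computed on the training set (here, a running average).
import numpy as np

class BatchNorm1D:
    def __init__(self, num_features, eps=1e-5, momentum=0.9):
        self.gamma = np.ones(num_features)    # scale, learned
        self.beta = np.zeros(num_features)    # shift, learned
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)        # mini-batch statistics
            self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
            self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
        else:
            mu, var = self.running_mean, self.running_var  # pre-computed statistics
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
out = bn(np.random.randn(32, 4), training=True)   # make sure the mode is correct!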
Batch Normalization (BN)

[Figure: accuracy vs. iteration with and without BN; the BN curve rises faster and higher. Figure taken from [S. Ioffe & C. Szegedy]]

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015
Deep Residual Networks
From 10 layers to 100 layers
Going Deeper
• Initialization algorithms ✓
• Batch Normalization ✓

• Is learning better networks as simple as stacking more layers?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?

[Figure: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4) for plain 20-layer and 56-layer nets; the 56-layer curves sit above the 20-layer curves in both plots]

• Plain nets: stacking 3x3 conv layers…


• 56-layer net has higher training error and test error than 20-layer net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Simply stacking layers?

[Figure: error (%) vs. iter. (1e4); left: CIFAR-10 plain-20/32/44/56, right: ImageNet-1000 plain-18/34; solid: test/val, dashed: train. Deeper plain nets show higher error throughout]

• “Overly deep” plain nets have higher training error


• A general phenomenon, observed in many datasets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
[Figure: a shallower model (18 layers) and a deeper counterpart (34 layers); the deeper model contains the shallower model's layers plus "extra" layers]

• Richer solution space

• A deeper model should not have higher training error

• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity
  • at least the same training error

• Optimization difficulties: solvers cannot find the solution when going deeper…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Plain net

[Diagram: x → weight layer → relu → weight layer → relu → H(x); any two stacked layers]

H(x) is any desired mapping; hope the 2 weight layers fit H(x)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• Residual net

[Diagram: x → weight layer → relu → weight layer, computing F(x); an identity shortcut carries x around them; the sum passes through relu]

H(x) is any desired mapping;
instead of hoping the 2 weight layers fit H(x), hope they fit F(x),
and let H(x) = F(x) + x

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Residual Learning
• F(x) is a residual mapping w.r.t. identity

• If identity were optimal, it is easy to set the weights to 0

• If the optimal mapping is closer to identity, it is easier to find the small fluctuations

[Diagram: residual block computing H(x) = F(x) + x]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
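[Illustration, not from the slides] A minimal sketch of such a residual block, written with plain matrices instead of the conv layers in the slide. It also shows why identity is easy to represent: zeroing the residual branch's last weight layer makes the block (up to the after-add ReLU) an identity.

# Residual block: two weight layers fit F(x), output is relu(F(x) + x).
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """x: (n,) input; W1, W2: (n, n) weight layers. Returns relu(F(x) + x)."""
    f = W2 @ relu(W1 @ x)   # F(x): weight -> relu -> weight
    return relu(f + x)      # H(x) = F(x) + x, then the after-add ReLU

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
W1 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # MSRA-style init
W2 = np.zeros((n, n))                                # F(x) = 0 => block acts as identity
print(np.allclose(residual_block(x, W1, W2), relu(x)))  # True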
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
• Encoding residual vectors; powerful shallower representations.

• Product Quantization (IVF-ADC) [Jegou et al 2011]


• Quantizing residual vectors; efficient nearest-neighbor search.

• MultiGrid & Hierarchical Precondition [Briggs, et al 2000], [Szeliski 1990, 2006]


• Solving residual sub-problems; efficient PDE solvers.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Network "Design"

[Figure: plain net vs. ResNet; the same 34-layer VGG-style stack of 3x3 conv layers, with identity shortcuts added in the ResNet]

• Keep it simple

• Our basic design (VGG-style)
  • all 3x3 conv (almost)
  • spatial size /2 => # filters x2 (~same complexity per layer)
  • Simple design; just deep!

• Other remarks:
  • no hidden fc
  • no dropout

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Training
• All plain/residual nets are trained from scratch

• All plain/residual nets use Batch Normalization

• Standard hyper-parameters & augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
CIFAR-10 experiments

[Figure: error (%) vs. iter. (1e4); left: CIFAR-10 plain nets (plain-20/32/44/56), where deeper is worse; right: CIFAR-10 ResNets (ResNet-20/32/44/56/110), where deeper is better. Solid: test, dashed: train]

• Deep ResNets can be trained without difficulties


• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments

[Figure: error (%) vs. iter. (1e4); left: ImageNet plain nets (plain-18/34), where the 34-layer net is worse; right: ImageNet ResNets (ResNet-18/34), where the 34-layer net is better. Solid: test, dashed: train]

• Deep ResNets can be trained without difficulties


• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments
• A practical design for going deeper

[Diagram: two residual blocks of similar complexity]
  all-3x3 (on a 64-d feature):                            3x3, 64 → relu → 3x3, 64 → relu
  bottleneck (on a 256-d feature), for ResNet-50/101/152: 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256 → relu

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
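[Illustration, not from the slides] A quick back-of-the-envelope parameter count for the two blocks above, showing why the bottleneck keeps similar cost while exposing a 4x wider (256-d) feature; the helper name is an illustrative choice.

# Compare weight counts of the all-3x3 block and the bottleneck block.
def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out  # weights of a k x k conv (biases ignored)

# all-3x3 block on a 64-d feature: 3x3,64 -> 3x3,64
all_3x3 = conv_params(3, 64, 64) + conv_params(3, 64, 64)

# bottleneck block on a 256-d feature: 1x1,64 -> 3x3,64 -> 1x1,256
bottleneck = conv_params(1, 256, 64) + conv_params(3, 64, 64) + conv_params(1, 64, 256)

print(all_3x3, bottleneck)  # 73728 vs 69632: roughly the same cost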
ImageNet experiments
• Deeper ResNets have lower error

[Figure: 10-crop testing, top-5 val error (%)]
  ResNet-34: 7.4    ResNet-50: 6.7    ResNet-101: 6.1    ResNet-152: 5.7
  (ResNet-152 has lower time complexity than VGG-16/19)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
ImageNet experiments

[Figure: ImageNet Classification top-5 error (%) across ILSVRC years]
  ILSVRC'10 (shallow): 28.2    ILSVRC'11 (shallow): 25.8
  ILSVRC'12 AlexNet, 8 layers: 16.4    ILSVRC'13, 8 layers: 11.7
  ILSVRC'14 VGG, 19 layers: 7.3    ILSVRC'14 GoogleNet, 22 layers: 6.7
  ILSVRC'15 ResNet, 152 layers: 3.57


Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Discussions
Representation, Optimization, Generalization
Issues on learning deep models
• Representation ability
  • Ability of the model to fit training data, if the optimum could be found
  • If model A's solution space is a superset of B's, A should be better
• Optimization ability
  • Feasibility of finding an optimum
  • Not all models are equally easy to optimize
• Generalization ability
  • Once training data is fit, how good is the test performance
How do ResNets address these issues?
• Representation ability
  • No explicit advantage on representation (only re-parameterization), but
  • Allow models to go deeper
• Optimization ability
  • Enable very smooth forward/backward prop
  • Greatly ease optimizing deeper models
• Generalization ability
  • Do not explicitly address generalization, but
  • Deeper + thinner is good for generalization
On the Importance of Identity Mapping
From 100 layers to 1000 layers

On identity mappings for optimization

[Diagram: x_l feeds two weight layers computing F(x_l), plus a shortcut h(x_l); the sum passes through f]
• shortcut mapping: h = identity
• after-add mapping: f = ReLU

  x_{l+1} = f(h(x_l) + F(x_l))

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
On identity mappings for optimization

[Diagram: x_l feeds two weight layers computing F(x_l), plus a shortcut h(x_l); the sum passes through f]
• shortcut mapping: h = identity
• after-add mapping: f = ReLU
• What if f = identity?

  x_{l+1} = f(h(x_l) + F(x_l))

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation

  x_{l+1} = x_l + F(x_l)
  x_{l+2} = x_{l+1} + F(x_{l+1})
          = x_l + F(x_l) + F(x_{l+1})
  …
  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth forward propagation

  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

• Any x_l is directly forward-propagated to any x_L, plus residual.
• Any x_L is an additive outcome.
• in contrast to multiplicative: x_L = (∏_{i=l}^{L-1} W_i) x_l

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth backward propagation

  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)

  ∂E/∂x_l = (∂E/∂x_L) · (∂x_L/∂x_l) = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Very smooth backward propagation

  ∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

• Any ∂E/∂x_L is directly back-propagated to any ∂E/∂x_l, plus residual.
• Any ∂E/∂x_l is additive; unlikely to vanish
• in contrast to multiplicative: ∂E/∂x_l = (∏_{i=l}^{L-1} W_i) · ∂E/∂x_L

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Residual for every layer

  forward:  x_L = x_l + Σ_{i=l}^{L-1} F(x_i)
  backward: ∂E/∂x_l = (∂E/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i))

Enabled by:
• shortcut mapping: h = identity
• after-add mapping: f = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiments
• Set 1: what if shortcut mapping ℎ ≠ identity

• Set 2: what if after-add mapping 𝑓 is identity

• Experiments on ResNets with more than 100 layers


• deeper models suffer more from optimization difficulty
Experiment Set 1:
what if shortcut mapping h ≠ identity?
* ResNet-110 on CIFAR-10

  (a) original:              h(x) = x               error: 6.6%
  (b) constant scaling:      h(x) = 0.5x            error: 12.4%
  (c) exclusive gating*:     h(x) = gate · x        error: 8.7%
  (d) shortcut-only gating:  h(x) = gate · x        error: 12.9%
  (e) conv shortcut:         h(x) = conv(x)         error: 12.2%
  (f) dropout shortcut:      h(x) = dropout(x)      error: > 20%

*(c) is similar to the "Highway Network"
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
* ResNet-110 on CIFAR-10 (same six variants as above)

Shortcuts blocked by multiplications (scaling, gating, conv, dropout) all give much higher error than the identity shortcut.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
If h is multiplicative, e.g. h(x) = λx

  forward:  x_L = λ^(L−l) x_l + Σ_{i=l}^{L-1} F̂(x_i)
  backward: ∂E/∂x_l = (∂E/∂x_L) · (λ^(L−l) + ∂/∂x_l Σ_{i=l}^{L-1} F̂(x_i))

• if h is multiplicative, shortcuts are blocked
• direct propagation is decayed

*assuming f = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: training/test curves (solid: test, dashed: train) comparing a block whose shortcut h is gating (1x1 conv + sigmoid) with a block whose shortcut h is identity; the identity version trains better]

• gating should have better representation ability (identity is a special case), but
• optimization difficulty dominates the results
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Experiment Set 2:
what if after-add mapping f is identity?

Three block designs (x_l → … → x_{l+1}):
• f is ReLU (original ResNet):            weight → BN → ReLU → weight → BN → addition → ReLU
• f is BN + ReLU:                         weight → BN → ReLU → weight → addition → BN → ReLU
• f is identity (pre-activation ResNet):  BN → ReLU → weight → BN → ReLU → weight → addition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
[Figure: training/test curves (solid: test, dashed: train) comparing f = ReLU with f = BN+ReLU; moving BN after the addition hurts both training and test error]

• BN could block propagation
• Keep the shortest path as smooth as possible
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
1001-layer ResNets on CIFAR-10

[Figure: training/test curves (solid: test, dashed: train) comparing f = ReLU (original) with f = identity (pre-activation)]

• ReLU could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
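[Illustration, not from the slides] A minimal sketch of the pre-activation unit ordering (BN → ReLU → weight, twice, then addition with f = identity), using plain matrices and a simplified train-mode BN; the function names and sizes are illustrative assumptions.

# Pre-activation residual unit: no after-add nonlinearity, so the shortcut
# path from x_l to x_{l+1} is a pure identity.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def bn(x, gamma=1.0, beta=0.0, eps=1e-5):
    # simplified per-feature batch norm over the batch dimension (train mode)
    return gamma * (x - x.mean(0)) / np.sqrt(x.var(0) + eps) + beta

def preact_block(x, W1, W2):
    """x: (batch, n). Returns x + F(x), where F is BN-ReLU-weight-BN-ReLU-weight."""
    f = relu(bn(x)) @ W1.T
    f = relu(bn(f)) @ W2.T
    return x + f            # f = identity: the shortcut path stays untouched

n = 8
rng = np.random.default_rng(0)
x = rng.standard_normal((16, n))
W1 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
W2 = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
y = preact_block(x, W1, W2)   # stacking many such blocks keeps the identity path clear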
Comparisons on CIFAR-10/100

CIFAR-10:
  method                                   error (%)
  NIN                                      8.81
  DSN                                      8.22
  FitNet                                   8.39
  Highway                                  7.72
  ResNet-110 (1.7M)                        6.61
  ResNet-1202 (19.4M)                      7.93
  ResNet-164, pre-activation (1.7M)        5.46
  ResNet-1001, pre-activation (10.2M)      4.92 (4.89 ±0.14)

CIFAR-100:
  method                                   error (%)
  NIN                                      35.68
  DSN                                      34.57
  FitNet                                   35.04
  Highway                                  32.39
  ResNet-164 (1.7M)                        25.16
  ResNet-1001 (10.2M)                      27.82
  ResNet-164, pre-activation (1.7M)        24.33
  ResNet-1001, pre-activation (10.2M)      22.71 (22.68 ±0.22)

*all based on moderate augmentation


Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
ImageNet Experiments
ImageNet single-crop (320x320) val error

  method                       data augmentation       top-1 error (%)   top-5 error (%)
  ResNet-152, original         scale                   21.3              5.5
  ResNet-152, pre-activation   scale                   21.1              5.5
  ResNet-200, original         scale                   21.8              6.0
  ResNet-200, pre-activation   scale                   20.7              5.3
  ResNet-200, pre-activation   scale + aspect ratio    20.1*             4.8*
*independently reproduced by:
https://github.com/facebook/fb.resnet.torch/tree/master/pretrained#notes
training code and models available.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Summary of observations
• Keep the shortest path as smooth as possible
  • by making h and f identity
  • forward/backward signals directly flow through this path
• Features of any layers are additive outcomes
• 1000-layer ResNets can be easily trained and have better accuracy

[Diagram: pre-activation block, x_l → BN → ReLU → weight → BN → ReLU → weight → addition → x_{l+1}]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Future Works
• Representation
  • skipping 1 layer vs. multiple layers?
  • Flat vs. Bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
  • Width vs. Depth [Zagoruyko & Komodakis 2016]
• Generalization
  • DropOut, MaxOut, DropConnect, …
  • Drop Layer (Stochastic Depth) [Huang et al 2016]
• Optimization
  • Without residual/shortcut?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Applications
"Features matter." (quote from [Girshick et al. 2014], the R-CNN paper)

  task                                   2nd-place winner   ResNets   margin (relative)
  ImageNet Localization (top-5 error)    12.0               9.0       27%
  ImageNet Detection (mAP@.5)            53.6               62.1      16% (absolute 8.5% better!)
  COCO Detection (mAP@.5:.95)            33.5               37.3      11%
  COCO Segmentation (mAP@.5:.95)         25.1               28.2      12%

• Our results are all based on ResNet-101
• Deeper features are well transferrable
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Revolution of Depth
Engines of visual recognition

[Figure: PASCAL VOC 2007 Object Detection mAP (%) vs. feature depth]
  HOG, DPM (shallow): 34    AlexNet (RCNN), 8 layers: 58
  VGG (RCNN), 16 layers: 66    ResNet (Faster RCNN)*, 101 layers: 86

*w/ other improvements & more data
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Deep Learning for Computer Vision

[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data; its features are then fine-tuned on target data for downstream networks such as a detection network (e.g. R-CNN), a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]
Example: Object Detection

[Figure: detections marking "boat" and "person" in an image]

Image Classification (what?)    vs.    Object Detection (what + where?)
Object Detection: R-CNN
figure credit: R. Girshick et al.

[Figure: Region-based CNN pipeline, input image → ~2,000 region proposals → warped regions → 1 CNN for each region → classify regions (aeroplane? no. person? yes. tvmonitor? no.)]

Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014
Object Detection: R-CNN
• R-CNN

[Diagram: pre-computed Regions-of-Interest (RoIs) on the image; one CNN is run per RoI to produce per-region features; the per-region CNN is trained end-to-end, while the RoIs are pre-computed]
Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014
Object Detection: Fast R-CNN
• Fast R-CNN

[Diagram: the image passes once through shared conv layers (one CNN); pre-computed Regions-of-Interest (RoIs) are projected onto the feature map and RoI pooling extracts per-region features; trained end-to-end]

Girshick. Fast R-CNN. ICCV 2015


Object Detection: Faster R-CNN
• Faster R-CNN
  • Solely based on CNN
  • No external modules
  • Each step is end-to-end

[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → features; trained end-to-end]

Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection

[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data, then fine-tuned as a detection network on detection data]

"plug-in" features:        independently developed detectors:
• AlexNet                  • R-CNN
• VGG-16                   • Fast R-CNN
• GoogleNet                • Faster R-CNN
• ResNet-101               • MultiBox
• …                        • SSD
                           • …
Object Detection

• Simply "Faster R-CNN + ResNet"

[Diagram: Faster R-CNN, image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → classifier]

COCO detection results:
  baseline       mAP@.5    mAP@.5:.95
  VGG-16         41.5      21.5
  ResNet-101     48.4      27.2

ResNet-101 has a 28% relative gain vs. VGG-16

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Object Detection
• RPN learns proposals by extremely deep nets
• We use only 300 proposals (no hand-designed proposals)

• Add components:
• Iterative localization
• Context modeling
• Multi-scale testing

• All are based on CNN features; all are end-to-end

• All benefit more from deeper features – cumulative gains!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
ResNet’s object detection result on COCO
*the original image is from the COCO dataset

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
this video is available online: https://youtu.be/WZmSMkK9VuA
Results on real video. Models trained on MS COCO (80 categories).
(frame-by-frame; no temporal processing)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
More Visual Recognition Tasks
ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation
• Human pose estimation [Newell et al 2016]
• Depth estimation [Laina et al 2016]
• Segment proposal [Pinheiro et al 2016]
• …

[Screenshots: ResNet-101 entries atop the PASCAL segmentation and PASCAL detection leaderboards]


Potential Applications
ResNets have shown outstanding or promising results on:
• Visual Recognition
• Image Generation (Pixel RNN, Neural Art, etc.)
• Natural Language Processing (Very deep CNN)
• Speech Recognition (preliminary results)
• Advertising, user prediction (preliminary results)
Conclusions of the Tutorial
• Deep Residual Learning:
• Ultra deep networks can be easy to train
• Ultra deep networks can gain accuracy from depth
• Ultra deep representations are well transferrable
• Now 200 layers on ImageNet and 1000 layers on CIFAR!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
Resources
• Models and Code
• Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks

• Many available implementations


(list in https://github.com/KaimingHe/deep-residual-networks)
• Facebook AI Research’s Torch ResNet:
https://github.com/facebook/fb.resnet.torch
• Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
• Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
• Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
• Torch, MNIST, 100 layers: blog, code
• A winning entry in Kaggle's right whale recognition challenge: blog, code
• Neon, Place2 (mini), 40 layers: blog, code
• …
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Identity Mappings in Deep Residual Networks”. arXiv 2016.
