Deep Residual Learning
MSRA @ ILSVRC & COCO 2015 competitions
Kaiming He
with Xiangyu Zhang, Shaoqing Ren, Jifeng Dai, & Jian Sun
Microsoft Research Asia (MSRA)
MSRA @ ILSVRC & COCO 2015 Competitions
• 1st places in all five main tracks
• ImageNet Classification: “Ultra-deep” (quote Yann) 152-layer nets
• ImageNet Detection: 16% better than 2nd
• ImageNet Localization: 27% better than 2nd
• COCO Detection: 11% better than 2nd
• COCO Segmentation: 12% better than 2nd
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
*improvements are relative numbers
Revolution of Depth
ImageNet Classification top-5 error (%):
ILSVRC'15  ResNet     (152 layers)  3.57
ILSVRC'14  GoogleNet  (22 layers)   6.7
ILSVRC'14  VGG        (19 layers)   7.3
ILSVRC'13             (8 layers)    11.7
ILSVRC'12  AlexNet    (8 layers)    16.4
ILSVRC'11             (shallow)     25.8
ILSVRC'10             (shallow)     28.2
Revolution of Depth
PASCAL VOC 2007 Object Detection mAP (%):
HOG, DPM               (shallow)     34
AlexNet (RCNN)         (8 layers)    58
VGG (RCNN)             (16 layers)   66
ResNet (Faster RCNN)*  (101 layers)  86
*w/ other improvements & more data
Engines of visual recognition
Revolution of Depth
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000
AlexNet, 8 layers
(ILSVRC 2012)
Revolution of Depth
11x11 conv, 96, /4, pool/2
5x5 conv, 256, pool/2
3x3 conv, 384
3x3 conv, 384
3x3 conv, 256, pool/2
fc, 4096
fc, 4096
fc, 1000
AlexNet, 8 layers
(ILSVRC 2012)
3x3 conv, 64
3x3 conv, 64, pool/2
3x3 conv, 128
3x3 conv, 128, pool/2
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512, pool/2
fc, 4096
fc, 4096
fc, 1000
VGG, 19 layers
(ILSVRC 2014)
[GoogleNet architecture diagram: input -> 7x7 conv /2 -> max pool -> LocalRespNorm -> 1x1 and 3x3 convs -> nine Inception modules (parallel 1x1, 3x3, 5x5 convs and 3x3 max pooling, depth-concatenated), with two auxiliary softmax classifiers along the way, then average pool -> FC -> softmax]
GoogleNet, 22 layers
(ILSVRC 2014)
AlexNet, 8 layers
(ILSVRC 2012)
Revolution of Depth
ResNet, 152 layers
(ILSVRC 2015)
(AlexNet, 8 layers, and VGG, 19 layers, shown to scale as on the previous slides.)
ResNet-152 layer schedule:
7x7 conv, 64, /2, pool/2
[1x1 conv, 64; 3x3 conv, 64; 1x1 conv, 256] x 3
[1x1 conv, 128; 3x3 conv, 128; 1x1 conv, 512] x 8 (first block /2)
[1x1 conv, 256; 3x3 conv, 256; 1x1 conv, 1024] x 36 (first block /2)
[1x1 conv, 512; 3x3 conv, 512; 1x1 conv, 2048] x 3 (first block /2)
ave pool, fc 1000
Revolution of Depth
ResNet, 152 layers
[(there was an animation here: four slides scrolled through the full 152-layer architecture listed above)]
Is learning better networks
as simple as stacking more layers?
Simply stacking layers?
[Plots: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4) for 20-layer and 56-layer plain nets; the 56-layer net is worse on both]
• Plain nets: stacking 3x3 conv layers…
• 56-layer net has higher training error and test error than 20-layer net
Simply stacking layers?
[Plots: error (%) vs. iter. (1e4); solid: test/val, dashed: train. Left, CIFAR-10 plain nets (plain-20/32/44/56): deeper is worse. Right, ImageNet-1000 plain nets (plain-18/34): the 34-layer net is worse than the 18-layer]
• "Overly deep" plain nets have higher training error
• A general phenomenon, observed in many datasets
a shallower model (18 layers):
7x7 conv, 64, /2
[3x3 conv, 64] x 4
[3x3 conv, 128] x 4 (first /2)
[3x3 conv, 256] x 4 (first /2)
[3x3 conv, 512] x 4 (first /2)
fc 1000

a deeper counterpart (34 layers), with "extra" layers:
7x7 conv, 64, /2
[3x3 conv, 64] x 6
[3x3 conv, 128] x 8 (first /2)
[3x3 conv, 256] x 12 (first /2)
[3x3 conv, 512] x 6 (first /2)
fc 1000
• A deeper model should not have higher training error
• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity (see the sketch below)
  • at least the same training error
• Optimization difficulties: solvers cannot find the solution when going deeper…
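To ground the construction argument, here is a small sketch (my PyTorch illustration, not from the slides): a 3x3 conv whose kernel is initialized as a Dirac delta is an exact identity map, so the deeper model starts out computing the same function as the shallower one.

```python
import torch
import torch.nn as nn

# An "extra" 3x3 conv layer initialized as an identity (Dirac delta kernel):
# appending such layers to a learned shallower model changes nothing, so the
# deeper model begins at the shallower model's training error.
extra = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
nn.init.dirac_(extra.weight)       # center tap = 1 per channel, all else 0

x = torch.relu(torch.randn(2, 64, 8, 8))   # post-ReLU (non-negative) activations
assert torch.allclose(torch.relu(extra(x)), x, atol=1e-6)  # still the identity
```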
Deep Residual Learning
• Plain net
[Diagram: any two stacked layers; 𝑥 -> weight layer -> relu -> weight layer -> relu -> 𝐻(𝑥)]
𝐻(𝑥) is any desired mapping;
hope the 2 weight layers fit 𝐻(𝑥)
Deep Residual Learning
• Residual net
𝐻(𝑥) is any desired mapping;
instead of hoping the 2 weight layers fit 𝐻(𝑥),
hope they fit 𝐹(𝑥), and let 𝐻(𝑥) = 𝐹(𝑥) + 𝑥
[Diagram: 𝑥 -> weight layer -> relu -> weight layer -> 𝐹(𝑥); an identity shortcut carries 𝑥 around both layers; output 𝐻(𝑥) = 𝐹(𝑥) + 𝑥, followed by relu; see the sketch below]
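As a concrete sketch of this block (my PyTorch illustration, not the authors' original code; placing BatchNorm after each conv follows the paper):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers learn the residual F(x); the identity shortcut
    adds x back, so the block outputs H(x) = F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        f = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x)
        return F.relu(f + x)                                       # H(x) = F(x) + x
```

If the weights drive 𝐹(𝑥) to zero, the block degenerates to the identity, which is exactly the easy case discussed next.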
Deep Residual Learning
• 𝐹(𝑥) is a residual mapping w.r.t. identity
• If identity were optimal, it is easy to set the weights to 0
• If the optimal mapping is closer to identity, it is easier to find the small fluctuations
[Same block diagram as above: identity shortcut with 𝐻(𝑥) = 𝐹(𝑥) + 𝑥]
Related Works – Residual Representations
• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
• Encoding residual vectors; powerful shallower representations.
• Product Quantization (IVF-ADC) [Jegou et al 2011]
• Quantizing residual vectors; efficient nearest-neighbor search.
• MultiGrid & Hierarchical Precondition [Briggs, et al 2000], [Szeliski 1990, 2006]
• Solving residual sub-problems; efficient PDE solvers.
plain net / ResNet, 34 layers each (same layer schedule; the ResNet adds an identity shortcut around every pair of 3x3 convs):
7x7 conv, 64, /2
pool, /2
[3x3 conv, 64] x 6
[3x3 conv, 128] x 8 (first /2)
[3x3 conv, 256] x 12 (first /2)
[3x3 conv, 512] x 6 (first /2)
avg pool, fc 1000
Network “Design”
• Keep it simple
• Our basic design (VGG-style)
• all 3x3 conv (almost)
• spatial size /2 => # filters x2 (see the sketch after this list)
• Simple design; just deep!
• Other remarks:
• no max pooling (almost)
• no hidden fc
• no dropout
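A hedged sketch (my illustration, not the authors' code) of how these rules compose in a downsampling block: when spatial size halves and filter count doubles, the shortcut no longer matches shapes, so a strided 1x1 projection stands in for the identity (the paper's option B).

```python
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Halve spatial size and double the channels; the shortcut becomes a
    strided 1x1 projection so it can be added to the block output."""
    def __init__(self, in_ch):
        super().__init__()
        out_ch = in_ch * 2                      # spatial size /2 => # filters x2
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False)  # projection shortcut

    def forward(self, x):
        out = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))
        return F.relu(out + self.proj(x))
```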
Training
• All plain/residual nets are trained from scratch
• All plain/residual nets use Batch Normalization
• Standard hyper-parameters & augmentation (sketched below)
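A rough sketch of what those standard settings look like (my illustration; the paper reports SGD with momentum 0.9, weight decay 1e-4, and a learning rate of 0.1 divided by 10 when the error plateaus; the milestones and epoch count below are hypothetical):

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = ResidualBlock(64)   # stand-in for a full plain/residual net (sketch above)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)  # lr /10 twice

for epoch in range(90):
    # ... one epoch over augmented data (e.g. random crop + horizontal flip) ...
    scheduler.step()
```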
CIFAR-10 experiments
[Plots: CIFAR-10 error (%) vs. iter. (1e4); solid: test, dashed: train. Left, plain nets (plain-20/32/44/56): deeper is worse. Right, ResNets (ResNet-20/32/44/56/110): deeper is better]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
ImageNet experiments
[Plots: ImageNet error (%) vs. iter. (1e4); solid: test, dashed: train. Left, plain nets (plain-18/34): the 34-layer net is worse than the 18-layer. Right, ResNets (ResNet-18/34): the 34-layer net is better]
• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error
ImageNet experiments
• A practical design of going deeper
[Diagram: two residual block designs of similar complexity.
all-3x3, on a 64-d input: 3x3, 64 -> relu -> 3x3, 64
bottleneck, on a 256-d input (for ResNet-50/101/152): 1x1, 64 -> relu -> 3x3, 64 -> relu -> 1x1, 256
(sketched below)]
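A minimal sketch of the bottleneck block (my PyTorch illustration; channel sizes follow the 256-d example above): the 1x1 layers reduce and restore the width so the 3x3 conv runs on only 64 channels.

```python
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore; the 3x3 conv runs on 64 channels,
    keeping cost similar to the all-3x3 block on a 64-d input."""
    def __init__(self, channels=256, reduced=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, reduced, 1, bias=False)            # 1x1, 64
        self.bn1 = nn.BatchNorm2d(reduced)
        self.conv2 = nn.Conv2d(reduced, reduced, 3, padding=1, bias=False)  # 3x3, 64
        self.bn2 = nn.BatchNorm2d(reduced)
        self.conv3 = nn.Conv2d(reduced, channels, 1, bias=False)            # 1x1, 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + x)   # identity shortcut on the 256-d path
```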
ImageNet experiments
10-crop testing, top-5 val error (%):
ResNet-34   7.4
ResNet-50   6.7
ResNet-101  6.1
ResNet-152  5.7
(ResNet-152 still has lower time complexity than VGG-16/19)
• Deeper ResNets have lower error
ImageNet experiments
[Same chart as the opening slide: ImageNet Classification top-5 error (%) falling from 28.2 (ILSVRC'10, shallow) to 3.57 (ILSVRC'15, 152-layer ResNet)]
Just classification?
A treasure from ImageNet is the learned features.
“Features matter.” (quote [Girshick et al. 2014], the R-CNN paper)
task                                 2nd-place winner   MSRA   margin (relative)
ImageNet Localization (top-5 error)  12.0               9.0    27%
ImageNet Detection (mAP@.5)          53.6               62.1   16%  (absolute 8.5% better!)
COCO Detection (mAP@.5:.95)          33.5               37.3   11%
COCO Segmentation (mAP@.5:.95)       25.1               28.2   12%
• Our results are all based on ResNet-101
• Our features transfer well
Object Detection (brief)
• Simply “Faster R-CNN + ResNet”
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
[Faster R-CNN pipeline: image -> CNN -> feature map -> Region Proposal Net -> proposals; RoI pooling over the shared feature map feeds the classifier (sketched below)]
COCO detection results:
baseline     mAP@.5   mAP@.5:.95
VGG-16       41.5     21.5
ResNet-101   48.4     27.2
(ResNet has 28% relative gain)
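To make the RoI pooling step concrete, a small sketch using torchvision's off-the-shelf op (my illustration, not the original implementation; the 1/16 scale assumes stride-16 features):

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 1024, 50, 50)   # shared CNN features for one image
# one proposal per row: (batch_index, x1, y1, x2, y2) in input-image coordinates
proposals = torch.tensor([[0.0, 10.0, 20.0, 200.0, 180.0]])
# pool each proposal into a fixed 7x7 grid for the downstream classifier
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1 / 16)
print(pooled.shape)   # torch.Size([1, 1024, 7, 7])
```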
Object Detection (brief)
• RPN learns proposals with extremely deep nets
• We use only 300 proposals (no SS/EB/MCG!)
• Add what is just missing in Faster R-CNN…
• Iterative localization
• Context modeling
• Multi-scale testing
• All are based on CNN features; all are end-to-end (train and/or inference)
• All benefit more from deeper features – cumulative gains!
Our results on COCO – too many objects, let’s check carefully!
[Three slides of detection result images followed; the original images are from the COCO dataset]
Jifeng Dai, Kaiming He, & Jian Sun. “Instance-aware Semantic Segmentation via Multi-task Network Cascades”. arXiv 2015.
Instance Segmentation (brief)
[Pipeline: image -> CONVs -> conv feature map; for each RoI: RoI warping + pooling -> FCs -> box instances (RoIs); for each RoI: masking -> FCs -> mask instances; CONVs -> categorized instances (person, person, person, horse)]
• Solely CNN-based (“features matter”)
• Differentiable RoI warping layer (w.r.t. box coordinates; see the sketch below)
• Multi-task cascades, exact end-to-end training
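A hedged sketch of what "differentiable w.r.t. box coordinates" can look like (my bilinear-sampling illustration, not the MNC implementation): the box parameterizes an affine sampling grid, so gradients flow back into the box itself.

```python
import torch
import torch.nn.functional as F

def roi_warp(feature, box, out_size=(14, 14)):
    """Differentiably crop-and-resize `feature` (1,C,H,W) to `box`,
    given as (x1, y1, x2, y2) in normalized [-1, 1] coordinates."""
    x1, y1, x2, y2 = box
    zero = torch.zeros(())
    # affine transform mapping the output grid onto the box region
    theta = torch.stack([
        torch.stack([(x2 - x1) / 2, zero, (x1 + x2) / 2]),
        torch.stack([zero, (y2 - y1) / 2, (y1 + y2) / 2]),
    ]).unsqueeze(0)
    grid = F.affine_grid(theta, size=(1, feature.shape[1], *out_size),
                         align_corners=False)
    return F.grid_sample(feature, grid, align_corners=False)

feature = torch.randn(1, 256, 32, 32)
box = torch.tensor([-0.5, -0.5, 0.5, 0.5], requires_grad=True)
roi_warp(feature, box).sum().backward()
print(box.grad is not None)   # True: gradients reach the box coordinates
```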
[Segmentation result image; the original image is from the COCO dataset]
Conclusions
• Deeper is still better
• “Features matter”!
• Faster R-CNN is just amazing
Kaiming He Xiangyu Zhang Shaoqing Ren
Jifeng Dai Jian Sun
MSRA team
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. “Deep Residual Learning for Image Recognition”. arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. NIPS 2015.
Jifeng Dai, Kaiming He, & Jian Sun. “Instance-aware Semantic Segmentation via Multi-task Network Cascades”. arXiv 2015.