V05 SS24 DL CNNs Lecture2
Lecture SS 2024
Prof. Dr.-Ing. Rainer Stiefelhagen, Dr. Saquib Sarfraz, Dr. Simon Reiß
Maschinensehen für MMI, Institut für Anthropomatik & Robotik
Zentrum für digitale Barrierefreiheit und Assistive Technologien (ACCESS@KIT)
Institut für Anthropomatik und Robotik, Fakultät für Informatik
Deeper Networks
[Figure: VGG architecture, stacked conv3-512 / conv1-512 / maxpool blocks, followed by fully connected layers FC-4096, FC-4096, FC-1000 and a softmax output]
Top-5 error rate: 7.1%, compare against 12-14% for previous networks
K. Simonyan, A. Zisserman: Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015
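The key VGG design choice is stacking many small 3x3 convolutions instead of fewer large ones. A minimal PyTorch sketch of one such stage, assuming the 512-channel depth of the deepest VGG blocks (the layer count here is illustrative, not the exact VGG-16/19 configuration):

```python
import torch.nn as nn

# One VGG-style stage: repeated 3x3 convs (padding=1 keeps spatial size),
# each followed by ReLU, then a 2x2 maxpool that halves the resolution.
vgg_stage = nn.Sequential(
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(512, 512, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```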
CNN features off-the-shelf: DeCAF7 features (4096 dim) fed into a classifier (soft-max layer); object instance retrieval on 5 data sets!
A. Razavian et al.: CNN Features off-the-shelf: An Astounding Baseline for Recognition. DeepVision Workshop @ CVPR 2014
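Using a pretrained network as a fixed feature extractor is straightforward: drop the final classification layer and read out the 4096-dim penultimate activations. A sketch using torchvision's AlexNet as a stand-in for the DeCAF model of the paper (an assumption of this sketch, not the authors' exact setup):

```python
import torch
from torchvision import models

# Load a pretrained AlexNet and drop its final FC-1000 layer so the
# forward pass returns the 4096-dim penultimate features instead of logits.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
model.classifier = model.classifier[:-1]
model.eval()

x = torch.randn(1, 3, 224, 224)   # stand-in for one preprocessed image
with torch.no_grad():
    features = model(x)
print(features.shape)             # torch.Size([1, 4096])
```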
DeepFace
Y. Taigman et al., DeepFace: Closing the Gap to Human-Level Performance in Face Verification. CVPR 2014
Performance
- LFW (face image verification, restricted setting): accuracy 97.0% (single network); non-deep: 96.3%
- YTF (video face verification): accuracy 91.4%; non-deep: 79.7%
Deeper Networks
- 22 layers
- Efficient “Inception” module
- No FC layers
Naive Inception module
[Figure: input 28x28x256 processed in parallel by a 1x1 conv (128 filters), a 3x3 conv (192 filters), a 5x5 conv (96 filters) and a pooling path; the branch outputs (e.g. 28x28x128 from the 1x1 branch) are concatenated depth-wise]
Problem: computational complexity
Solution: “bottleneck” layers that use 1x1 convolutions to reduce feature depth
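The cost of the naive module can be checked with a one-line formula: each conv needs (output positions) x (filter volume) x (input depth) multiplies. A small sketch in plain Python, counting multiplies only:

```python
def conv_ops(h_out, w_out, c_out, k, c_in):
    """Multiply count of one conv layer: output positions x filter volume."""
    return h_out * w_out * c_out * k * k * c_in

# Naive Inception module on a 28x28x256 input:
naive = (conv_ops(28, 28, 128, 1, 256)    # 1x1 conv, 128 filters
         + conv_ops(28, 28, 192, 3, 256)  # 3x3 conv, 192 filters
         + conv_ops(28, 28, 96, 5, 256))  # 5x5 conv, 96 filters
print(f"{naive / 1e6:.0f}M ops")          # -> 854M ops
```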
1x1 CONV with 32 filters: preserves spatial dimensions, reduces depth!
1x1 conv “bottleneck” layers
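A quick shape check of what such a 1x1 bottleneck does, sketched in PyTorch (the 32-filter count follows the slide):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)           # NCHW: 28x28 spatial, depth 256
bottleneck = nn.Conv2d(256, 32, kernel_size=1)
print(bottleneck(x).shape)                # torch.Size([1, 32, 28, 28])
# spatial dimensions preserved (28x28), depth reduced from 256 to 32
```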
Conv ops:
[1x1 conv, 64]  28x28x64 x 1x1x256
[1x1 conv, 64]  28x28x64 x 1x1x256
[1x1 conv, 128] 28x28x128 x 1x1x256
[3x3 conv, 192] 28x28x192 x 3x3x64
[5x5 conv, 96]  28x28x96 x 5x5x64
[1x1 conv, 64]  28x28x64 x 1x1x256
Total: 358M ops, compared to 854M ops for the naive version.
A bottleneck can also reduce depth after the pooling layer.
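Putting the pieces together, a minimal PyTorch sketch of an Inception module with 1x1 bottlenecks, using the filter counts listed above (ReLUs and batch norm omitted for brevity; this illustrates the idea, not the exact GoogLeNet configuration):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, c_in=256):
        super().__init__()
        self.branch1 = nn.Conv2d(c_in, 128, 1)        # 1x1 conv, 128 filters
        self.branch3 = nn.Sequential(
            nn.Conv2d(c_in, 64, 1),                   # 1x1 bottleneck
            nn.Conv2d(64, 192, 3, padding=1))         # 3x3 conv, 192 filters
        self.branch5 = nn.Sequential(
            nn.Conv2d(c_in, 64, 1),                   # 1x1 bottleneck
            nn.Conv2d(64, 96, 5, padding=2))          # 5x5 conv, 96 filters
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(c_in, 64, 1))                   # reduce depth after pooling

    def forward(self, x):
        # concatenate all branch outputs along the depth dimension
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 256, 28, 28)
print(InceptionModule()(x).shape)  # torch.Size([1, 480, 28, 28]); 480 = 128+192+96+64
```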
Stem network: conv - pool - 2x conv - pool
Classifier output:
- no FC layers
- instead: average pooling (0 params) + 1 linear layer
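A sketch of such a classifier head, assuming 1024-channel final feature maps (as in GoogLeNet) and 1000 ImageNet classes:

```python
import torch.nn as nn

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # global average pooling: 0 params
    nn.Flatten(),              # 1024-dim vector per image
    nn.Linear(1024, 1000),     # the single linear layer to class scores
)
```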
GoogLeNet: deeper with computational efficiency
- 22 layers
- Efficient “Inception” module
- No FC layers
- 12x fewer params than AlexNet
- ILSVRC'14 classification winner (6.7% top-5 error)
Depth Revolution
- ResNet: ILSVRC'15 classification winner (3.57% top-5 error)
What's strange?
F(x) + x: element-wise addition
"... learning residual functions with reference to the layer inputs, instead of learning unreferenced functions."
ResNet
H(x) = F(x) + x
Hypothesis:
If identity mappings were optimal (at some late stage in a deep network), "it would be easier to push the residual to zero than to fit an identity mapping by a stack of non-linear layers".
Residual blocks should also help if the optimal function is close to an identity mapping.
[Figure: residual block, two 3x3 conv layers with 64 filters plus identity shortcut]
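A minimal PyTorch sketch of this basic residual block (the batch norm placement follows the common post-conv convention, an assumption rather than something shown on the slide):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """H(x) = F(x) + x, with F(x) = two 3x3 convs with 64 filters."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # element-wise addition of the identity shortcut
```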
➢ Increases width of residual block through multiple parallel pathways (see the sketch after this list)
➢ Parallel pathways similar in spirit to Inception module
➢ Alleviates vanishing gradient, strengthens feature propagation, encourages feature reuse
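One common way to realize many parallel pathways inside a residual block is a grouped convolution; a hedged sketch (the 32 groups and the channel widths are illustrative assumptions, not the lecture's configuration):

```python
import torch.nn as nn

# Bottleneck residual branch whose 3x3 stage is split into 32 parallel
# pathways via groups=32; the branch output would be added element-wise
# to the block input, as in ResNet.
branch = nn.Sequential(
    nn.Conv2d(256, 128, 1, bias=False),                        # reduce depth
    nn.Conv2d(128, 128, 3, padding=1, groups=32, bias=False),  # 32 parallel paths
    nn.Conv2d(128, 256, 1, bias=False),                        # restore depth
)
```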
MobileNets (from Google)
- Useful for mobile and embedded vision applications
- Smaller model size (fewer params)
- Smaller complexity (fewer multiply-additions)
Main idea: separable convolution
Factor the conv kernel into two operations: depthwise conv & pointwise conv
Howard et al.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017
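A minimal PyTorch sketch of one depthwise separable convolution (ReLU/batch norm omitted; the channel counts in the comment are illustrative):

```python
import torch.nn as nn

class SeparableConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # depthwise: one 3x3 filter per input channel (groups=c_in)
        self.depthwise = nn.Conv2d(c_in, c_in, 3, padding=1,
                                   groups=c_in, bias=False)
        # pointwise: 1x1 conv that mixes channels and sets the output depth
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# e.g. 256 -> 512 channels: 3*3*256 + 256*512 = 133,376 weights,
# vs. 3*3*256*512 = 1,179,648 weights for a standard 3x3 conv
```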
More recent trend towards examining necessity of depth vs. width and
residual connections
Key Papers:
GoogLeNet: C. Szegedy et al., Going Deeper with Convolutions, CVPR 2015 (arXiv, ImageNet Challenge 2014)
ResNet: K. He et al., Deep Residual Learning for Image Recognition, CVPR 2016 (arXiv, 2015)
Additional
See the previous slides …