V05 SS24 DL CNNs Lecture2

Deep Learning For Computer Vision

Vorlesung SS 2024
Prof. Dr.-Ing. Rainer Stiefelhagen, Dr. Saquib Sarfraz, Dr. Simon Reiß
Maschinensehen für MMI, Institut für Anthropomatik & Robotik
Zentrum für digitale Barrierefreiheit und Assistive Technologien (ACCESS@KIT)
Institut für Anthropomatik und Robotik, Fakultät für Informatik

KIT – Universität des Landes Baden-Württemberg und


nationales Forschungszentrum in der Helmholtz-Gemeinschaft www.kit.edu
Today

VGG (from last week)

CNNs as feature extractors

Newer / better architectures
- GoogleNet – Inception modules
- ResNet

Very recent adaptations
- Wide ResNet
- ResNeXt
- MobileNet

3 Deep Learning in CV - CNNs Maschinensehen für MMI (Prof. Stiefelhagen)


Institut für Anthropomatik und Robotik
Last Week

Basics of Convolutional Neural Networks (CNNs)
- convolutional layers
- pooling
- normalization (batch normalization)
- non-linearity: sigmoid, tanh, ReLU

AlexNet (2012) – 8 layers

Modern CNN revolution

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Deeper Networks

Figure copyright: Kaiming He, 2016

Going deeper
- 16-19 weight layers
- simple filters, small receptive fields: 3x3
- small filters reduce the number of weights
- top-5 error rate: 7.1% (compare against 12-14%)

Architecture (16 weight layers):
image
conv3-64, conv3-64, maxpool
conv3-128, conv3-128, maxpool
conv3-256, conv3-256, conv1-256, maxpool        (convolutional layers)
conv3-512, conv3-512, conv1-512, maxpool
conv3-512, conv3-512, conv1-512, maxpool
FC-4096, FC-4096, FC-1000                       (fully connected layers)
softmax

K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, ICLR 2015
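The weight savings from small filters can be checked with a quick back-of-the-envelope calculation. This is a sketch (biases ignored, `conv_weights` is an illustrative helper, not from the slides): three stacked 3x3 layers cover the same 7x7 receptive field as one 7x7 layer, with far fewer weights.

```python
# Weight count of one k x k conv layer with c input channels and
# c output channels, ignoring biases: k * k * c * c.
def conv_weights(k, c):
    return k * k * c * c

c = 512  # channel width of VGG's deepest stages

# Three stacked 3x3 layers see a 7x7 receptive field,
# but need roughly half the weights of a single 7x7 layer:
stacked = 3 * conv_weights(3, c)   # 3 * (3*3*512*512) = 7,077,888
single = conv_weights(7, c)        # 7*7*512*512       = 12,845,056
print(stacked, single)
```

The same comparison holds at any channel width, since both counts scale with c².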

Some results (1)

Some results (2)

Neuron visualization

Filter visualization for Conv 1


Other layers: mean image of the top 100 images with the largest response
Object blobs

B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, A. Oliva


Learning Deep Features for Scene Recognition using Places Database, NIPS 2014.

Using CNNs as Feature Extractors: DeCAF

Donahue J., et al.


DeCAF: A Deep Convolutional Activation Feature for Generic Visual
Recognition. ICML 2014

Deep networks as feature extractors

Describe the image with features; don't just classify what is in it

Deep networks automatically learn good features
- a hierarchy of filters going from simple edges, to object parts, to objects

The last layer of a CNN is typically the softmax
- use the output of the layer before it as the feature

Donahue J., et al., DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. ICML 2014
DeCAF

Train the network end-to-end on image classification (e.g., ImageNet)

Use the pre-trained network
- for classification on other tasks (e.g., scene recognition): switch the last layer for the new task, re-run training for a few epochs
- as a feature extractor: remove the last layer, use the hidden unit values as the feature

DeCAF7 features (4096-dim)

Classifier: softmax layer
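The "remove the last layer" idea can be illustrated with a toy stand-in for a network. This is a minimal sketch: the lambda "layers" are hypothetical placeholders for conv/FC layers, not DeCAF's actual ones.

```python
# A "network" modeled as an ordered list of layer functions.
def forward(layers, x):
    for layer in layers:
        x = layer(x)
    return x

layers = [
    lambda x: [2.0 * v for v in x],   # stand-in hidden layer
    lambda x: [v + 1.0 for v in x],   # stand-in hidden layer ("DeCAF7"-style features)
    lambda x: [v * v for v in x],     # stand-in classifier / softmax head
]

prediction = forward(layers, [1.0, 2.0])      # full network: classify
features = forward(layers[:-1], [1.0, 2.0])   # drop the head: feature extractor
print(features)  # [3.0, 5.0] -- hidden unit values used as the feature
```

Swapping the head for a new task (fine-tuning) is the same idea: keep `layers[:-1]`, attach a fresh last layer.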

More data sets and problems

Fine-grained recognition: CUB-200 Birds data set

Image classification: MIT-67 Indoor Scenes

Object instance retrieval: 5 data sets!

It's all about the features!
- SIFT / HOG were similar breakthroughs

A. Razavian, et al., CNN Features off-the-shelf: An Astounding Baseline for Recognition, DeepVision Workshop @ CVPR 2014
DeepFace

Y. Taigman, M. Yang, M.-A. Ranzato, L. Wolf


DeepFace: Closing the Gap to Human-Level Performance in Face
Verification. CVPR 2014

Deep networks as face feature extractors

3D-aligned face image: 152x152 pixels
- Convolutional layer C1: 32 filters, 11x11x3 (3 RGB channels)
- Max-pooling layer M2: 3x3, stride 2
- Convolutional layer C3: 16 filters, 9x9x32
- Locally connected layers L4, L5, L6: 4096-dim representation
- Fully connected F7, F8: feature representation
- No further max pooling, since the images are already aligned and contain only faces
Y. Taigman et al., DeepFace: Closing the Gap to Human-Level Performance in Face Verification. CVPR 2014
Performance

Almost as good as humans!

LFW
- face image verification, restricted setting
- accuracy: 97.0% (single)
- non-deep: 96.3%

YTF
- video face verification
- accuracy: 91.4%
- non-deep: 79.7%

Face Verification
Image verification: Labeled Faces in the Wild (LFW)

Same pair Different pair

Video verification: YouTube Faces (YTF)

Same pair Different pair

Modern CNN revolution

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Deeper Networks

Figure copyright: Kaiming He, 2016

VGGNet (heavy memory, more parameters)

GoogleNet
[Szegedy et al. 2014] (ImageNet Challenge 2014)
C. Szegedy et al., Going Deeper with Convolutions, CVPR 2015

Deeper networks, with computational efficiency
- 22 layers
- efficient "Inception" module
- no FC layers
- only 5 million parameters! (12x fewer than AlexNet)
- ILSVRC'14 classification winner (6.7% top-5 error)

GoogleNet
[Szegedy et al. 2014]

"Inception module": design a good local network topology (a network within a network), then stack these modules on top of each other

Modules inspired by multi-scale processing

Name inspired by an internet meme

GoogleNet
[Szegedy et al. 2014]

Apply parallel filter operations on the input from the previous layer:
- multiple receptive field sizes for convolution (1x1, 3x3, 5x5)
- a pooling operation (3x3)

Concatenate all filter outputs together depth-wise

Naive Inception module

GoogleNet
[Szegedy et al. 2014]

Example Q1: What is the output size of the 1x1 conv with 128 filters?

128 192 96

Input:
28x28x256

Naive Inception module

GoogleNet
[Szegedy et al. 2014]

Example Q1: What is the output size of the 1x1 conv with 128 filters?

28x28x128

128 192 96

Input:
28x28x256

Naive Inception module

GoogleNet
[Szegedy et al. 2014]

Example Q2: What are the output sizes of all filters?

28x28x128 ?x?x192 ?x?x96 ?x?x256

128 192 96

Input:
28x28x256

Naive Inception module

GoogleNet
[Szegedy et al. 2014]

Example Q2: What are the output sizes of all filters?

28x28x128 28x28x192 28x28x96 28x28x256

128 192 96

Input:
28x28x256

Naive Inception module

GoogleNet
[Szegedy et al. 2014]

Example Q3: What is the output size after filter concatenation?

28x28x128 28x28x192 28x28x96 28x28x256

128 192 96

Input:
28x28x256

Naive Inception module

GoogleNet
[Szegedy et al. 2014]

Example: 28x28x(128+192+96+256) = 28x28x672

28x28x128 28x28x192 28x28x96 28x28x256

128 192 96

Input:
28x28x256
Problem: computational complexity

Naive Inception module
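The concatenated output size follows directly from the branch depths; a quick check of the slide's numbers:

```python
# Depth-wise concatenation of the four parallel branch outputs:
# spatial dims are preserved, depths add up.
h, w = 28, 28
branch_depths = [128, 192, 96, 256]  # 1x1 conv, 3x3 conv, 5x5 conv, 3x3 pool
out_shape = (h, w, sum(branch_depths))
print(out_shape)  # (28, 28, 672)
```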

GoogleNet
[Szegedy et al. 2014]

Example: 28x28x(128+192+96+256) = 28x28x672

Conv ops:
[1x1 conv, 128]: 28x28x128x1x1x256
[3x3 conv, 192]: 28x28x192x3x3x256
[5x5 conv, 96]:  28x28x96x5x5x256
Total: 854M ops

Input: 28x28x256

Very expensive compute. The pooling layer also preserves feature depth, so the total depth after concatenation can only grow at every layer!

Naive Inception module
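The 854M figure can be reproduced by counting one multiplication per kernel weight per output position. A sketch (multiplications only, stride 1, 'same' padding):

```python
# Multiplications for a k x k convolution producing an h x w x c_out
# output from a c_in-deep input.
def conv_mults(h, w, c_out, k, c_in):
    return h * w * c_out * k * k * c_in

ops = (conv_mults(28, 28, 128, 1, 256)   # 1x1 conv, 128 filters
     + conv_mults(28, 28, 192, 3, 256)   # 3x3 conv, 192 filters
     + conv_mults(28, 28, 96, 5, 256))   # 5x5 conv, 96 filters
print(ops)  # 854,196,224 -> the ~854M ops on the slide
```

The 5x5 branch alone accounts for more than half of the total, which is why the bottleneck layers on the following slides target the inputs of the 3x3 and 5x5 convolutions.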


GoogleNet
[Szegedy et al. 2014]

Example: 28x28x(128+192+96+256) = 28x28x672

Input: 28x28x256

Solution: "bottleneck" layers that use 1x1 convolutions to reduce the feature depth

Naive Inception module

1 x 1 Conv layer

1x1 CONV with 32 filters

Each filter is 1x1x64 and performs a 64-dimensional dot product

1 x 1 Conv layer

1x1 CONV with 32 filters
- preserves spatial dimensions, reduces depth!
- projects the depth to a lower dimension (a combination of feature maps)
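At a single spatial position, a 1x1 convolution is exactly a dot product over the depth dimension. A minimal sketch in plain Python (`conv1x1_at` is an illustrative helper):

```python
# 1x1 convolution at one (x, y) location: each of the c_out filters
# takes a dot product with the c_in-dimensional depth column.
def conv1x1_at(depth_column, filters):
    return [sum(w * a for w, a in zip(f, depth_column))
            for f in filters]

column = [1.0] * 64                        # one depth column of a 64-deep input
filters = [[0.5] * 64 for _ in range(32)]  # 32 filters, each 1x1x64
out = conv1x1_at(column, filters)
print(len(out))  # 32 -- depth reduced from 64 to 32, spatial dims untouched
```

Applying this at every (x, y) position independently is what makes the layer cheap: no spatial neighborhood is touched.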

GoogleNet

1x1 conv "bottleneck" layers

Naive Inception module vs. Inception with dimension reduction

1x1 convolutions are included before the expensive 3x3 and 5x5 convolutions

GoogleNet
Using the same parallel layers as the naive example, and adding "1x1 conv, 64 filter" bottlenecks:

Conv ops:
[1x1 conv, 64]:  28x28x64x1x1x256
[1x1 conv, 64]:  28x28x64x1x1x256
[1x1 conv, 128]: 28x28x128x1x1x256
[3x3 conv, 192]: 28x28x192x3x3x64
[5x5 conv, 96]:  28x28x96x5x5x64
[1x1 conv, 64]:  28x28x64x1x1x256
Total: 358M ops (compared to 854M ops for the naive version)

A bottleneck can also reduce the depth after the pooling layer

GoogleNet

Stack Inception modules with dimension reduction on top of each other

GoogleNet
Full Architecture

Stem Network:
conv - Pool - 2x conv- Pool

GoogleNet
Full Architecture

Stacked Inception Modules

GoogleNet
Full Architecture

Classifier output:
- no FC layers
- instead: average pooling + 1 linear layer

Global Average Pooling

AlexNet / VGG use two FC layers towards the end of the network
- an FC layer on the 7x7x1024 feature volume alone needs ~51M params (7x7x1024x1024)

Now: global average pooling
- average each of the last feature maps down to 1x1 (0 params)
- then 1 linear layer and softmax for classification

Far fewer parameters

Better performance than FC layers (+0.6%)
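The parameter counts above are easy to verify, and global average pooling itself is just a mean per feature map. A sketch (`global_avg_pool` is an illustrative helper working on nested lists):

```python
# Global average pooling: collapse each h x w feature map to its mean.
# No weights are involved, so the layer contributes 0 parameters.
def global_avg_pool(feature_maps):
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]

# The FC layer it replaces, mapping a flattened 7x7x1024 volume to 1024 units:
fc_params = 7 * 7 * 1024 * 1024        # 51,380,224 -> the ~51M on the slide
gap_params = 0

pooled = global_avg_pool([[[1.0] * 7 for _ in range(7)]] * 1024)
print(fc_params, len(pooled))  # 51380224 1024
```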

GoogleNet
Full Architecture

Two auxiliary loss layers inject additional gradient at lower layers
(AvgPool - 1x1 Conv - FC - FC - Softmax)

GoogleNet
Full Architecture

22 total layers with weights (including each parallel layer in an Inception module)

GoogleNet Summary

Deeper, with computational efficiency
- 22 layers
- efficient "Inception" module
- no FC layers
- 12x fewer params than AlexNet
- ILSVRC'14 classification winner (6.7% top-5 error)

Modern CNN revolution

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners

Depth Revolution

Figure copyright: Kaiming He, 2016

ResNet
[He et al. 2015]; K. He et al., Deep Residual Learning for Image Recognition, CVPR 2016

Very deep networks using residual connections
- 152-layer model for ImageNet
- ILSVRC'15 classification winner (3.57% top-5 error)
- swept all classification and detection competitions in ILSVRC'15 and COCO'15

ResNet

Directly stacking more layers on a plain CNN

What's strange? The deeper model performs worse, but it's not due to overfitting
ResNet

Directly stacking more layers on a plain CNN

The deeper model performs worse, but it's not due to overfitting: it is worse on the train error as well as the test error (overfitting would show low train error with high test error)

Deeper models are very hard to optimize (an optimization problem)

ResNet

Solution: use network layers to fit a residual mapping instead of the direct underlying mapping

F(x) + x: element-wise addition

"... learning residual functions with reference to the layer inputs, instead of learning unreferenced functions."
ResNet

Solution: use network layers to fit a residual mapping instead of the direct underlying mapping

H(x) = F(x) + x

Use the layers to fit the residual F(x) = H(x) - x instead of fitting H(x) directly
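The skip connection itself can be written in two lines. A toy sketch, where `F` stands in for the learned stack of layers:

```python
# H(x) = F(x) + x: element-wise addition of the residual and the input.
def residual_block(x, F):
    return [f + xi for f, xi in zip(F(x), x)]

# If the optimal mapping is the identity, F only needs to learn zeros:
out = residual_block([1.0, 2.0, 3.0], lambda x: [0.0] * len(x))
print(out)  # [1.0, 2.0, 3.0] -- the input passes through unchanged
```

This is the point of the formulation: pushing F toward zero is easy for a stack of layers, whereas learning an exact identity mapping from scratch is hard.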

Motivation / Related Work

Hypothesis: if identity mappings were optimal (at some late stage in a deep network), "it would be easier to push the residual to zero, than to fit an identity mapping by a stack of non-linear layers"
- residual blocks should also help if the optimal function is close to an identity mapping

Modeling residuals, e.g. with respect to some codebook, has been quite successful in computer vision
- e.g. Fisher vectors, VLAD

Shortcut connections have been studied for a long time
- they reduce the vanishing/exploding gradient problem when using many layers
- see also LSTMs, later in this lecture ...
- they are also known in biological systems (i.e. in brains)
- but most important: it works ☺
ResNet

Full ResNet architecture
- stack residual blocks
- every residual block has two 3x3 conv layers

ResNet

Full ResNet architecture
- stack residual blocks
- every residual block has two 3x3 conv layers
- periodically, double the number of filters and downsample spatially by 2 (e.g. from "3x3 conv, 64 filters" to "3x3 conv, 128 filters, /2" with stride 2)

ResNet

Full ResNet architecture
- stack residual blocks
- every residual block has two 3x3 conv layers
- periodically, double the number of filters and downsample spatially by 2 (e.g. from "3x3 conv, 64 filters" to "3x3 conv, 128 filters, /2" with stride 2)
- additional conv layer at the beginning

ResNet

Full ResNet architecture
- stack residual blocks
- every residual block has two 3x3 conv layers
- periodically, double the number of filters and downsample spatially by 2 (e.g. from "3x3 conv, 64 filters" to "3x3 conv, 128 filters, /2" with stride 2)
- additional conv layer at the beginning
- no extra FC layers at the end
- global average pooling after the last conv layer
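The "double the filters, halve the resolution" rule keeps the compute per stage roughly balanced. As a shape sketch (the stage sizes are illustrative, chosen to match a 56x56x64 tensor after the stem):

```python
# One stride-2 stage transition: halve the spatial dims, double the channels.
def downsample_stage(h, w, c):
    return (h // 2, w // 2, 2 * c)

shape = (56, 56, 64)      # example tensor after the initial conv + pooling
for _ in range(3):        # three downsampling transitions
    shape = downsample_stage(*shape)
print(shape)  # (7, 7, 512) -- global average pooling then reduces this to 512
```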

ResNet

Total depths tested: 34, 50, 101 and 152 layers for ImageNet

ResNet
For deeper nets (50+ layers), use a bottleneck layer to improve efficiency (similar to GoogleNet)

ResNet
ResNet training
- batch normalization after every conv layer
- Xavier initialization
- SGD + momentum (0.9)
- learning rate: 0.1, divided by 10 when the validation error plateaus
- no dropout layer

ILSVRC 2015 winner
- top-5 error 3.6%
- better than human performance

Complexity Comparison

Alfredo Canziani, Adam Paszke, Eugenio Culurciello, An Analysis of Deep Neural Network Models for Practical Applications, arXiv 2017

Complexity Comparison
Inception-v4: ResNet + Inception


Complexity Comparison
VGG: highest memory and most ops


Complexity Comparison
GoogleNet: most efficient


Complexity Comparison
AlexNet: smaller compute, but memory-heavy and low accuracy


Complexity Comparison
ResNet: moderate compute & memory, highest accuracy


More Architectures: Current Improvements

Improving ResNet

Identity Mappings in Deep Residual Networks [He et al. 2016]

➢ Improved ResNet block design from the creators of ResNet

➢ Creates a more direct path for propagating information through the network (moves the activation onto the residual mapping pathway)

➢ Gives better performance

Improving ResNet

Aggregated Residual Transformations for Deep Neural Networks (ResNeXt) [Xie et al. 2016]

➢ Also from the creators of ResNet

➢ Increases the width of the residual block through multiple parallel pathways

➢ Parallel pathways similar in spirit to the Inception module
Beyond ResNets

Densely Connected CNN: DenseNet [Huang et al. 2017]

➢ Dense blocks where each layer is connected to every other layer in a feedforward fashion

➢ Alleviates vanishing gradients, strengthens feature propagation, encourages feature reuse

MobileNetv1

From Google
- useful for mobile and embedded vision applications
- smaller model size (fewer params)
- smaller complexity (fewer multiply-additions)

Main idea: depthwise separable convolution

Howard et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017

MobileNet - Depthwise Separable Convolution

Separable convolution: factor the conv kernel into two operations
- a depthwise conv and a pointwise conv

MobileNet - Depthwise Separable Convolution

Compute for a normal convolution producing an 8x8x256 output:
- we need 256 kernels of size 5x5x3
- total compute: 256x5x5x3x8x8 = 1,228,800 multiplications

Compute for a depthwise separable convolution producing an 8x8x256 output:
- depthwise conv with 3 kernels of size 5x5x1; compute: 3x5x5x8x8 = 4,800
- pointwise conv with 256 kernels of size 1x1x3; compute: 256x1x1x3x8x8 = 49,152
- total: 53,952 multiplications
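These multiplication counts can be reproduced with the same counting rule used for a standard convolution. A sketch (multiplications only):

```python
# Multiplications for a convolution producing an h x w x c_out output
# with k x k x c_in kernels.
def conv_mults(h, w, c_out, k, c_in):
    return h * w * c_out * k * k * c_in

standard = conv_mults(8, 8, 256, 5, 3)    # 256 kernels of 5x5x3 -> 1,228,800
depthwise = conv_mults(8, 8, 3, 5, 1)     # one 5x5x1 kernel per channel -> 4,800
pointwise = conv_mults(8, 8, 256, 1, 3)   # 256 kernels of 1x1x3 -> 49,152
separable = depthwise + pointwise         # 53,952
print(standard, separable)  # roughly a 23x reduction in multiplications
```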

MobileNet Performance

Take Home Messages : CNN Architectures

VGG, GoogLeNet, ResNet are all in wide use and available in the major deep learning platforms

ResNet and its variants are the current best default (as of ~2017)

Significant research centers around the design of layer / skip connections and improving gradient flow

A more recent trend examines the necessity of depth vs. width and residual connections

Neural Architecture Search (NAS-Net): search for the best building blocks for a particular application/dataset [see some reading resources here: https://github.com/anonymone/Neural-Architecture-Search ]

References

Key papers:
- GoogleNet: C. Szegedy et al., Going Deeper with Convolutions, CVPR 2015 (arXiv 2014, ImageNet Challenge 2014)
- ResNet: K. He et al., Deep Residual Learning for Image Recognition, CVPR 2016 (arXiv 2015)

Additional:
- see the previous slides ...

