GoogleNET and ResNet v4 With Nin and Bias

The document discusses convolutional 3D neural networks and ResNet architectures for image recognition. It describes the components and design of ResNet, including residual blocks, identity mappings, and periodically doubling the number of filters and downsampling spatially using stride.

Convolutional 3D Neural Network (C3D)

Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive
into deep learning. arXiv preprint arXiv:2106.11342.
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners

Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• Global Average Pooling is a pooling operation designed to replace fully
connected layers in classical CNNs. The idea is to generate one feature map
for each corresponding category of the classification task in the last
mlpconv layer. Instead of adding fully connected layers on top of the feature
maps, we take the average of each feature map, and the resulting vector is
fed directly into the softmax layer.
• One advantage of global average pooling over fully connected layers is that it is more native to the convolution structure, enforcing correspondences between feature maps and categories. The feature maps can thus be interpreted as category confidence maps. Another advantage is that there are no parameters to optimize in global average pooling, so overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, making it more robust to spatial translations of the input. (A minimal sketch follows below.)
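As a concrete illustration, here is a minimal sketch assuming PyTorch; the channel count (128), input size, and number of classes are made up for the example and are not from the slides.

import torch
import torch.nn as nn

num_classes = 10

# NiN-style head: the last "mlpconv" layer emits one feature map per category,
# then global average pooling collapses each map to a single score.
head = nn.Sequential(
    nn.Conv2d(in_channels=128, out_channels=num_classes, kernel_size=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # global average pooling: HxW -> 1x1 per feature map
    nn.Flatten(),              # (N, num_classes, 1, 1) -> (N, num_classes)
)

x = torch.randn(4, 128, 7, 7)            # dummy batch of feature maps
logits = head(x)                         # (4, num_classes), no FC layer needed
probs = torch.softmax(logits, dim=1)     # fed directly into the softmax

Because the pooled vector has exactly one entry per class, there are no extra parameters between the feature maps and the softmax.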
An example
[Figure: a fitted model that is likely to overfit the data.]
Underfitting and Overfitting
[Figure: underfitting vs. overfitting as model complexity increases; for the decision tree example, complexity := number of nodes.]

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, test error is large although training error is small.
How Overfitting affects Prediction
[Figure: predictive error vs. model complexity; error on training data keeps decreasing while error on test data rises again past the ideal range of model complexity, marking the transition from underfitting to overfitting.]
Bias and Variance
• In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
• The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting), i.e., the model class does not contain the solution.
• The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outputs, i.e., the model is too flexible and also learns the noise. (The standard decomposition is sketched below.)
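As a hedged aside (a standard result, not stated on the slides): for squared loss, the expected test error at a point x decomposes into exactly these two sources plus irreducible noise. In LaTeX notation, with f the true function, \hat{f} the learned model, and \sigma^2 the noise variance,

\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}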

Bias and Variance
• Ensemble methods
• Combine learners to reduce variance

from Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition (pp. 770-778).
If the identity mapping f(x) = x is the desired underlying mapping, the residual mapping amounts to g(x) = 0 and is thus easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., a fully connected or convolutional layer) within the dotted-line box to zero.
Residual Blocks

With a^{[l]} denoting the input to the block, a^{[l+1]} the hidden activation, and a^{[l+2]} the output:

z^{[l+1]} = W^{[l+1]} a^{[l]} + b^{[l+1]},    a^{[l+1]} = g(z^{[l+1]})              ("linear", then "relu")

z^{[l+2]} = W^{[l+2]} a^{[l+1]} + b^{[l+2]},    a^{[l+2]} = g(z^{[l+2]} + a^{[l]})    ("output", then "relu on (output plus input)")
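A minimal code sketch of these equations, assuming PyTorch; following the training notes later in the deck, batch normalization is placed after each convolution, and the layer sizes are illustrative rather than taken from the slides.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two weight layers plus a skip connection: a^[l+2] = relu(z^[l+2] + a^[l])."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):                       # x plays the role of a^[l]
        out = F.relu(self.bn1(self.conv1(x)))   # a^[l+1] = g(z^[l+1])
        out = self.bn2(self.conv2(out))         # z^[l+2]
        return F.relu(out + x)                  # a^[l+2] = g(z^[l+2] + a^[l])

x = torch.randn(2, 64, 56, 56)
print(BasicResidualBlock(64)(x).shape)          # torch.Size([2, 64, 56, 56])

If the identity mapping is what the block should compute, the weights of conv1/conv2 can simply be pushed toward zero.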
ResNet Architecture

Full ResNet architecture:
- Stack residual blocks
- Every residual block has two 3x3 conv layers
- Periodically, double the number of filters and downsample spatially using stride 2 (/2 in each dimension)
- Additional conv layer at the beginning (7x7 conv, 64, /2)
- No FC layers at the end (only FC 1000 to output classes); a global average pooling layer follows the last conv layer

[Figure: the full network stack, read from Input upward: 7x7 conv, 64, /2 → Pool → stacked 3x3 conv stages with 64, 128, ..., 512 filters, where the first conv of each new stage uses stride 2 (e.g. 3x3 conv, 128, /2 and 3x3 conv, 512, /2) → Pool (global average) → FC 1000 → Softmax. Inset: a residual block computes F(x) + x, i.e. two 3x3 conv layers with a relu in between, an identity skip connection, and a relu applied after the addition.]
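To make the "double the number of filters, downsample with stride 2" step concrete, here is a hedged sketch assuming PyTorch. The 1x1 stride-2 projection on the skip path is one common way to match shapes when channels and resolution change; it is an assumption, since the slide only draws the identity path.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """First residual block of a new stage: channels double, spatial size halves."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # projection shortcut so the skip connection can be added after downsampling
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=2, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.proj(x))

x = torch.randn(1, 64, 56, 56)
print(DownsampleBlock(64, 128)(x).shape)   # torch.Size([1, 128, 28, 28]): 2x filters, /2 spatially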
ResNet Architecture
Total depths of 34, 50, 101, or 152 layers for ImageNet.

[Figure: the same full-network stack as above, from the Input and the 7x7 conv, 64, /2 stem up through the 3x3 conv stages (64, 128, ..., 512 filters) to Pool, FC 1000, and Softmax.]
ResNet Architecture

For deeper networks (ResNet-50+), use a "bottleneck" layer to improve efficiency (similar to GoogLeNet):
- 1x1 conv, 64 filters, projects the 28x28x256 input down to 28x28x64
- 3x3 conv then operates over only 64 feature maps
- 1x1 conv, 256 filters, projects back to 256 feature maps (28x28x256 output) (see the sketch below)

Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into deep learning. arXiv preprint arXiv:2106.11342.
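A minimal sketch of this bottleneck block, assuming PyTorch; the channel sizes mirror the 28x28x256 example above, and the batch normalization placement follows the rest of the deck rather than the figure itself.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """1x1 reduce -> 3x3 on few maps -> 1x1 expand, plus a skip connection."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, 1, bias=False)               # 256 -> 64
        self.bn1 = nn.BatchNorm2d(bottleneck)
        self.conv = nn.Conv2d(bottleneck, bottleneck, 3, padding=1, bias=False)    # 3x3 over only 64 maps
        self.bn2 = nn.BatchNorm2d(bottleneck)
        self.expand = nn.Conv2d(bottleneck, channels, 1, bias=False)               # 64 -> 256
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return F.relu(out + x)

x = torch.randn(1, 256, 28, 28)
print(BottleneckBlock()(x).shape)   # torch.Size([1, 256, 28, 28])

The expensive 3x3 convolution now touches 64 instead of 256 feature maps, which is where the efficiency gain comes from.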
Residual Blocks (skip connections)
Training ResNet in practice
• Batch Normalization after every CONV layer.
• Xavier/2 initialization from He et al.
• SGD + Momentum (0.9)
• Learning rate: 0.1, divided by 10 when validation error
plateaus.
• Mini-batch size 256.
• Weight decay of 1e-5.
• No dropout used. (A hedged sketch of this recipe follows.)
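As an illustration of the recipe above (a sketch, not the authors' actual training script), assuming PyTorch and torchvision; the model choice, data loader, and epoch loop are placeholders.

import torch
import torch.nn as nn
from torchvision.models import resnet34

model = resnet34(num_classes=1000)        # batch norm after every conv layer is built in
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-5)
# divide the learning rate by 10 when the validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:         # mini-batches of size 256 in the slides
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# after validating each epoch: scheduler.step(validation_error); no dropout anywhere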
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)

Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)

2-3 weeks of training on an 8-GPU machine.

At runtime: faster than a VGGNet! (even though it has 8x more layers)

(slide from Kaiming He’s recent presentation)

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Case Study: ResNet [He et al., 2015]
[Figure: with a 224x224x3 input, after the initial conv and pooling the spatial dimension is only 56x56.]
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Case Study: ResNet [He et al., 2015]

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Comparing complexity...

[Figure: accuracy versus operations and memory for common architectures, annotated as follows:]
- Inception-v4: ResNet + Inception!
- VGG: highest memory, most operations
- GoogLeNet: most efficient
- AlexNet: smaller compute, still memory heavy, lower accuracy
- ResNet: moderate efficiency depending on model, highest accuracy

An Analysis of Deep Neural Network Models for Practical Applications, 2017.
Figures copyright Alfredo Canziani, Adam Paszke, Eugenio Culurciello, 2017. Reproduced with permission.
We can take some inspiration from the Inception block of Fig. 8.4.1, which has information flowing through the block in separate groups. Applying the idea of multiple independent groups to the ResNet block of Fig. 8.6.3 led to the design of ResNeXt (Xie et al., 2017). Different from the smorgasbord of transformations in Inception, ResNeXt adopts the same transformation in all branches, thus minimizing the need for manual tuning of each branch.

Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive
into deep learning. arXiv preprint arXiv:2106.11342.
• Breaking up a convolution from ci to co channels into g groups of size ci/g, generating g outputs of size co/g, is called, quite fittingly, a grouped convolution. The computational cost (proportionally) is reduced from O(ci·co) to O(g·(ci/g)·(co/g)) = O(ci·co/g), i.e., it is g times faster. Even better, the number of parameters needed to generate the output is also reduced from a ci×co matrix to g smaller matrices of size (ci/g)×(co/g), again a g-fold reduction. In what follows we assume that both ci and co are divisible by g.
• The only challenge in this design is that no information is exchanged between the g groups. The ResNeXt block of Fig. amends this in two ways: the grouped convolution with a 3×3 kernel is sandwiched between two 1×1 convolutions, and the second one serves double duty in changing the number of channels back. The benefit is that we only pay the O(c·b) cost for 1×1 kernels (b being the number of intermediate channels) and can make do with an O(b²/g) cost for 3×3 kernels. Similar to the residual block implementation in, the residual connection is replaced (thus generalized) by a 1×1 convolution. (A hedged sketch follows below.)
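To make the grouped convolution concrete, here is a hedged sketch of a ResNeXt-style block assuming PyTorch; the channel counts (c_in = 256, b = 128) and group count (g = 32) are illustrative choices, not taken from the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBlock(nn.Module):
    """Grouped 3x3 conv sandwiched between two 1x1 convs, with a 1x1 conv on the skip path."""
    def __init__(self, c_in=256, b=128, c_out=256, groups=32):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, b, 1, bias=False)                         # 1x1: c_in -> b
        self.bn1 = nn.BatchNorm2d(b)
        self.conv2 = nn.Conv2d(b, b, 3, padding=1, groups=groups, bias=False)  # grouped 3x3
        self.bn2 = nn.BatchNorm2d(b)
        self.conv3 = nn.Conv2d(b, c_out, 1, bias=False)                        # 1x1: b -> c_out
        self.bn3 = nn.BatchNorm2d(c_out)
        self.proj = nn.Conv2d(c_in, c_out, 1, bias=False)   # residual connection generalized by a 1x1 conv

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.proj(x))

x = torch.randn(1, 256, 14, 14)
print(ResNeXtBlock()(x).shape)   # torch.Size([1, 256, 14, 14])

The groups argument of nn.Conv2d implements the g independent groups: each group sees only b/g input channels and produces b/g output channels, giving the g-fold cost reduction described above.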
“You need a lot of data if you want to train/use CNNs”

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
