GoogLeNet and ResNet v4 with NiN and Bias
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive into deep learning. arXiv preprint arXiv:2106.11342.
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC) winners
Slide taken from Fei-Fei Li & Justin Johnson & Serena Yeung, Lecture 9.
• Global Average Pooling is a pooling operation designed to replace fully connected layers in classical CNNs. The idea is to generate one feature map for each category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer (see the sketch below).
• One advantage of global average pooling over fully connected layers is that it is more native to the convolution structure, enforcing correspondences between feature maps and categories. The feature maps can thus be easily interpreted as category confidence maps. Another advantage is that there are no parameters to optimize in global average pooling, so overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, so it is more robust to spatial translations of the input.
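A minimal sketch of this idea, assuming PyTorch (not prescribed by the slides) and a hypothetical 128-channel feature map with 10 categories: the last conv layer emits one feature map per class, global average pooling collapses each map to a single value, and the resulting vector feeds the softmax directly.

```python
import torch
import torch.nn as nn

num_classes = 10   # hypothetical number of categories

head = nn.Sequential(
    nn.Conv2d(128, num_classes, kernel_size=1),  # last conv: one feature map per category
    nn.AdaptiveAvgPool2d(1),                     # global average pooling: each map -> one value
    nn.Flatten(),                                # (N, num_classes, 1, 1) -> (N, num_classes)
)

x = torch.randn(4, 128, 7, 7)                    # dummy batch of 128-channel feature maps
probs = torch.softmax(head(x), dim=1)            # fed directly into the softmax
print(probs.shape)                               # torch.Size([4, 10])
```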
An example: a decision tree that is likely to overfit the data

Underfitting and Overfitting
Figure: model fit vs. model complexity (complexity of a decision tree := number of nodes), ranging from underfitting to overfitting.

How Overfitting affects Prediction
Figure: predictive error vs. model complexity; underfitting at low complexity, overfitting at high complexity, with an ideal range for model complexity in between.
Bias and Variance
• In statistics and machine learning, the bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set:
• The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting), i.e., the model class does not contain the solution.
• The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outputs, i.e., the model class is too general and the model also learns the noise (see the sketch below).
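As a rough illustration (a NumPy-only sketch; the data and polynomial degrees are made up for this example), fitting polynomials of increasing degree to noisy samples of a cubic shows the tradeoff: a degree-1 fit underfits (high bias), while a high-degree fit typically drives training error down but lets the held-out error grow (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = lambda x: x**3 - x                      # underlying signal
x_train = np.linspace(-1, 1, 20)
y_train = true_fn(x_train) + rng.normal(scale=0.1, size=x_train.shape)
x_test = np.linspace(-1, 1, 200)                  # held-out points (noise-free target)
y_test = true_fn(x_test)

for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)            # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```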
Bias and Variance
• Ensemble methods
• Combine learners to reduce variance
From Elder, John. From Trees to Forests and Rule Sets - A Unified Overview of Ensemble Methods. 2007.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep
residual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition (pp. 770-778).
If the identity mapping f(x)=x is the desired underlying mapping, the residual mapping amounts to g(x)=0 and is thus easier to learn: we only need to push the weights and biases of the upper weight layers in the residual branch (e.g., a fully connected or convolutional layer) to zero (see the code sketch below).
Residual Blocks
Figure: a residual block in which activation a^[l] passes through two weight layers to a^[l+1] and a^[l+2], with a skip connection carrying a^[l] around the two layers and adding it back before the final activation.
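A minimal sketch, assuming PyTorch, of a residual block in this spirit: two 3x3 convolutions learn the residual g(x), and the skip connection adds the input back, so learning the identity only requires pushing the branch's weights towards zero. The batch norm placement follows common practice and is an assumption, not something fixed by the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convs learn the residual g(x); the skip connection adds x back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        g = F.relu(self.bn1(self.conv1(x)))   # residual branch
        g = self.bn2(self.conv2(g))
        return F.relu(g + x)                  # output = g(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)             # torch.Size([1, 64, 56, 56])
```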
Figure: the full ResNet architecture, reading from the input upward: a beginning conv layer (7x7 conv, 64, /2) followed by pooling, then a deep stack of residual blocks, each a pair of 3x3 conv layers (64 channels in the portion shown) with a skip connection, and finally a pool, FC 1000, and softmax.
- Beginning conv layer: 7x7 conv, 64, stride 2, applied to the input
- Residual blocks: pairs of 3x3 conv layers repeated throughout the network
- No FC layers at the end (only FC 1000 to output classes); a code sketch of this layout follows
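A minimal sketch of this overall layout, assuming PyTorch and reusing the hypothetical ResidualBlock class from the sketch above; the channel and block counts are illustrative, not the full ResNet-34/50 schedule.

```python
import torch
import torch.nn as nn

# assumes the ResidualBlock class sketched above is in scope
tiny_resnet = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # beginning conv layer: 7x7, 64, /2
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # pool
    ResidualBlock(64),                                       # stack of residual blocks
    ResidualBlock(64),
    ResidualBlock(64),
    nn.AdaptiveAvgPool2d(1),                                 # pool (global average pooling)
    nn.Flatten(),
    nn.Linear(64, 1000),                                     # only FC 1000 to output classes
)

x = torch.randn(2, 3, 224, 224)
print(tiny_resnet(x).shape)                                  # torch.Size([2, 1000])
```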
ResNet Architecture
Figure: a basic residual block whose input and output have the same shape: 28x28x256 in, 28x28x256 out.
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive
into deep learning. arXiv preprint arXiv:2106.11342.
For deeper networks (ResNet-50+), use a “bottleneck” layer to improve efficiency (similar to GoogLeNet): for the 28x28x256 input, a 1x1 conv first projects down to 64 feature maps, the 3x3 conv then operates over only 64 feature maps, and a 1x1 conv with 256 filters projects back to 256 feature maps, giving the 28x28x256 output (sketched in code below).
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
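A minimal sketch, assuming PyTorch, of such a bottleneck block with the illustrative sizes from the figure (256 channels in and out, 64 in the middle).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 over few maps -> 1x1 expand, plus the skip connection."""
    def __init__(self, channels=256, mid=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)          # project 256 -> 64 maps
        self.conv3x3 = nn.Conv2d(mid, mid, kernel_size=3, padding=1)   # 3x3 over only 64 maps
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)          # 1x1, 256 filters: back to 256

    def forward(self, x):
        g = F.relu(self.reduce(x))
        g = F.relu(self.conv3x3(g))
        g = self.expand(g)
        return F.relu(g + x)        # 28x28x256 in, 28x28x256 out

x = torch.randn(1, 256, 28, 28)
print(Bottleneck()(x).shape)        # torch.Size([1, 256, 28, 28])
```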
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top-5 error). At runtime it is faster than a VGGNet, even though it has 8x more layers.
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Case Study: ResNet [He et al., 2015]
For a 224x224x3 input, the spatial dimension is only 56x56 after the initial 7x7 conv (stride 2) and pooling.
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016
Comparing complexity...
• Inception-v4: ResNet + Inception!
• VGG: highest memory, most operations
• GoogLeNet: most efficient
• AlexNet: smaller compute, still memory heavy, lower accuracy
• ResNet: moderate efficiency depending on model, highest accuracy
We can take some inspiration from the Inception block of Fig. 8.4.1 which has information flowing through the
block in separate groups. Applying the idea of multiple independent groups to the ResNet block of Fig.
8.6.3 led to the design of ResNeXt (Xie et al., 2017). Different from the smorgasbord of transformations in
Inception, ResNeXt adopts the same transformation in all branches, thus minimizing the need for manual
tuning of each branch.
Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2021). Dive
into deep learning. arXiv preprint arXiv:2106.11342.
• Breaking up a convolution from $c_i$ to $c_o$ channels into $g$ groups of size $c_i/g$, generating $g$ outputs of size $c_o/g$, is called, quite fittingly, a grouped convolution. The computational cost (proportionally) is reduced from $O(c_i \cdot c_o)$ to $O(g \cdot (c_i/g) \cdot (c_o/g)) = O(c_i \cdot c_o / g)$, i.e., it is $g$ times faster. Even better, the number of parameters needed to generate the output is also reduced from a $c_i \times c_o$ matrix to $g$ smaller matrices of size $(c_i/g) \times (c_o/g)$, again a $g$-fold reduction. In what follows we assume that both $c_i$ and $c_o$ are divisible by $g$ (see the first sketch after this list).
• The only challenge in this design is that no information is exchanged between the $g$ groups. The ResNeXt block amends this in two ways: the grouped convolution with a $3 \times 3$ kernel is sandwiched in between two $1 \times 1$ convolutions, and the second of these serves double duty in changing the number of channels back. The benefit is that we only pay the $O(c \cdot b)$ cost for the $1 \times 1$ kernels and can make do with an $O(b^2/g)$ cost for the $3 \times 3$ kernels, where $b$ is the bottleneck width of the block. Similar to the residual block implementation, the residual connection is replaced (and thus generalized) by a $1 \times 1$ convolution (see the second sketch after this list).
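A minimal sketch, assuming PyTorch and illustrative sizes ($c_i = c_o = 256$, $g = 32$), of the parameter reduction from grouped convolution: the dense 3x3 conv has exactly $g$ times as many weights as the grouped one.

```python
import torch.nn as nn

c_i, c_o, g = 256, 256, 32

dense = nn.Conv2d(c_i, c_o, kernel_size=3, padding=1, bias=False)              # ordinary conv
grouped = nn.Conv2d(c_i, c_o, kernel_size=3, padding=1, groups=g, bias=False)  # grouped conv

n_dense = sum(p.numel() for p in dense.parameters())     # c_i * c_o * 3 * 3
n_grouped = sum(p.numel() for p in grouped.parameters()) # g * (c_i/g) * (c_o/g) * 3 * 3

print(n_dense, n_grouped, n_dense // n_grouped)          # 589824 18432 32  -> a g-fold reduction
```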
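And a minimal sketch, again assuming PyTorch with illustrative channel counts (256 in and out, bottleneck width b = 128, g = 32 groups), of a ResNeXt-style block: the grouped 3x3 convolution sits between two 1x1 convolutions, which are what mix information across the groups; the identity skip here stands in for the 1x1 convolution used when shapes change.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResNeXtBlock(nn.Module):
    """1x1 conv -> grouped 3x3 conv -> 1x1 conv, with a residual connection."""
    def __init__(self, channels=256, b=128, g=32):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, b, kernel_size=1)                # project to bottleneck width b
        self.conv2 = nn.Conv2d(b, b, kernel_size=3, padding=1, groups=g)  # grouped 3x3 convolution
        self.conv3 = nn.Conv2d(b, channels, kernel_size=1)                # change the channel count back

    def forward(self, x):
        y = F.relu(self.conv1(x))
        y = F.relu(self.conv2(y))    # no information flows between the g groups here
        y = self.conv3(y)            # the 1x1 convs mix information across all channels/groups
        return F.relu(y + x)         # residual connection (identity; a 1x1 conv if shapes differ)

x = torch.randn(1, 256, 14, 14)
print(ResNeXtBlock()(x).shape)       # torch.Size([1, 256, 14, 14])
```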
“You need a lot of data if you want to train/use CNNs”
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 27 Jan 2016