Lecture2 Advanced CNN
Nguyen Quang Uy
1
Outline
1. AlexNet
2. VGGNet
3. GoogLeNet
4. ResNet
5. MobileNet
6. EfficientNet
Legends
Layers
Activation functions
Modules/Blocks
Repeated layers
AlexNet
Overview
• Paper: ImageNet Classification with Deep Convolutional Neural Networks
• Published in: NeurIPS 2012.
• Widely considered one of the most influential papers in computer vision.
Novelties
• Use Rectified Linear Units (ReLUs) as activation functions.
• Use dropout layers to reduce overfitting.
• Use data augmentation.
Architecture
• AlexNet has 8 layers: 5 convolutional and 3 fully-connected.
• AlexNet has about 60M parameters.
Results
• Top-1 error rate: 37.5%
• Top-5 error rate: 17.0%
VGG
Overview
• VGG: Visual Geometry Group
• Paper: Very Deep Convolutional Networks for Large-Scale Image
Recognition
• Published on: arXiv, 2014
Novelties
• Design deeper networks (roughly twice as deep as AlexNet) by stacking uniform
convolutions.
• Use only 3x3 kernels, as opposed to the larger kernels in AlexNet (up to 11x11). A stack
of 3x3 convolutions covers the same receptive field as one larger kernel with fewer
parameters.
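The parameter saving from small kernels can be checked with simple arithmetic. The sketch below compares one 5x5 convolution with two stacked 3x3 convolutions, which cover the same 5x5 receptive field. The channel count of 64 and the helper name `conv_params` are assumptions chosen for illustration, and biases are ignored.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in a square convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # assumed channel count, for illustration only

# One 5x5 layer versus two stacked 3x3 layers with the same receptive field.
single_5x5 = conv_params(5, channels, channels)
stacked_3x3 = 2 * conv_params(3, channels, channels)

print(single_5x5)   # 102400
print(stacked_3x3)  # 73728
```

Under these assumptions the stacked version uses about 28% fewer weights, and it also adds an extra non-linearity between the two layers.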
Architecture
• VGG-16 has 13 convolutional and 3 fully-connected layers.
• It follows the AlexNet design but stacks more layers.
• It has about 138M parameters.
VGG results
• Top-1 accuracy is 71.5%
• Top-5 accuracy is 90.1%
GoogLeNet
Overview
• Also known as Inception-v1
• Paper: Going Deeper with Convolutions
• Published in: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
• Achieves results competitive with human performance.
Novelties
• Build networks from modules/blocks instead of simply stacking convolutional layers.
• 1×1 convolutions are used for dimensionality reduction to remove computational bottlenecks.
• Parallel convolutions with 1×1, 3×3 and 5×5 filters, followed by concatenation.
• Two auxiliary classifiers encourage discrimination in the lower stages.
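To see why the 1×1 reduction removes a bottleneck, the sketch below counts multiply-adds for a 5×5 convolution applied directly and for the same convolution preceded by a 1×1 reduction. The feature-map size and channel counts are assumed values chosen for illustration, and `conv_mult_adds` is a hypothetical helper.

```python
def conv_mult_adds(kernel_size, in_channels, out_channels, height, width):
    """Multiply-adds for a stride-1 convolution that keeps the spatial size."""
    return kernel_size * kernel_size * in_channels * out_channels * height * width


# Assumed sizes, for illustration only.
H = W = 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_mult_adds(5, C_IN, C_OUT, H, W)

# 1x1 reduction to C_REDUCED channels, then the 5x5 convolution.
reduced = (conv_mult_adds(1, C_IN, C_REDUCED, H, W)
           + conv_mult_adds(5, C_REDUCED, C_OUT, H, W))

print(direct)   # 120422400
print(reduced)  # 12443648
```

With these sizes the reduced branch needs roughly a tenth of the multiply-adds of the direct 5×5 convolution.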
Architecture
• Stem and Inception module.
Results
• Top-1 accuracy is 78.2%
• Top-5 accuracy is 94.1%
• Human top-5 error is about 5%-8%.
ResNet
Overview
• Paper: Deep Residual Learning for Image Recognition.
• Published in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).
• The first network to achieve better results than humans.
Novelties
• Popularise skip connections (they weren’t the first to use skip connections).
• Design even deeper CNNs (up to 152 layers) without compromising the model’s
generalisation power.
• Among the first to use batch normalisation.
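A skip connection adds the block input back to the block output, so the layers only have to learn a residual F(x). The following is a minimal sketch on plain Python lists, not the actual ResNet layers: `residual_block` and `zero_transform` are hypothetical names, and `transform` stands in for the stacked convolutions and batch normalisation.

```python
def relu(values):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Return relu(F(x) + x), where `transform` plays the role of F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must match the shape of x for an identity skip")
    return relu([f + xi for f, xi in zip(fx, x)])


# Hypothetical F that outputs zeros: the block then reduces to the
# identity mapping for non-negative inputs.
def zero_transform(x):
    return [0.0 for _ in x]


x = [1.0, 2.0, 3.0]
out = residual_block(x, zero_transform)
print(out)  # [1.0, 2.0, 3.0]
```

Because the identity is this easy to represent, adding more blocks should not make the network harder to fit, which is the intuition behind training very deep ResNets.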
Architecture
• Conv block and Identity module.
ResNet results
• Top-1 accuracy is 87.0%.
• Top-5 accuracy is 96.3%.
• Top-5 human accuracy is 95.0%.
MobileNet
Overview
• Paper: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision
Applications
• Published on: arXiv, 2017.
• Designed specifically for use on mobile devices.
Novelties
• MobileNet uses depthwise separable convolutions, which significantly reduce the number
of parameters.
• It introduces two shrinking hyperparameters (a width multiplier and a resolution
multiplier) that efficiently trade off between latency and accuracy.
Architecture
• Depthwise separable convolution.
Architecture
• Depthwise convolution:
• In a normal convolution, all channels of a kernel are combined to produce a single feature map.
• In a depthwise convolution, each channel of a kernel produces its own feature map.
Architecture
• Pointwise convolution.
• In a normal convolution, we need 256 filters of size 5x5x3.
• In a pointwise convolution, we only need 256 filters of size 1x1x3, applied to the output
of the depthwise convolution.
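The filter sizes in this example can be turned into weight counts. The sketch below assumes the depthwise step uses one 5x5 kernel per input channel, as described on the depthwise convolution slide; the constant names are illustrative and biases are ignored.

```python
KERNEL = 5          # spatial kernel size from the example
IN_CHANNELS = 3     # input channels from the example
OUT_CHANNELS = 256  # number of output feature maps from the example

# Normal convolution: 256 filters of size 5x5x3.
standard_weights = OUT_CHANNELS * KERNEL * KERNEL * IN_CHANNELS

# Depthwise step: one 5x5 kernel per input channel.
depthwise_weights = IN_CHANNELS * KERNEL * KERNEL

# Pointwise step: 256 filters of size 1x1x3.
pointwise_weights = OUT_CHANNELS * 1 * 1 * IN_CHANNELS

separable_weights = depthwise_weights + pointwise_weights

print(standard_weights)   # 19200
print(separable_weights)  # 843
```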
Computation cost
• Standard convolution
• Depthwise convolution
• The total computational cost of a depthwise separable convolution is the sum of the
depthwise and pointwise convolution costs.
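A minimal sketch of these costs, using the notation of the MobileNets paper: D_K is the kernel size, M the number of input channels, N the number of output channels, and D_F the spatial size of the output feature map. The function names and the example sizes are assumptions for illustration.

```python
def standard_cost(d_k, m, n, d_f):
    """Standard convolution: D_K * D_K * M * N * D_F * D_F multiply-adds."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_cost(d_k, m, d_f):
    """Depthwise convolution: D_K * D_K * M * D_F * D_F multiply-adds."""
    return d_k * d_k * m * d_f * d_f


def pointwise_cost(m, n, d_f):
    """Pointwise (1x1) convolution: M * N * D_F * D_F multiply-adds."""
    return m * n * d_f * d_f


def separable_cost(d_k, m, n, d_f):
    """Depthwise separable convolution: depthwise cost plus pointwise cost."""
    return depthwise_cost(d_k, m, d_f) + pointwise_cost(m, n, d_f)


# Assumed example sizes, for illustration only.
d_k, m, n, d_f = 3, 512, 512, 14

# The ratio simplifies to 1/N + 1/D_K**2.
ratio = separable_cost(d_k, m, n, d_f) / standard_cost(d_k, m, n, d_f)
print(ratio)
```

For 3x3 kernels the ratio is a little above 1/9, so the separable form needs roughly 8 to 9 times fewer multiply-adds than the standard convolution.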
Results
• MobileNet achieves accuracy comparable to GoogLeNet and VGG with far fewer
operations and parameters.
EfficientNet
Overview
• Paper: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
• Published in: International Conference on Machine Learning, 2019.
• It was state of the art on ImageNet at the time of publication.
Novelties
• Compound Scaling from B0 to B7.
• The EfficientNet Architecture (developed using Neural Architecture Search)
Architecture
Compound scaling
• The most common way to scale up ConvNets was to increase only one of depth (number of
layers), width (number of channels) or image resolution (image size).
• EfficientNets use compound scaling to scale all three dimensions together while
maintaining a balance between them.
• Compound scaling makes sense because if the input image is bigger, the network
needs more layers (depth) to enlarge the receptive field and more channels
(width) to capture more fine-grained patterns.
Neural Architecture Search
• A reinforcement-learning-based approach used to develop EfficientNet-B0, with a
multi-objective search that optimizes for both accuracy and FLOPS.
• The objective function can formally be defined as:
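As a sketch, the search objective in the EfficientNet paper has the form ACC(m) × [FLOPS(m) / T]^w, where T is the target FLOPS and w = -0.07 controls the trade-off between accuracy and FLOPS. The function name and the example numbers below are assumptions for illustration.

```python
def nas_objective(accuracy, flops, target_flops, w=-0.07):
    """ACC(m) * (FLOPS(m) / T) ** w, the multi-objective search reward."""
    return accuracy * (flops / target_flops) ** w


# Assumed example values, for illustration only.
TARGET_FLOPS = 400e6

on_target = nas_objective(0.76, 400e6, TARGET_FLOPS)   # no penalty at the target
too_costly = nas_objective(0.76, 800e6, TARGET_FLOPS)  # penalised for extra FLOPS

print(on_target, too_costly)
```

Because w is negative, a model that exceeds the target FLOPS scores lower than an equally accurate model that meets it.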
Mobile inverted bottleneck convolution (MBConv)
• MBConv without squeeze-and-excitation operation
• MBConv with squeeze-and-excitation operation
Squeeze and excitation operation
• Access to global information
• Modelling channel interdependencies
• Which can be regarded as a self-attention function on channels
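The three points above can be sketched as a squeeze step (global average pooling), an excitation step (two small fully-connected layers ending in a sigmoid), and a per-channel rescaling. This is a minimal pure-Python illustration following the original squeeze-and-excitation formulation, not the EfficientNet implementation; the function names, the bias-free layers and the weight layout are assumptions.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(inputs, weights):
    """Bias-free fully-connected layer; each row of `weights` gives one output."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def squeeze_excite(feature_maps, w_reduce, w_expand):
    """feature_maps: list of channels, each a 2-D list of floats."""
    # Squeeze: global average pooling gives one value per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_maps]
    # Excitation: bottleneck layer with ReLU, then a layer with sigmoid gates.
    hidden = [max(0.0, h) for h in dense(squeezed, w_reduce)]
    gates = [sigmoid(g) for g in dense(hidden, w_expand)]
    # Scale: reweight every channel by its gate (attention over channels).
    return [[[v * g for v in row] for row in ch]
            for ch, g in zip(feature_maps, gates)]


# Hypothetical input: 2 channels of size 2x2, with all-zero weights so
# every gate is sigmoid(0) = 0.5.
x = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
out = squeeze_excite(x, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]])
print(out)  # [[[0.5, 1.0], [1.5, 2.0]], [[1.0, 1.0], [1.0, 1.0]]]
```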
Scaling EfficientNet-B0 to get B1-B7
• Let the network depth (d), width (w) and input image resolution (r) be:
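In the EfficientNet paper the three dimensions are tied to one compound coefficient φ: d = α^φ, w = β^φ and r = γ^φ, subject to α · β² · γ² ≈ 2 with α, β, γ ≥ 1. The sketch below uses the values the paper reports for the B0 baseline (α = 1.2, β = 1.1, γ = 1.15). The function name is an assumption, and the output illustrates the rule rather than the exact released B1-B7 configurations.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


# FLOPS grow by roughly this factor each time phi increases by one.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
print(round(flops_factor, 2))  # 1.92

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 3), round(w, 3), round(r, 3))
```

Since FLOPS scale with d · w² · r², the constraint means each unit increase in φ roughly doubles the computational cost.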
Q&A
Thank you!