Introduction to CNN with
Application to Object Recognition
by Vivek Gandhi, CTO & Co-founder, Artifacia
(@gandhivivek9)
December 10, 2016
Agenda
Introduction
History, progress and current state - CNN
Basic Neural Net
What is convolution?
Components of a CNN and the role of each
Architectural tweaks
Further readings
Introduction
What is Intelligence?
What is Artificial Intelligence?
What do you understand by Object Recognition?
Why do you think we need Object Recognition?
(Fei-Fei Li & Andrej Karpathy, Lecture 2-3, 7 Jan 2015)
Introduction (Continued)
What led to the progress of Object Recognition?
Fei-Fei Li & Andrej Karpathy Lecture 1-24 5-Jan-15
(Slide from Kaiming He’s Presentation)
Basic Neural Networks
Convolution Layer
32x32x3 image
5x5x3 filter: convolve (slide) over all spatial locations
activation map: 28x28x1
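The sliding-filter picture above can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration, not the code behind the slide: the function name `conv2d_single_filter` and the random inputs are assumptions, there is no padding or bias, and (as is usual in CNN libraries) the filter is applied without flipping, i.e. as a cross-correlation.

```python
import numpy as np

def conv2d_single_filter(image, kernel, stride=1):
    """Slide one filter over an H x W x C image (no padding).

    Returns a 2-D activation map; each entry is the dot product between
    the filter and the image patch under it.
    """
    h, w, c = image.shape
    fh, fw, fc = kernel.shape
    if c != fc:
        raise ValueError("filter depth must match image depth")
    out_h = (h - fh) // stride + 1
    out_w = (w - fw) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + fh, left:left + fw, :]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # stand-in for a 32x32x3 image
kernel = rng.standard_normal((5, 5, 3))    # one 5x5x3 filter
activation_map = conv2d_single_filter(image, kernel)
print(activation_map.shape)  # (28, 28)
```

One filter yields one 28x28 map; a layer with several filters would stack one such map per filter along the depth axis.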
MAX POOLING
Single depth slice (x across, y down):
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2:
6 8
3 4
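The max pooling example on this slide can be reproduced with a short NumPy sketch. The helper name `max_pool2d` is an assumption for illustration; it handles a single depth slice only.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over one 2-D depth slice, no padding."""
    h, w = x.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Keep only the largest value inside each window.
            out[i, j] = x[top:top + size, left:left + size].max()
    return out

depth_slice = np.array([
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
])
pooled = max_pool2d(depth_slice, size=2, stride=2)
print(pooled)
# [[6 8]
#  [3 4]]
```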
Activation Functions
Convolutional Neural Networks: Case Study
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4
=>
Q: what is the output volume size? Hint: (227-11)/4+1 = 55
Output volume [55x55x96]
Q: What is the total number of parameters in this layer?
Parameters: (11*11*3)*96 = 35K
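The two CONV1 calculations can be checked with a small sketch. The helper names `conv_output_size` and `conv_param_count` are hypothetical; the formula is the one from the hint, (W - F + 2P)/S + 1, and biases are optional because the slide counts weights only.

```python
def conv_output_size(input_size, filter_size, stride, pad=0):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * pad) // stride + 1

def conv_param_count(filter_size, in_depth, num_filters, with_bias=False):
    """Weights in a conv layer, optionally adding one bias per filter."""
    weights = filter_size * filter_size * in_depth * num_filters
    return weights + (num_filters if with_bias else 0)

side = conv_output_size(227, 11, stride=4)            # 55
weights = conv_param_count(11, 3, 96)                  # 34848, about 35K
with_bias = conv_param_count(11, 3, 96, with_bias=True)  # 34944
print(f"output volume: [{side}x{side}x96], weights: {weights}, with biases: {with_bias}")
```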
Case Study: AlexNet
[Krizhevsky et al. 2012]
Input: 227x227x3 images
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2
Q: what is the output volume size? Hint: (55-3)/2+1 = 27
Output volume: 27x27x96
Q: what is the number of parameters in this layer?
Parameters: 0!
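A rough sketch of why POOL1 has no parameters: the operation below takes a maximum over each 3x3 window, independently per depth slice, and nothing in it is learned. The function name `max_pool_volume` and the random input volume are assumptions for illustration.

```python
import numpy as np

def max_pool_volume(volume, size=3, stride=2):
    """Max pool an H x W x D volume; depth is preserved, no weights involved."""
    h, w, d = volume.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w, d), dtype=volume.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = volume[top:top + size, left:left + size, :]
            # Reduce over the two spatial axes only, one maximum per depth slice.
            out[i, j, :] = window.max(axis=(0, 1))
    return out

rng = np.random.default_rng(0)
conv1_output = rng.standard_normal((55, 55, 96))  # stand-in for CONV1 activations
pooled = max_pool_volume(conv1_output, size=3, stride=2)
print(pooled.shape)  # (27, 27, 96)
```

With a 3x3 window and stride 2 the windows overlap by one row and one column, which is why the output is 27x27 rather than 18x18.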
Case Study: AlexNet
[Krizhevsky et al. 2012]
Full (simplified) AlexNet architecture:
[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
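The spatial sizes in the architecture listing can be traced layer by layer. This is a sketch under the settings shown on the slide; the names `output_side`, `ALEXNET_LAYERS` and `trace_shapes` are hypothetical, and the NORM layers are left out because they do not change the shape.

```python
def output_side(side, filter_size, stride, pad=0):
    """Spatial output size of a conv or pool layer: (W - F + 2P) / S + 1."""
    span = side - filter_size + 2 * pad
    if span % stride != 0:
        raise ValueError("filter does not tile the input evenly")
    return span // stride + 1

# (name, filter size, stride, pad, number of filters or None for pooling)
ALEXNET_LAYERS = [
    ("CONV1", 11, 4, 0, 96),
    ("POOL1", 3, 2, 0, None),
    ("CONV2", 5, 1, 2, 256),
    ("POOL2", 3, 2, 0, None),
    ("CONV3", 3, 1, 1, 384),
    ("CONV4", 3, 1, 1, 384),
    ("CONV5", 3, 1, 1, 256),
    ("POOL3", 3, 2, 0, None),
]

def trace_shapes(side=227, depth=3, layers=ALEXNET_LAYERS):
    """Return (name, height, width, depth) after each conv/pool layer."""
    shapes = []
    for name, filter_size, stride, pad, num_filters in layers:
        side = output_side(side, filter_size, stride, pad)
        if num_filters is not None:   # conv layers set the depth, pooling keeps it
            depth = num_filters
        shapes.append((name, side, side, depth))
    return shapes

for name, h, w, d in trace_shapes():
    print(f"[{h}x{w}x{d}] {name}")
```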
Compared to LeCun 1998:
1 DATA:
- More data: 10^6 vs. 10^3
2 COMPUTE:
- GPU (~20x speedup)
3 ALGORITHM:
- Deeper: More layers (8 weight layers)
- Fancy regularization (dropout)
- Fancy non-linearity (ReLU)
4 INFRASTRUCTURE:
- CUDA
Details/Retrospectives:
- first prominent use of ReLU in a large CNN
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, divided by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
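The optimizer settings listed above can be put into a single update step. This is a minimal sketch of one common formulation of SGD with momentum and L2 weight decay, not a reproduction of the original training code; the function names and the toy weight and gradient values are assumptions.

```python
import numpy as np

LEARNING_RATE = 1e-2   # initial learning rate from the slide
MOMENTUM = 0.9
WEIGHT_DECAY = 5e-4    # L2 weight decay coefficient

def sgd_momentum_step(weights, grad, velocity,
                      lr=LEARNING_RATE, momentum=MOMENTUM,
                      weight_decay=WEIGHT_DECAY):
    """One SGD step with momentum; L2 decay is added to the gradient."""
    grad = grad + weight_decay * weights
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity

def decayed_lr(lr, factor=10.0):
    """Divide the learning rate by `factor` once validation accuracy plateaus."""
    return lr / factor

# Toy example with made-up numbers.
w = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
v = np.zeros_like(w)
w, v = sgd_momentum_step(w, g, v)
print(w, v, decayed_lr(LEARNING_RATE))
```

In practice the gradient `g` would be averaged over a mini-batch (128 images in the settings above) before each step.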
Case Study: ZFNet [Zeiler and Fergus, 2013]
AlexNet but:
CONV1: change from (11x11 stride 4) to (7x7 stride 2)
CONV3,4,5: instead of 384, 384, 256 filters use 512, 1024, 512
ImageNet top 5 error: 15.4% -> 14.8%
Case Study: VGGNet
[Simonyan and Zisserman, 2014]
Best model: only 3x3 CONV (stride 1, pad 1) and 2x2 MAX POOL (stride 2)
Top 5 error: 11.2% (ILSVRC 2013) -> 7.3%
INPUT: [224x224x3] memory: 224*224*3=150K params: 0
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
CONV3-64: [224x224x64] memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
POOL2: [112x112x64] memory: 112*112*64=800K params: 0
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
CONV3-128: [112x112x128] memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
POOL2: [56x56x128] memory: 56*56*128=400K params: 0
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
CONV3-256: [56x56x256] memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
POOL2: [28x28x256] memory: 28*28*256=200K params: 0
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
CONV3-512: [28x28x512] memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
POOL2: [14x14x512] memory: 14*14*512=100K params: 0
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
CONV3-512: [14x14x512] memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
POOL2: [7x7x512] memory: 7*7*512=25K params: 0
FC: [1x1x4096] memory: 4096 params: 7*7*512*4096 = 102,760,448
FC: [1x1x4096] memory: 4096 params: 4096*4096 = 16,777,216
FC: [1x1x1000] memory: 1000 params: 4096*1000 = 4,096,000
(not counting biases)
TOTAL memory: 24M * 4 bytes ~= 93MB / image (only forward! ~*2 for bwd)
TOTAL params: 138M parameters
Note:
- Most memory is in early CONV
- Most params are in late FC
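The per-layer figures in the table can be re-added programmatically. This is a sketch that assumes the layer plan exactly as listed (3x3 convolutions with padding that preserves the spatial size, 2x2 pooling that halves it, biases ignored); the names `vgg16_counts`, `conv_plan` and `fc_plan` are hypothetical. The activation sum covers only the layer outputs listed in the table, so it need not match the slide's rounded memory total.

```python
def vgg16_counts(input_side=224, input_depth=3):
    """Sum activations (layer outputs) and weights for the VGG table, no biases."""
    conv_plan = [64, 64, "pool", 128, 128, "pool", 256, 256, 256, "pool",
                 512, 512, 512, "pool", 512, 512, 512, "pool"]
    fc_plan = [4096, 4096, 1000]

    side, depth = input_side, input_depth
    activations = side * side * depth      # the input volume itself
    params = 0
    fc_params = 0

    for item in conv_plan:
        if item == "pool":
            side //= 2                     # 2x2 pool, stride 2: halves H and W
        else:
            params += 3 * 3 * depth * item # 3x3 filters spanning the full input depth
            depth = item
        activations += side * side * depth

    features = side * side * depth         # 7*7*512 flattened
    for width in fc_plan:
        fc_params += features * width
        activations += width
        features = width

    return activations, params + fc_params, fc_params

activations, params, fc_params = vgg16_counts()
print(f"activations: {activations:,}")
print(f"params: {params:,} (FC share: {fc_params / params:.0%})")
```

The FC share printed at the end makes the note concrete: the three fully connected layers hold the large majority of the weights.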
Case Study: GoogLeNet [Szegedy et al., 2014]
Inception module
ILSVRC 2014 winner (6.7% top 5 error)
Case Study: GoogLeNet
Fun features:
- Only 5 million params! (Removes FC layers completely)
Compared to AlexNet:
- 12x fewer params
- 2x more compute
- 6.67% top 5 error (vs. 16.4%)
Slide from Kaiming He’s recent presentation https://www.youtube.com/watch?v=1PGLj-uKT1w
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)
References
● cs231n (Some of the slides were taken from the slides used in the course)
● Andrej Karpathy, Bay Area DL school 2016 slides
Further Reading
1. http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
2. https://arxiv.org/pdf/1512.03385v1.pdf
3. https://github.com/kjw0612/awesome-deep-vision
From Our Blog:
- http://research.artifacia.com/learn-deep-learning-the-hard-way
THANK YOU
Join: meetup.com/Artifacia-AI-Meet/

 
