Object Recognition II
Linda Shapiro
EE/CSE 576
with CNN slides from Ross Girshick
1
Outline
• Object detection
• the task, evaluation, datasets
• Convolutional Neural Networks (CNNs)
• overview and history
• Region-based Convolutional Networks (R-CNNs)
2
Image classification
• 𝐾 classes
• Task: assign correct class label to the whole image
Digit classification (MNIST) Object recognition (Caltech-101)
3
Classification vs. Detection
[Figure: classification assigns a single label ("Dog") to the whole image; detection localizes each dog instance with a bounding box.]
4
Problem formulation
[Figure: given an input image and the class set { airplane, bird, motorbike, person, sofa }, the desired output is a bounding box and class label for each object instance (here: person and motorbike).]
5
Evaluating a detector
Test image (previously unseen)
[Figure sequence: 'person' detector predictions appear one at a time with confidences 0.9, 0.6, and 0.2, and are then compared against the ground-truth 'person' boxes.]
Sort by confidence
[Figure: all detections sorted by decreasing confidence (0.9, 0.8, 0.6, 0.5, 0.2, 0.1); each is marked either a true positive (high overlap with a ground-truth box) or a false positive (no overlap, low overlap, or a duplicate detection).]
Evaluation metric
[Figure: the same sorted detections, with a confidence threshold t swept across the scores.]
precision@t = (# true positives @ t) / (# true positives @ t + # false positives @ t)
recall@t = (# true positives @ t) / (# ground-truth objects)
Evaluation metric
Average Precision (AP): summarizes the precision–recall curve; 0% is worst, 100% is best.
mean AP over classes (mAP)
[Figure: the same sorted detections.]
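For concreteness, here is a small sketch of precision/recall at a threshold t and of average precision. This is not the official PASCAL VOC evaluation code, and the true/false-positive flags and ground-truth count in the example are made up for illustration.

```python
# Sketch of precision@t, recall@t, and average precision (AP), given each detection's
# confidence and a flag saying whether it matched a ground-truth box.
import numpy as np

def precision_recall_at(confidences, is_true_positive, n_ground_truth, t):
    keep = np.asarray(confidences) >= t                  # detections above threshold t
    tp = np.sum(np.asarray(is_true_positive)[keep])
    fp = np.sum(keep) - tp
    return tp / max(tp + fp, 1), tp / n_ground_truth     # (precision@t, recall@t)

def average_precision(confidences, is_true_positive, n_ground_truth):
    order = np.argsort(confidences)[::-1]                # sort by decreasing confidence
    flags = np.asarray(is_true_positive)[order]
    tp = np.cumsum(flags)
    fp = np.cumsum(~flags)
    precision = tp / (tp + fp)
    recall = tp / n_ground_truth
    # Area under the precision-recall curve (simple summation form).
    return np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)

# Illustrative example: the slide's confidences with made-up TP/FP flags and GT count.
conf = [0.9, 0.8, 0.6, 0.5, 0.2, 0.1]
tp_flags = np.array([True, False, True, False, True, False])
print(average_precision(conf, tp_flags, n_ground_truth=5))
```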
Pedestrians
Histograms of Oriented Gradients for Human Detection,
Dalal and Triggs, CVPR 2005
AP ~77%
More sophisticated methods: AP ~90%
(a) average gradient image over training examples
(b) each “pixel” shows max positive SVM weight in the block centered on that pixel
(c) same as (b) for negative SVM weights
(d) test image
(e) its R-HOG descriptor
(f) R-HOG descriptor weighted by positive SVM weights
(g) R-HOG descriptor weighted by negative SVM weights
14
Overview of HOG Method
1. Compute gradients in the region to be described
2. Put them in orientation bins within small spatial cells
3. Group the cells into larger blocks
4. Normalize each block (see the sketch after this list)
5. Train classifiers to decide if these are parts of a human
15
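As a concrete illustration of steps 1–4, here is a minimal NumPy sketch of a HOG-style descriptor. It is not the authors' implementation; the 8x8 cells, 9 unsigned orientation bins, 2x2-cell blocks, and L2 block normalization follow the defaults described on the following slides.

```python
# A minimal HOG-style descriptor sketch:
# gradients -> 9 orientation bins per 8x8 cell -> 2x2-cell blocks -> L2 block normalization.
import numpy as np

def hog_sketch(img, cell=8, bins=9, block=2, eps=1e-6):
    img = img.astype(np.float64)
    # Step 1: gradients with the simple centered [-1 0 1] filters.
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # unsigned orientation

    # Step 2: per-cell orientation histograms, votes weighted by gradient magnitude.
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for cy in range(n_cy):
        for cx in range(n_cx):
            b = bin_idx[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell].ravel()
            m = mag[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell].ravel()
            np.add.at(hist[cy, cx], b, m)

    # Steps 3-4: group cells into overlapping 2x2 blocks and L2-normalize each block.
    feats = []
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            v = hist[by:by+block, bx:bx+block].ravel()
            feats.append(v / np.sqrt(np.sum(v**2) + eps**2))
    return np.concatenate(feats)   # a 64x128 window gives 7 * 15 * 4 * 9 = 3780 values

# desc = hog_sketch(np.random.rand(128, 64))   # desc.shape == (3780,)
```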
Details
• Gradients
[-1 0 1] and [-1 0 1]T were good enough filters.
• Cell Histograms
Each pixel within the cell casts a weighted vote for an
orientation-based histogram channel based on the values
found in the gradient computation. (9 channels worked)
• Blocks
Group the cells together into larger blocks, either R-HOG
blocks (rectangular) or C-HOG blocks (circular).
16
More Details
• Block Normalization
• If you think of the block as a vector v, then the
normalized block is v/norm(v)
They tried 4 different kinds of normalization (a sketch follows below):
• L1-norm
• L1-sqrt (square root of the L1-normalized vector)
• L2-norm
• L2-Hys (L2-norm followed by clipping and renormalizing)
17
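A small sketch of the four block-normalization schemes, assuming the block descriptor v is a non-negative NumPy vector (as HOG histograms are); the epsilon and clipping values here are illustrative, not the paper's exact constants.

```python
# The four block-normalization schemes tried by Dalal and Triggs, applied to a block vector v.
import numpy as np

def l1(v, eps=1e-6):                  # L1-norm
    return v / (np.abs(v).sum() + eps)

def l1_sqrt(v, eps=1e-6):             # L1-sqrt: square root of the L1-normalized vector
    return np.sqrt(v / (np.abs(v).sum() + eps))

def l2(v, eps=1e-6):                  # L2-norm
    return v / np.sqrt(np.sum(v**2) + eps**2)

def l2_hys(v, clip=0.2, eps=1e-6):    # L2-norm, clip large values, then renormalize
    v = np.clip(l2(v, eps), 0, clip)
    return v / np.sqrt(np.sum(v**2) + eps**2)
```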
Example: Dalal-Triggs pedestrian
detector
1. Extract a fixed-size (64x128 pixel) window at each position and scale
2. Compute HOG (histogram of oriented gradients) features within each window
3. Score the window with a linear SVM classifier
4. Perform non-maximum suppression to remove overlapping detections with lower scores (see the sketch below)
Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
18
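Step 4 can be sketched as the standard greedy non-maximum suppression procedure below. This is an illustrative implementation, not the detector's actual code; the box format (x1, y1, x2, y2) and the 0.5 IoU threshold are assumptions.

```python
# Greedy non-maximum suppression: keep the highest-scoring window, drop windows
# that overlap it too much, and repeat on what remains.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]          # highest-scoring windows first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection-over-union of the kept box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop lower-scoring detections that overlap the kept box too much.
        order = order[1:][iou < iou_thresh]
    return keep

boxes = np.array([[10, 10, 60, 110], [12, 12, 62, 112], [200, 50, 250, 150]], dtype=float)
print(nms(boxes, np.array([0.9, 0.6, 0.8])))   # keeps the 0.9 and 0.8 boxes
```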
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
19
[Figure: comparison of gradient filters — the simple centered [-1 0 1] mask outperforms the uncentered, cubic-corrected, diagonal, and Sobel alternatives.]
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
20
• Histogram of gradient orientations
• Votes weighted by magnitude
• Bilinear interpolation between cells
Orientation: 9 bins (for
unsigned angles)
Histograms in 8x8
pixel cells
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
21
Normalize with respect to
surrounding cells
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
22
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
# features = 15 x 7 (# cells) x 9 (# orientations) x 4 (# normalizations by neighboring cells) = 3780
23
Training set
24
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
[Figure: visualizations of the positive SVM weights (pos w) and the negative SVM weights (neg w).]
25
pedestrian
Slides by Pete Barnum Navneet Dalal and Bill Triggs, Histograms of Oriented Gradients for Human Detection, CVPR05
26
Detection examples
27
30
Deformable Parts Model
• Takes the idea a little further
• Instead of one rigid HOG model, we have multiple
HOG models in a spatial arrangement
• One root part is found first; the other parts are arranged around it in a tree structure.
31
The Idea
Articulated parts model
• Object is configuration of parts
• Each part is detectable
Images from Felzenszwalb
32
Deformable objects
Images from Caltech-256
Slide Credit: Duan Tran
33
Deformable objects
Images from D. Ramanan’s dataset
Slide Credit: Duan Tran
34
How to model spatial relations?
• Tree-shaped model
35
36
Hybrid template/parts model
Detections
Template Visualization
Felzenszwalb et al. 2008
37
Pictorial Structures Model
Appearance likelihood Geometry likelihood
38
Results for person matching
39
Results for person matching
40
BMVC 2009
41
2012 State-of-the-art Detector:
Deformable Parts Model (DPM)
42
Felzenszwalb et al., 2008, 2010, 2011, 2012
PASCAL VOC "Lifetime Achievement" Prize
1. Strong low-level features based on HOG
2. Efficient matching algorithms for deformable part-based
models (pictorial structures)
3. Discriminative learning with latent variables (latent SVM)
Why did gradient-based models work?
Average gradient image
43
Generic categories
Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?
PASCAL Visual Object Categories (VOC) dataset
44
Generic categories
Why doesn’t this work (as well)?
Can we detect people, chairs, horses, cars, dogs, buses, bottles, sheep …?
PASCAL Visual Object Categories (VOC) dataset
45
Quiz time
(Back to Girshick)
46
Warm up
This is an average image of which object class?
47
Warm up
pedestrian
48
A little harder
?
49
A little harder
?
Hint: airplane, bicycle, bus, car, cat, chair, cow, dog, dining table
50
A little harder
bicycle (PASCAL)
51
A little harder, yet
?
52
A little harder, yet
?
Hint: white blob on a green background
53
A little harder, yet
sheep (PASCAL)
54
Impossible?
?
55
Impossible?
dog (PASCAL)
56
Impossible?
dog (PASCAL)
Why does the mean look like this?
There’s no alignment between the examples!
How do we combat this? 57
PASCAL VOC detection history
[Chart: mean Average Precision (mAP) on PASCAL VOC detection by year, 2006–2015. mAP climbed from 17% through 23%, 28%, and 37% to a plateau around 41% as methods progressed from DPM to DPM with HOG+BOW, DPM with MKL, DPM++, DPM++ with MKL and Selective Search, and Selective Search with DPM++ and MKL.]
Part-based models & multiple features (MKL)
[Same PASCAL VOC mAP chart as the previous slide, highlighting the part-based-model and multiple-kernel-learning (MKL) era.]
Kitchen-sink approaches
[Same chart; the annotation "increasing complexity & plateau" marks how piling on features and components stopped improving mAP beyond roughly 41%.]
Region-based Convolutional Networks (R-CNNs)
[Same chart, extended with R-CNN results: R-CNN v1 at 53% mAP and R-CNN v2 at 62% mAP.]
[R-CNN. Girshick et al. CVPR 2014]
Region-based Convolutional Networks (R-CNNs)
[Same chart, annotated with the timescales "~1 year" and "~5 years", contrasting how quickly R-CNNs improved mAP with the roughly five years of incremental gains that preceded them.]
[R-CNN. Girshick et al. CVPR 2014]
Convolutional Neural Networks
• Overview
63
Standard Neural Networks
x = (x1, …, x784)^T,   z_j = g(w_j^T x),   g(t) = 1 / (1 + e^(−t))
"Fully connected"
64
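A minimal sketch of the fully connected layer on the slide, using the logistic non-linearity g(t) = 1 / (1 + e^(−t)) on a flattened 28x28 MNIST image; the 128-unit layer size and random weights are illustrative assumptions, not taken from the slides.

```python
# One fully connected layer: z_j = g(w_j^T x), applied to all units at once.
import numpy as np

def fully_connected(x, W):
    # x: (784,) input vector; W: (num_units, 784), one weight vector w_j per row.
    return 1.0 / (1.0 + np.exp(-(W @ x)))

x = np.random.rand(784)                 # flattened 28x28 image
W = 0.01 * np.random.randn(128, 784)    # 128 hidden units (illustrative size)
z = fully_connected(x, W)               # z.shape == (128,)
```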
From NNs to Convolutional NNs
• Local connectivity
• Shared (“tied”) weights
• Multiple feature maps
• Pooling
65
Convolutional NNs
• Local connectivity
• Each green unit is only connected to 3 neighboring blue units
(compare with the fully connected network on the previous slide)
66
Convolutional NNs
• Shared (“tied”) weights
• All green units share the same parameters 𝒘
• Each green unit computes the same function,
but with a different input window
[Figure: two adjacent green units, each connected to its three blue inputs by the same shared weights w1, w2, w3.]
67
Convolutional NNs
• Convolution with 1-D filter: [𝑤3, 𝑤2, 𝑤1]
• All green units share the same parameters 𝒘
• Each green unit computes the same function,
but with a different input window
[Figure: the 3-tap filter with weights w1, w2, w3 applied at one window position; slides 69–72 repeat this slide, stepping the same filter across successive input windows.]
68
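To make the shared-weights idea concrete, here is a small sketch of the 1-D case: every output unit applies the same three weights to a different window of the input. The example input and weight values are made up; only the structure follows the slides.

```python
# 1-D convolution with a shared 3-tap filter: unit i sees inputs x[i], x[i+1], x[i+2]
# and computes w1*x[i] + w2*x[i+1] + w3*x[i+2] with the same weights at every position
# (mathematically, convolution with the flipped filter [w3, w2, w1]).
import numpy as np

def conv1d_valid(x, w):
    w1, w2, w3 = w
    return np.array([w1*x[i] + w2*x[i+1] + w3*x[i+2] for i in range(len(x) - 2)])

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0])
print(conv1d_valid(x, (0.5, -1.0, 0.25)))   # one output per 3-unit window
```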
Convolutional NNs
• Multiple feature maps
• All orange units compute the same function, but with a different input window
• Orange and green units compute different functions
[Figure: feature map 1 (the array of green units) uses weights w1, w2, w3; feature map 2 (the array of orange units) uses weights w′1, w′2, w′3.]
73
Convolutional NNs
• Pooling (max, average)
[Figure: max pooling example — the inputs 1, 4, 0, 3 pooled with area 2 and stride 2 give the outputs 4, 3.]
• Pooling area: 2 units
• Pooling stride: 2 units
• Subsamples feature maps
74
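A minimal sketch of max pooling on a 1-D feature map that reproduces the slide's example; this is an illustration of the operation, not library code.

```python
# Max pooling with area 2 and stride 2: [1, 4, 0, 3] -> [4, 3].
import numpy as np

def max_pool_1d(feature_map, area=2, stride=2):
    fm = np.asarray(feature_map)
    return np.array([fm[i:i+area].max() for i in range(0, len(fm) - area + 1, stride)])

print(max_pool_1d([1, 4, 0, 3]))   # [4 3]
```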
[Figure: a 2-D ConvNet — convolution and pooling layers applied to a 2-D image input.]
75
1989
Backpropagation Applied to Handwritten Zip Code Recognition,
LeCun et al., 1989 76
Historical perspective – 1980
77
Historical perspective – 1980
Hubel and Wiesel
1962
Included basic ingredients of ConvNets, but no supervised learning algorithm
78
Supervised learning – 1986
Early demonstration that error backpropagation can be used
for supervised training of neural nets (including ConvNets)
Gradient descent training with error backpropagation
79
Supervised learning – 1986
“T” vs. “C” problem Simple ConvNet
80
Practical ConvNets
Gradient-Based Learning Applied to Document Recognition,
LeCun et al., 1998
81
Demo
• http://cs.stanford.edu/people/karpathy/convnetjs/demo/mnist.html
• ConvNetJS by Andrej Karpathy (Ph.D. student at
Stanford)
Software libraries
• Caffe (C++, python, matlab)
• Torch7 (C++, lua)
• Theano (python)
82
The fall of ConvNets
• The rise of Support Vector Machines (SVMs)
• Mathematical advantages (theory, convex
optimization)
• Competitive performance on tasks such as digit
classification
• Neural nets became unpopular in the mid 1990s
83
The key to SVMs
• It’s all about the features
Histograms of Oriented Gradients for Human Detection,
Dalal and Triggs, CVPR 2005
SVM weights
(+) (-)
HOG features
84
Core idea of “deep learning”
• Input: the “raw” signal (image, waveform, …)
• Features: hierarchy of features is learned from the
raw input
85
• If SVMs killed neural nets, how did they come back
(in computer vision)?
86
What’s new since the 1980s?
• More layers
• LeNet-3 and LeNet-5 had 3 and 5 learnable layers
• Current models have 8 – 20+
• “ReLU” non-linearities (Rectified Linear Unit)
• g(x) = max(0, x)
• Gradient doesn’t vanish
• “Dropout” regularization
• Fast GPU implementations
• More data
87
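A tiny sketch of the ReLU non-linearity and its gradient, which is 1 for x > 0 (so it does not vanish for active units) and 0 otherwise; the sample inputs are illustrative.

```python
# ReLU: g(x) = max(0, x), and its (sub)gradient.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    return (np.asarray(x) > 0).astype(np.float64)

print(relu(np.array([-2.0, 0.5, 3.0])))       # [0.  0.5 3. ]
print(relu_grad(np.array([-2.0, 0.5, 3.0])))  # [0. 1. 1.]
```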
What else? Object Proposals
• Sliding window based object detection
• Object proposals
• Fast execution
• High recall with low # of candidate boxes
[Pipeline 1 — sliding window: Image → Feature Extraction → Classification, iterating over window size, aspect ratio, and location.]
[Pipeline 2 — with proposals: Image → Object Proposal → Feature Extraction → Classification.]
88
The number of contours wholly enclosed by a bounding box is indicative of
the likelihood of the box containing an object.
89
Ross’s Own System: Region CNNs
Competitive Results
Top Regions for Six Object Classes
Finale
• Object recognition has moved rapidly over the last 12 years toward appearance-based methods.
• The HOG descriptor led to fast recognition of specific views of generic objects, starting with pedestrians and using SVMs.
• Deformable parts models extended that to objects with articulated limbs, but still for specific views.
• CNNs have become the method of choice; they
learn from huge amounts of data and can learn
multiple views of each object class.
93