• In $d$ dimensions, $c_0 + c_1 x_1 + \cdots + c_d x_d = 0$ is a hyperplane.
• Idea:
• Use $c_0 + c_1 x_1 + \cdots + c_d x_d \ge 0$ to denote positive classifications
• Use $c_0 + c_1 x_1 + \cdots + c_d x_d < 0$ to denote negative classifications
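A minimal NumPy sketch of this decision rule (the function name and the example coefficients are illustrative, not from the slides):

```python
import numpy as np

def linear_classify(x, c):
    """Hyperplane rule: positive iff c0 + c1*x1 + ... + cd*xd >= 0."""
    score = c[0] + np.dot(c[1:], x)
    return 1 if score >= 0 else 0

# Hypothetical example: the line x1 + x2 - 1 = 0 in 2D (c0 = -1, c1 = c2 = 1)
c = np.array([-1.0, 1.0, 1.0])
print(linear_classify(np.array([0.8, 0.9]), c))  # 1: positive side
print(linear_classify(np.array([0.1, 0.2]), c))  # 0: negative side
```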
[Figure: a single unit with inputs $x_1, \dots, x_i, \dots, x_n$, weights $w_i$, activation function $g$, and output $y$]
$y = g\left(\sum_{i=1}^{n} w_i x_i\right)$, where $g(u) = \dfrac{1}{1 + \exp(-au)}$
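A sketch of this unit in NumPy, using the sigmoid $g$ above (function names and the gain default $a = 1$ are illustrative):

```python
import numpy as np

def g(u, a=1.0):
    """Sigmoid activation: g(u) = 1 / (1 + exp(-a*u))."""
    return 1.0 / (1.0 + np.exp(-a * u))

def unit(x, w, a=1.0):
    """Single unit: y = g(sum_i w_i * x_i)."""
    return g(np.dot(w, x), a)

x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.8, -0.1])
print(unit(x, w))  # a value in (0, 1)
```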
[Figure: a multi-layer network with first-layer weights $w_{1j}$ and second-layer weights $w_{2k}$]
• Nonlinear classifier
• Training: find the network weights $w$ that minimize the error between the true training labels $y_i$ and the estimated labels $f_w(x_i)$:
$E(w) = \sum_{i=1}^{N} \left( y_i - f_w(x_i) \right)^2$
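A sketch of minimizing this objective by gradient descent for the simplest case, a single sigmoid unit $f_w(x) = g(w \cdot x)$ (real networks back-propagate the same error through all layers; the data and hyperparameters below are illustrative):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def train(X, y, lr=0.1, steps=5000):
    """Gradient descent on E(w) = sum_i (y_i - f_w(x_i))^2, f_w(x) = sigmoid(w.x)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        f = sigmoid(X @ w)
        # dE/dw = sum_i -2 * (y_i - f_i) * f_i * (1 - f_i) * x_i
        grad = (-2.0 * (y - f) * f * (1.0 - f)) @ X
        w -= lr * grad
    return w

# Toy AND-like data with a bias column; outputs move toward [0, 0, 0, 1]
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], float)
y = np.array([0, 0, 0, 1], float)
print(sigmoid(X @ train(X, y)).round(2))
```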
What we want: $h_w(x) \in \mathbb{R}^K$, e.g. for $K = 4$ classes:
$h_w(x) \approx [1, 0, 0, 0]^T$ when pedestrian
$h_w(x) \approx [0, 1, 0, 0]^T$ when car
$h_w(x) \approx [0, 0, 1, 0]^T$ when motorcycle
$h_w(x) \approx [0, 0, 0, 1]^T$ when truck
Classification: Softmax Classifier
• Softmax classifier (also known as multinomial logistic regression)
• Remember that we can get a score for each class
• Key: we want to interpret the raw scores as probabilities
o Probabilities must be ≥ 0
o Probabilities must sum to 1
$P(Y = k \mid X = x_i) = \dfrac{e^{s_k}}{\sum_j e^{s_j}}$
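A minimal sketch of this formula in NumPy (subtracting the max is a standard numerical-stability trick, not part of the slide's formula; the example scores are made up):

```python
import numpy as np

def softmax(s):
    """P(Y = k | X = x) = exp(s_k) / sum_j exp(s_j) for raw scores s."""
    e = np.exp(s - np.max(s))  # shift by max(s): same result, avoids overflow
    return e / e.sum()

scores = np.array([3.2, 5.1, -1.7])  # raw class scores
p = softmax(scores)
print(p)        # non-negative probabilities
print(p.sum())  # 1.0
```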
Fully Connected Layer
[Figure: the 32×32×3 input image is stretched into a 3072×1 vector $x$; the weights $W$ are 10×3072; the activation $Wx$ is 10×1, one number per class]
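In code, the whole layer is one matrix-vector product; a sketch with the dimensions above (random values stand in for real weights and pixels):

```python
import numpy as np

x = np.random.rand(3072)      # 32*32*3 image stretched into a 3072-vector
W = np.random.rand(10, 3072)  # one row of weights per output unit
activation = W @ x            # dot product of each row of W with x
print(activation.shape)       # (10,): one number per class
```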
Convolutional Layer
Convolve a 5×5×3 filter with a 32×32×3 image: at each spatial location the filter produces 1 number, the dot product between the filter and the 5×5×3 chunk of the image it covers. The filter depth (3) always matches the image depth.
Convolutional Layer
Activation/Feature Map: convolve (slide) the 5×5×3 filter over all spatial locations of the 32×32×3 image (32 height, 32 width, 3 depth). The result is a 28×28×1 activation map.
Convolutional Layer
For example, if we have six 5×5×3 filters, we get six separate 28×28 activation maps, stacked into a 28×28×6 output volume (see the sketch below).
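A naive sketch of this layer with explicit loops for clarity (real implementations are vectorized; shapes follow the slides, and the bias term is omitted):

```python
import numpy as np

def conv_layer(image, filters):
    """image: (32, 32, 3); filters: (6, 5, 5, 3) -> output: (28, 28, 6)."""
    H, W, _ = image.shape
    n, fh, fw, _ = filters.shape
    out = np.zeros((H - fh + 1, W - fw + 1, n))
    for k in range(n):                   # one activation map per filter
        for i in range(H - fh + 1):      # slide over all spatial
            for j in range(W - fw + 1):  # locations
                chunk = image[i:i + fh, j:j + fw, :]
                out[i, j, k] = np.sum(chunk * filters[k])  # 5*5*3 dot product
    return out

maps = conv_layer(np.random.rand(32, 32, 3), np.random.rand(6, 5, 5, 3))
print(maps.shape)  # (28, 28, 6): six separate 28x28 activation maps
```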
A single feature-extraction stage (Input → Feature Map):
1. Convolution (learned)
2. Non-linearity
3. Spatial pooling
4. Normalization
• Supervised training of the convolutional filters by back-propagating classification error
Slide credit: Rob Fergus
2. Non-Linearity
• Per-element (independent)
• Options:
• Tanh: $\tanh(x)$
• Sigmoid: $1/(1+\exp(-x))$
• Rectified linear unit (ReLU): $\max(0, x)$
• Simplifies backpropagation
• Makes learning faster
• Avoids saturation issues
→ Preferred option (see the sketch below)
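A quick side-by-side sketch of the three options (plain NumPy; function names are illustrative):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)                # saturates at -1 and +1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # saturates at 0 and 1

def relu(x):
    return np.maximum(0.0, x)        # gradient is exactly 1 for x > 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (tanh, sigmoid, relu):
    print(f.__name__, f(x))
```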
Slide credit: Rob Fergus
Pooling
• Creates some translational invariance at each level by averaging four neighboring replicated detectors to give a single output to the next level.
• Reduces the number of inputs to the next layer of feature extraction, allowing many more different feature maps.
• Taking the maximum of the four works slightly better than averaging (sketched below).
• Problem: after several levels of pooling, we lose information about where objects are.
• This makes it impossible to use the precise spatial relationships between high-level parts for recognition.
• So CNNs are good for classification, but not (directly) useful for object localization.
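A sketch of the "maximum of the four" variant as 2×2 max pooling (assumes even input dimensions; the helper name is made up):

```python
import numpy as np

def max_pool_2x2(fmap):
    """fmap: (H, W) feature map with even H, W -> (H//2, W//2)."""
    H, W = fmap.shape
    # Group into non-overlapping 2x2 blocks and keep the max of each:
    # four neighboring detector outputs become a single output.
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool_2x2(fmap))  # (2, 2): 4x fewer inputs to the next layer
```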
[Figure: feature maps before and after contrast normalization]
[Figure: spatial pool (sum), then normalize to unit length, giving the feature vector]
A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," NIPS 2012.
Slide credit: Rob Fergus
ImageNet Challenge 2012
• Similar framework to LeCun’98 but:
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• More data ($10^6$ vs. $10^3$ images)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Better regularization for training (DropOut)
• Krizhevsky et al.: 16.4% top-5 error
• Next best (non-convnet): 26.2% top-5 error
[Chart: top-5 error rate (%) by entry: SuperVision, ISI, Oxford, INRIA, Amsterdam]