cv_2025_Spring_16

The document outlines the course CSCI-B 457: Introduction to Computer Vision, focusing on deep learning and traditional recognition approaches. It discusses the evolution from hand-designed features to learning feature hierarchies through deep architectures, emphasizing the importance of convolutional neural networks (CNNs) in image classification. Additionally, it covers concepts such as linear classifiers, backpropagation, and the success of CNNs in various recognition tasks, including the ImageNet Challenge.


CSCI-B 457

Introduction To Computer Vision


Luddy School of Informatics, IUB
Spring, 2025
Instructor: Xuhong Zhang
Deep Learning Advance II
Traditional Recognition Approach

Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class

• Features are not learned


• Trainable classifier is often generic (e.g. SVM)

Slide credit: Rob Fergus


Traditional Recognition Approach
• Features are key to recent progress in recognition
• Multitude of hand-designed features currently in use
• SIFT, HOG, …
• Where next? Better classifiers? Or keep building more features?

Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2007; Yan & Huang (winner of the PASCAL 2010 classification competition)

Slide credit: Rob Fergus


What about learning the features?
• Learn a feature hierarchy all the way from pixels to classifier
• Each layer extracts features from the output of the previous layer
• Train all layers jointly

Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier

Slide credit: Rob Fergus


“Shallow” vs. “deep” architectures
Traditional recognition: “Shallow” architecture

Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class

Deep learning: “Deep” architecture


Image/Video Pixels → Layer 1 → … → Layer N → Simple classifier → Object Class

Slide credit: Rob Fergus


Linear classifiers
• Linear classifiers model the boundary between
two classes:
[Figure: two classes of points in the (x1, x2) plane, separated by a straight line]

Adapted from K. Hauser’s slide


Plane Geometry
• In 3𝐷, a plane can be expressed as the set of
solutions (𝑥, 𝑦, 𝑧) to the equation 𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 = 0
• 𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 > 0 is one side of the plane
• 𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 < 0 is the other side
• 𝑎𝑥 + 𝑏𝑦 + 𝑐𝑧 + 𝑑 = 0 is the plane itself
[Figure: a plane in 3D with coordinate axes x, y, z]

Adapted from K. Hauser’s slide


Linear Classifier

• In 𝑑 dimensions, 𝑐0 + 𝑐1 𝑥1 + ⋯ + 𝑐𝑑 𝑥𝑑 = 0 is a hyperplane.
• Idea:
  • Use 𝑐0 + 𝑐1 𝑥1 + ⋯ + 𝑐𝑑 𝑥𝑑 ≥ 0 to denote positive classifications
  • Use 𝑐0 + 𝑐1 𝑥1 + ⋯ + 𝑐𝑑 𝑥𝑑 < 0 to denote negative classifications

Adapted from K. Hauser’s slide
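As an illustration (not from the slides), here is a minimal sketch of this decision rule in NumPy; the coefficient vector and the test points are made-up values.

import numpy as np

# Hyperplane parameters: c[0] is the bias c0, c[1:] are c1..cd.
# The values are arbitrary, chosen only to illustrate the decision rule.
c = np.array([-1.0, 2.0, 0.5])   # c0, c1, c2  (d = 2 dimensions)

def classify(x):
    """Return +1 if c0 + c1*x1 + ... + cd*xd >= 0, else -1."""
    score = c[0] + np.dot(c[1:], x)
    return 1 if score >= 0 else -1

print(classify(np.array([1.0, 0.0])))   # -1 + 2*1 + 0.5*0 = 1.0 >= 0  -> +1
print(classify(np.array([0.0, 1.0])))   # -1 + 2*0 + 0.5*1 = -0.5 < 0  -> -1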


Perceptrons

“the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
Frank Rosenblatt, 1958
Unit (Neuron)

[Figure: a unit with inputs x1, …, xn, weights wi, a summation Σ, and an activation function g producing output y]

y = g(Σᵢ wᵢ xᵢ),   g(u) = 1 / (1 + exp(−a·u))   (sigmoid)

Adapted from K. Hauser’s slide
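A minimal sketch of such a unit in NumPy (the example weights, inputs, and the steepness parameter a are illustrative assumptions, not from the slides):

import numpy as np

def sigmoid(u, a=1.0):
    """g(u) = 1 / (1 + exp(-a*u))"""
    return 1.0 / (1.0 + np.exp(-a * u))

def unit_output(x, w, a=1.0):
    """y = g(sum_i w_i * x_i) for one neuron."""
    return sigmoid(np.dot(w, x), a)

x = np.array([0.5, -1.0, 2.0])      # example inputs (made up)
w = np.array([0.8,  0.3, -0.5])     # example weights (made up)
print(unit_output(x, w))            # a value between 0 and 1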


Training with Neurons

• Treat the problem as one of minimizing errors between the example label and the network output, given the example and weights as input
  • Error(xᵢ, yᵢ, w) = (yᵢ − f(xᵢ, w))²

• Sum this error term over all examples
  • E(w) = Σᵢ Error(xᵢ, yᵢ, w) = Σᵢ (yᵢ − f(xᵢ, w))²

• Minimize errors using an optimization algorithm


• Gradient descent is typically used

Adapted from K. Hauser’s slide
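A minimal sketch of this error function for the single sigmoid unit sketched above; the tiny dataset is invented purely for illustration.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def f(x, w):
    """Network output for one example: a single sigmoid unit."""
    return sigmoid(np.dot(w, x))

def total_error(X, y, w):
    """E(w) = sum_i (y_i - f(x_i, w))^2"""
    return sum((yi - f(xi, w)) ** 2 for xi, yi in zip(X, y))

# Toy dataset: 3 examples with 2 features each, binary labels (made up).
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])
w = np.zeros(2)
print(total_error(X, y, w))   # error before any training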


The gradient direction 𝛻𝐸 is orthogonal to the level sets (contours) of E and points in the direction of steepest increase.

Adapted from K. Hauser’s slide




Gradient descent: iteratively move in direction −𝛻E

Adapted from K. Hauser’s slide
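A minimal sketch of the update w ← w − η·𝛻E(w), continuing the sigmoid-unit example above. The step size, iteration count, and the finite-difference gradient estimate are illustrative choices, not the method prescribed by the slides.

import numpy as np

def numeric_gradient(E, w, eps=1e-5):
    """Finite-difference estimate of the gradient of E at w."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (E(w + step) - E(w - step)) / (2 * eps)
    return grad

def gradient_descent(E, w0, lr=0.5, iters=100):
    """Iteratively move in the direction -grad E."""
    w = w0.copy()
    for _ in range(iters):
        w -= lr * numeric_gradient(E, w)
    return w

# Example usage with the toy total_error / dataset defined earlier:
# w_star = gradient_descent(lambda w: total_error(X, y, w), np.zeros(2))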




Two-Layer Feed-Forward Neural Network

[Figure: inputs → hidden layer (weights w1j) → output layer (weights w2k)]

Adapted from K. Hauser’s slide
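A minimal sketch of the forward pass of such a two-layer network; the layer sizes and random weights are made up, since the slide only shows the structure.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W1, b1, W2, b2):
    """Inputs -> hidden layer -> output layer, sigmoid units throughout."""
    h = sigmoid(W1 @ x + b1)      # hidden activations
    y = sigmoid(W2 @ h + b2)      # output activations
    return h, y

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden -> 1 output
_, y = forward(np.array([0.5, -0.5]), W1, b1, W2, b2)
print(y)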


Multi-Layer Generalization

Slide credit: T. Martinez


Networks with hidden layers

• Can represent XORs, other nonlinear functions


• Many, many variants:
• Different network structures
• Different activation functions
• Etc…
• As the number of hidden units increases, the network’s
capacity to learn more complicated functions also
increases

• How to train hidden layers?


Adapted from K. Hauser’s slide
Backpropagation Algorithm

• Werbos (1974); Rumelhart, Hinton, and Williams (1986)


• Until convergence:
• Present a training pattern to network
• Calculate the error of the output nodes
• Calculate the error of the hidden nodes, based on the output
node error which is propagated back
• Continue back-propagating error until the input layer
• Update all weights in the network
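A minimal sketch of these steps for the two-layer sigmoid network sketched earlier, using a squared-error loss; the learning rate and shapes are illustrative assumptions.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One weight update for a two-layer sigmoid network with squared error."""
    # Present the training pattern to the network (forward pass).
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)

    # Error of the output nodes (derivative of (y - t)^2 through the sigmoid).
    delta_out = 2 * (y - t) * y * (1 - y)

    # Error of the hidden nodes: output error propagated back through W2.
    delta_hid = (W2.T @ delta_out) * h * (1 - h)

    # Update all weights in the network (in place).
    W2 -= lr * np.outer(delta_out, h)
    b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x)
    b1 -= lr * delta_hid
    return np.sum((y - t) ** 2)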
Background: Multi-Layer Neural Networks

• Nonlinear classifier
• Training: find network weights w to minimize the error between true
training labels yi and estimated labels fw(xi):
E(w) = Σᵢ₌₁ᴺ (yᵢ − f_w(xᵢ))²

• Minimization can be done by gradient descent, provided f is differentiable
• This training method is called back-propagation

Slide credit: Rob Fergus
Concepts

• Fully Connected Layer


• Convolutional Layer
• CNN Pipeline
• A Little Bit of History
Classification

Pedestrian Car Motorcycle Truck

What we want:
h_w(x) ∈ ℝᴷ, with one entry per class:
• h_w(x) ≈ [1, 0, 0, 0]ᵀ when the image is a pedestrian
• h_w(x) ≈ [0, 1, 0, 0]ᵀ when it is a car
• h_w(x) ≈ [0, 0, 1, 0]ᵀ when it is a motorcycle
• h_w(x) ≈ [0, 0, 0, 1]ᵀ when it is a truck
Classification: Softmax Classifier
• The Softmax classifier (also known as multinomial logistic regression)
• Remember that we can get a score for each class
• Key: we want to interpret the raw scores as probabilities
  o Probabilities must be ≥ 0
  o Probabilities must sum to 1

P(Y = k | X = xᵢ) = e^(s_k) / Σⱼ e^(s_j)
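A minimal sketch of this conversion from raw class scores to probabilities; the scores are made up, and subtracting the max is a standard numerical-stability trick not shown on the slide.

import numpy as np

def softmax(scores):
    """P(Y = k | X = x_i) = exp(s_k) / sum_j exp(s_j)."""
    shifted = scores - np.max(scores)       # for numerical stability
    exp_s = np.exp(shifted)
    return exp_s / np.sum(exp_s)

scores = np.array([3.2, 5.1, -1.7])         # raw scores for 3 classes (made up)
probs = softmax(scores)
print(probs, probs.sum())                   # non-negative, sums to 1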
Fully Connected Layer

• 32 × 32 × 3 image -> stretch to 3072 × 1

input x: 3072 × 1  →  weights W: 10 × 3072  →  activation Wx: 10 × 1 (one score per class)
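A minimal sketch of this layer in NumPy; the slide specifies only the shapes, so the image and weight values here are random placeholders.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))          # a 32x32x3 input image
x = image.reshape(3072)                  # stretch to a 3072-vector

W = rng.normal(size=(10, 3072))          # weights W: 10 x 3072
scores = W @ x                           # activation Wx: 10 class scores
print(scores.shape)                      # (10,)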
Convolutional Layer

• 32 × 32 × 3 image -> preserve spatial structure

32 × 32 × 3 image (32 height × 32 width × 3 depth); 5 × 5 × 3 filter.

Each output is 1 number: the result of taking a dot product between the filter and a small 5×5×3 chunk of the image (i.e. a 5·5·3 = 75-dimensional dot product plus a bias): wᵀx + b
Convolutional Layer

Convolving (sliding) the 5 × 5 × 3 filter over all spatial locations of the 32 × 32 × 3 image produces a 28 × 28 × 1 activation (feature) map.
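A minimal sketch of this operation with plain loops (stride 1, no padding); the image and filter values are random placeholders.

import numpy as np

def conv2d_single(image, filt, bias=0.0):
    """Slide one fh x fw x C filter over an H x W x C image, stride 1, no padding."""
    H, W, C = image.shape
    fh, fw, _ = filt.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            chunk = image[i:i + fh, j:j + fw, :]        # 5x5x3 chunk of the image
            out[i, j] = np.sum(chunk * filt) + bias     # 75-dim dot product + bias
    return out

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
filt = rng.normal(size=(5, 5, 3))
fmap = conv2d_single(image, filt)
print(fmap.shape)        # (28, 28): one activation map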
Convolutional Layer
For example, if we have six 5 × 5 filters, we will get 6 separate 28 × 28 activation maps. We stack these up to get a new “image” of size 28 × 28 × 6!


Convolutional Layer
A ConvNet is a sequence of convolutional layers, interspersed with activation functions.

32 × 32 × 3 input → CONV + ReLU (e.g. six 5×5×3 filters) → 28 × 28 × 6 → CONV + ReLU (e.g. ten 5×5×6 filters) → 24 × 24 × 10 → …
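The spatial sizes above follow from the rule output = (N − F + 2P) / S + 1 for input size N, filter size F, padding P, and stride S. A small sketch of this arithmetic, written as a hypothetical helper:

def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution: (N - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

print(conv_output_size(32, 5))   # 28: first conv layer (six 5x5x3 filters)
print(conv_output_size(28, 5))   # 24: second conv layer (ten 5x5x6 filters)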
Convolutional Neural Networks (CNN,
Convnet)
• Neural network with specialized
connectivity structure
• Stack multiple stages of feature
extractors
• Higher stages compute more global,
more invariant features
• Classification layer at the end

Slide credit: Rob Fergus


Convolutional Neural Networks (CNN,
Convnet)
• Feed-forward feature extraction:
  1. Convolve input with learned filters
  2. Non-linearity
  3. Spatial pooling
  4. Normalization
• Supervised training of convolutional filters by back-propagating classification error

[Figure: Input Image → Convolution (Learned) → Non-linearity → Spatial pooling → Normalization → Feature maps]

Slide credit: Rob Fergus


1. Convolution
• Dependencies are local
• Translation invariance
• Few parameters (filter weights)
• Stride can be greater than 1
(faster, less memory)

[Figure: local connections from the input to a feature map]

Slide credit: Rob Fergus
2. Non-Linearity

• Per-element (independent)
• Options:
• Tanh
• Sigmoid: 1/(1+exp(-x))
• Rectified linear unit (ReLU)
• Simplifies backpropagation
• Makes learning faster
• Avoids saturation issues
→ Preferred option

Slide credit: Rob Fergus
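A minimal sketch of two of the per-element options listed above (ReLU and sigmoid) applied to a feature map; the input values are placeholders.

import numpy as np

def relu(x):
    """Rectified linear unit, applied element-wise."""
    return np.maximum(0, x)

def sigmoid(x):
    """Sigmoid, applied element-wise: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

fmap = np.array([[-1.5, 0.3], [2.0, -0.2]])   # example feature-map values
print(relu(fmap))        # negatives clipped to 0
print(sigmoid(fmap))     # squashed into (0, 1)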


3. Spatial Pooling
• Sum or max
• Non-overlapping / overlapping regions
• Role of pooling:
• Invariance to small transformations
• Larger receptive fields (see more of input)

[Figure: max pooling and sum pooling over local regions]

Slide credit: Rob Fergus
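A minimal sketch of non-overlapping 2×2 max and sum pooling on a single feature map; the map values are made up.

import numpy as np

def pool2x2(fmap, op=np.max):
    """Non-overlapping 2x2 pooling with the given reduction (np.max or np.sum)."""
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = op(fmap[2*i:2*i+2, 2*j:2*j+2])
    return out

fmap = np.array([[1., 2., 0., 1.],
                 [3., 4., 2., 2.],
                 [0., 1., 5., 6.],
                 [1., 0., 7., 8.]])
print(pool2x2(fmap, np.max))   # [[4, 2], [1, 8]]
print(pool2x2(fmap, np.sum))   # [[10, 5], [2, 26]]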
Pooling
• Create some translational invariance at each level
by averaging 4 neighboring replicated detectors to
give a single output to the next level.
• Reduces number of inputs to the next layer of feature
extraction, thus allowing us to have many more different
feature maps.
• Taking the maximum of the four works slightly better.
• Problem: After several levels of pooling, we lose
information about where objects are.
• This makes it impossible to use the precise spatial
relationships between high-level parts for recognition.
• So CNNs are good for classification, not (directly) useful for
object localization.

Slide credit: G. Hinton


4. Normalization
• Within or across feature maps
• Before or after spatial pooling

[Figure: feature maps before and after contrast normalization]

Slide credit: Rob Fergus


Compare: SIFT Descriptor
Lowe [IJCV 2004]

Image Pixels → Apply oriented filters → Spatial pool (sum) → Normalize to unit length → Feature Vector

Slide credit: Rob Fergus


Convnet Successes
• Handwritten text/digits
• MNIST (0.17% error [Ciresan et al. 2011])
• Arabic & Chinese [Ciresan et al. 2012]

• Simpler recognition benchmarks


• CIFAR-10 (9.3% error [Wan et al. 2013])
• Traffic sign recognition
• 0.56% error vs 1.16% for humans
[Ciresan et al. 2011]

• But until recently, less good at more complex datasets
  • Caltech-101/256 (few training examples)
Slide credit: Rob Fergus
ImageNet Challenge 2012
• ~14 million labeled images, 20K classes
• Images gathered from the Internet
• Human labels via Amazon Mechanical Turk
• Challenge: 1.2 million training images, 1000 classes

[Figure: validation classification examples]

[Deng et al. CVPR 2009]

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks,
NIPS 2012
Slide credit: Rob Fergus
ImageNet Challenge 2012
• Similar framework to LeCun’98 but:
• Bigger model (7 hidden layers, 650,000 units, 60,000,000 params)
• More data (10⁶ vs. 10³ images)
• GPU implementation (50x speedup over CPU)
• Trained on two GPUs for a week
• Better regularization for training (DropOut)

A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks,
NIPS 2012
Slide credit: Rob Fergus
ImageNet Challenge 2012
• Krizhevsky et al. -- 16.4% error (top-5)
• Next best (non-convnet) – 26.2% error
[Bar chart: top-5 error rate (%) for SuperVision, ISI, Oxford, INRIA, Amsterdam]

Slide credit: Rob Fergus


Tools for deep learning
• Torch
• NYU, framework in Lua, supported by Facebook
• Caffe
• Berkeley, C++ with Python and Matlab wrappers, very active open-source
community
• Theano/Pylearn2
• U. Montreal, framework in Python, symbolic computation and automatic
differentiation
• Cuda-Convnet2
• Alex Krizhevsky; very fast on state-of-the-art GPUs with multi-GPU parallelism, C++ / CUDA library
• MatConvNet
• CXXNet
• Mocha
• …
Slide credit: X. Chen
Training data
• For 60 million parameters, how much training data do we
need?

• How do we get that much data?

• How do we avoid over-fitting?


Tricks to improve generalization
• Train on random 224x224 patches from the 256x256
images to get more data. Also use left-right reflections of
the images.

• Use “dropout” to regularize the weights in the globally connected layers (which contain most of the parameters).
  – Dropout means that half of the hidden units in a layer are randomly removed for each training example.
  – This stops hidden units from relying too much on other hidden units.

Slide credit: G. Hinton
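A minimal sketch of these two tricks (random 224×224 crops with left-right reflections, and dropout with rate 0.5). The array shapes and random seed are illustrative, and the train/test rescaling that usually accompanies dropout is omitted for brevity.

import numpy as np

rng = np.random.default_rng(0)

def random_crop_and_flip(image, size=224):
    """Take a random size x size patch and randomly mirror it left-right."""
    H, W, _ = image.shape
    top = rng.integers(0, H - size + 1)
    left = rng.integers(0, W - size + 1)
    patch = image[top:top + size, left:left + size, :]
    if rng.random() < 0.5:
        patch = patch[:, ::-1, :]            # left-right reflection
    return patch

def dropout(activations, p=0.5):
    """Randomly remove (zero out) a fraction p of the hidden units for this example."""
    mask = rng.random(activations.shape) >= p
    return activations * mask

image = rng.random((256, 256, 3))
print(random_crop_and_flip(image).shape)     # (224, 224, 3)
print(dropout(np.ones(8)))                   # roughly half the units zeroed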
