
Bag of Feature

This document discusses bag-of-features models for image classification. It describes how bag-of-features models represent images as histograms of visual word frequencies. The process involves extracting local image features, quantizing them into visual words via clustering, and encoding each image as a histogram of visual word counts. These histograms can then be classified using techniques like support vector machines. Nonlinear kernels allow the histograms to be separated in higher-dimensional feature spaces for improved classification performance.

Uploaded by

Budi Purnomo

Bag-of-features models

for category classification

Cordelia Schmid
Category recognition
• Image classification: assigning a class label to the image
Car: present
Cow: present
Bike: not present
Horse: not present

• Object localization: define the location and the category
[Figure: image with location and category marked for a car and a cow]
Difficulties: within-object variations

Variability: camera position, illumination, internal parameters

Difficulties: within-class variations
Image classification
• Given
Positive training images containing an object class

Negative training images that don’t

• Classify
A test image as to whether it contains the object class or not

?
Bag-of-features – Origin: texture recognition

• Texture is characterized by the repetition of basic elements or textons

Julesz, 1981; Cula & Dana, 2001; Leung & Malik 2001; Mori, Belongie & Malik, 2001;
Schmid 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003
[Figure: histogram over a universal texton dictionary]
Bag-of-features – Origin: bag-of-words (text)
• Orderless document representation: frequencies of words
from a dictionary
• Classification to determine document categories

Bag-of-words (word counts, one column per document)
Common      2  0  1  3
People      3  0  0  2
Sculpture   0  1  3  0
…           …  …  …  …
Bag-of-features for image classification

Extract regions → Compute descriptors → Find clusters and frequencies → Compute distance matrix → Classification (SVM)

Step 1: extract regions and compute descriptors
Step 2: find clusters and frequencies
Step 3: compute distance matrix and classify

[Csurka et al., ECCV Workshop’04], [Nowak, Jurie & Triggs, ECCV’06],
[Zhang, Marszalek, Lazebnik & Schmid, IJCV’07]
Step 1: feature extraction
• Scale-invariant image regions + SIFT (see previous lecture)
– Affine invariant regions give “too” much invariance
– Rotation invariance is “too” much invariance for many realistic collections

• Dense descriptors
– Improve results in the context of categories (for most categories)
– Interest points do not necessarily capture “all” features

• Color-based descriptors

• Shape-based descriptors
Dense features

- Multi-scale dense grid: extraction of small overlapping patches at multiple scales
- Computation of the SIFT descriptor for each grid cell
- Exp.: horizontal/vertical step size of 3 pixels, scaling factor of 1.2 per level
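
The multi-scale dense grid can be sketched as follows. This is only an illustration under stated assumptions: the patch size (rather than the image) is scaled by 1.2 per level, and the names `dense_grid`, `base_patch` and `levels` with their default values are hypothetical, not taken from the slides. Only the patch locations are generated; computing a SIFT descriptor per patch is left out.

```python
import numpy as np

def dense_grid(height, width, base_patch=16, step=3, scale=1.2, levels=3):
    """Return rows (x, y, size) of overlapping square patches on a multi-scale grid.

    base_patch and levels are illustrative assumptions; step=3 and scale=1.2
    follow the experimental setting quoted on the slide.
    """
    patches = []
    for level in range(levels):
        # Patch side length grows by `scale` at each level.
        size = int(round(base_patch * scale ** level))
        for y in range(0, height - size + 1, step):
            for x in range(0, width - size + 1, step):
                patches.append((x, y, size))
    return np.array(patches, dtype=int)

grid = dense_grid(height=64, width=48)
print(grid.shape)  # one row per patch: top-left x, top-left y, side length
```

Each row can then be cropped from the image and passed to any local descriptor.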
Step 2: Quantization

[Figure: clustering of descriptors into a visual vocabulary]

Examples for visual words: Airplanes, Motorbikes, Faces, Wild Cats, Leaves, People, Bikes
Step 2: Quantization
• Cluster descriptors
– K-means
– Gaussian mixture model

• Assign each descriptor to a cluster (visual word)
– Hard or soft assignment

• Build frequency histogram


K-means clustering
• Minimizing the sum of squared Euclidean distances between points x_i and their nearest cluster centers

• Algorithm:
– Randomly initialize K cluster centers
– Iterate until convergence:
• Assign each data point to the nearest center
• Recompute each cluster center as the mean of all points assigned to it

• Local minimum, solution dependent on initialization

• Initialization important, run several times, select best
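
The algorithm above, including the "run several times, select best" advice, can be sketched in NumPy as below. The function name `kmeans`, its parameters and the convergence check are illustrative assumptions, not the implementation used in the lecture.

```python
import numpy as np

def kmeans(X, k, n_init=5, max_iter=100, seed=0):
    """K-means with several random restarts; returns (cost, centers, labels)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_init):
        # Random initialization: pick k distinct data points as centers.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(max_iter):
            # Assign each point to the nearest center (squared Euclidean distance).
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute each center as the mean of the points assigned to it.
            new_centers = centers.copy()
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    new_centers[j] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        cost = d2[np.arange(len(X)), labels].sum()
        # Keep the run with the lowest sum of squared distances.
        if best is None or cost < best[0]:
            best = (cost, centers, labels)
    return best

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(10.0, 0.5, (50, 2))])
cost, centers, labels = kmeans(X, k=2)
```

For a visual vocabulary, `X` would hold the local descriptors pooled over the training images and `k` would be the vocabulary size.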


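Gaussian mixture model (GMM)
• Mixture of Gaussians: weighted sum of Gaussians

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

where $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is a Gaussian with mean $\mu_k$ and covariance $\Sigma_k$, and the weights $\pi_k \geq 0$ sum to 1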
Hard or soft assignment
• K-means → hard assignment
– Assign to the closest cluster center
– Count number of descriptors assigned to a center

• Gaussian mixture model → soft assignment
– Estimate distance to all centers
– Sum over number of descriptors

• Represent image by a frequency histogram
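
The two assignment schemes can be sketched as follows. This is an illustration under assumptions: soft assignment is implemented with the posterior probabilities of a diagonal-covariance GMM, both histograms are L1-normalized, and all function and parameter names are hypothetical.

```python
import numpy as np

def hard_histogram(desc, centers):
    """Count descriptors per nearest center, then L1-normalize."""
    d2 = ((desc[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(centers)).astype(float)
    return counts / counts.sum()

def soft_histogram(desc, weights, means, variances):
    """Sum GMM posteriors over descriptors (diagonal covariances), then L1-normalize."""
    diff = desc[:, None, :] - means[None, :, :]
    # log(weight_k * N(x | mean_k, diag(variance_k))) for every descriptor and component
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :]
             - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)
    hist = post.sum(axis=0)
    return hist / hist.sum()

desc = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
hard = hard_histogram(desc, centers)
soft = soft_histogram(desc, np.array([0.5, 0.5]), centers, np.ones((2, 2)))
```

With well-separated centers the two histograms nearly coincide; they differ when descriptors lie between clusters.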


Image representation
[Figure: histogram of frequency over codewords]

• each image is represented by a vector, typically of 1000-4000 dimensions, normalized with the L1 or L2 norm
• fine grained – represent model instances
• coarse grained – represent object categories
Step 3: Classification

• Learn a decision rule (classifier) assigning bag-of-features representations of images to different classes

[Figure: decision boundary separating zebra from non-zebra images]

Training data
Vectors are histograms, one from each training image
[Figure: positive and negative training histograms]

Train classifier, e.g. SVM
Linear classifiers
• Find linear function (hyperplane) to separate positive and negative examples

$x_i$ positive: $x_i \cdot w + b \geq 0$
$x_i$ negative: $x_i \cdot w + b < 0$

Which hyperplane is best?
Linear classifiers - margin
• Generalization is not good in this case:
[Figure: axes x1 (roundness) and x2 (color)]

• Better if a margin is introduced: b/|w|
[Figure: axes x1 (roundness) and x2 (color)]
Nonlinear SVMs
• Datasets that are linearly separable work out great:
[Figure: points along the x axis]

• But what if the dataset is just too hard?
[Figure: points along the x axis]

• We can map it to a higher-dimensional space:
[Figure: points plotted against x and x²]
Nonlinear SVMs
• General idea: the original input space can always be
mapped to some higher-dimensional feature space
where the training set is separable:

Φ: x → φ(x)
Nonlinear SVMs

• The kernel trick: instead of explicitly computing the lifting transformation φ(x), define a kernel function K such that
$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$

• This gives a nonlinear decision boundary in the original feature space:
$\sum_i \alpha_i y_i K(x_i, x) + b$
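
As a small check of the kernel trick, the sketch below uses the quadratic kernel K(x, z) = (x · z)² in two dimensions, whose explicit lifting is φ(x) = (x1², √2·x1·x2, x2²). The choice of kernel and the names `phi`, `poly_kernel` and `decision_value` are assumptions made for illustration; the slides do not single out this kernel.

```python
import numpy as np

def phi(x):
    """Explicit lifting for the 2D quadratic kernel (illustrative choice)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without ever forming phi."""
    return float(np.dot(x, z)) ** 2

def decision_value(x, support_vectors, alphas, labels, b, kernel):
    """Evaluate sum_i alpha_i * y_i * K(x_i, x) + b."""
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
lifted = float(np.dot(phi(x), phi(z)))
direct = poly_kernel(x, z)
```

Both `lifted` and `direct` evaluate the same inner product; the kernel only needs the original 2D vectors.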
Kernels for bags of features
• Histogram intersection kernel:
$I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))$

• Generalized Gaussian kernel:
$K(h_1, h_2) = \exp\left(-\frac{1}{A} D(h_1, h_2)^2\right)$

• D can be Euclidean distance → RBF kernel

• D can be χ² distance:
$D(h_1, h_2) = \sum_{i=1}^{N} \frac{(h_1(i) - h_2(i))^2}{h_1(i) + h_2(i)}$
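
The kernels on this slide can be sketched in NumPy as below. The function names and the small `eps` guard for empty bins are assumptions; the generalized Gaussian kernel follows the form written on the slide, with the distance squared inside the exponential.

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Sum of bin-wise minima."""
    return float(np.minimum(h1, h2).sum())

def chi2_distance(h1, h2, eps=1e-12):
    """Sum over bins of (h1 - h2)^2 / (h1 + h2); bins empty in both contribute 0."""
    num = (h1 - h2) ** 2
    den = np.maximum(h1 + h2, eps)
    return float((num / den).sum())

def euclidean_distance(h1, h2):
    return float(np.linalg.norm(h1 - h2))

def generalized_gaussian_kernel(h1, h2, distance, A):
    """exp(-D(h1, h2)^2 / A) for a chosen distance D and scale A."""
    return float(np.exp(-distance(h1, h2) ** 2 / A))

h1 = np.array([0.5, 0.5, 0.0])
h2 = np.array([0.25, 0.25, 0.5])
inter = histogram_intersection(h1, h2)
chi2 = chi2_distance(h1, h2)
k_rbf = generalized_gaussian_kernel(h1, h2, euclidean_distance, A=1.0)
```

Swapping `euclidean_distance` for `chi2_distance` changes the kernel without touching the classifier.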
Combining features
• SVM with multi-channel chi-square kernel

● Channel c is a combination of detector, descriptor

● D_c(H_i, H_j) is the chi-square distance between histograms

$D_c(H_1, H_2) = \frac{1}{2} \sum_{i=1}^{m} \frac{(h_{1i} - h_{2i})^2}{h_{1i} + h_{2i}}$

● A_c is the mean value of the distances between all training samples

● Extension: learning of the weights, for example with Multiple Kernel Learning (MKL)

[J. Zhang, M. Marszalek, S. Lazebnik and C. Schmid. Local features and kernels for
classification of texture and object categories: a comprehensive study, IJCV 2007]
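
A sketch of a multi-channel chi-square kernel follows. The slide text does not preserve the kernel formula itself, so the form K = exp(−Σ_c D_c / A_c) used here is an assumption based on the cited IJCV 2007 study; the function names and the use of the off-diagonal mean for A_c are also illustrative choices.

```python
import numpy as np

def chi2_matrix(H, eps=1e-12):
    """Pairwise chi-square distances (with the 1/2 factor) between rows of H."""
    diff = H[:, None, :] - H[None, :, :]
    total = np.maximum(H[:, None, :] + H[None, :, :], eps)
    return 0.5 * (diff ** 2 / total).sum(axis=2)

def multichannel_kernel(channels):
    """Combine channels as exp(-sum_c D_c / A_c).

    channels: list of (n_images, n_bins_c) histogram arrays, one per channel.
    A_c is the mean chi-square distance between distinct training samples;
    this assumes the histograms in a channel are not all identical (A_c > 0).
    """
    n = len(channels[0])
    off_diagonal = ~np.eye(n, dtype=bool)
    exponent = np.zeros((n, n))
    for H in channels:
        D = chi2_matrix(H)
        A = D[off_diagonal].mean()
        exponent += D / A
    return np.exp(-exponent)

rng = np.random.default_rng(0)
channels = [rng.random((4, 5)), rng.random((4, 3))]
channels = [H / H.sum(axis=1, keepdims=True) for H in channels]  # L1-normalize
K = multichannel_kernel(channels)
```

The resulting matrix `K` can be handed to any SVM solver that accepts a precomputed kernel.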
Combining features
• For linear SVMs
– Early fusion: concatenation of the descriptors
– Late fusion: learning weights to combine the classification scores

• Theoretically no clear winner

• In practice late fusion gives better results
– In particular if different modalities are combined
Multi-class SVMs
• Various direct formulations exist, but they are not widely used in practice. It is more common to obtain multi-class SVMs by combining two-class SVMs in various ways

• One versus all:
– Training: learn an SVM for each class versus the others
– Testing: apply each SVM to the test example and assign to it the class of the SVM that returns the highest decision value

• One versus one:
– Training: learn an SVM for each pair of classes
– Testing: each learned SVM “votes” for a class to assign to the test example
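
The two test-time rules can be sketched as below, assuming the two-class SVMs are already trained and only their decision values are available. The function names and the input layout (a score matrix, and a dictionary keyed by class pairs) are hypothetical.

```python
import numpy as np

def predict_one_vs_all(scores):
    """scores[i, c] is the decision value of the class-c-versus-rest SVM for example i."""
    return scores.argmax(axis=1)

def predict_one_vs_one(pair_scores, n_classes):
    """pair_scores[(a, b)][i] > 0 means the a-versus-b SVM votes for a, otherwise for b."""
    n = len(next(iter(pair_scores.values())))
    votes = np.zeros((n, n_classes), dtype=int)
    for (a, b), s in pair_scores.items():
        winner = np.where(s > 0, a, b)
        votes[np.arange(n), winner] += 1
    # Ties are broken in favor of the lowest class index (an arbitrary choice).
    return votes.argmax(axis=1)

ova = predict_one_vs_all(np.array([[0.2, -1.0, 0.5],
                                   [1.0, 0.3, -0.2]]))
ovo = predict_one_vs_one({(0, 1): np.array([-1.0, 1.0]),
                          (0, 2): np.array([-1.0, 1.0]),
                          (1, 2): np.array([1.0, -1.0])}, n_classes=3)
```

One versus all needs one SVM per class, while one versus one needs one per pair of classes.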
Why does SVM learning work?

• Learns foreground and background visual words

foreground words – high weight

background words – low weight


Illustration

Localization according to visual word probability

[Figure: four correctly classified images (35, 37, 38, 39); each visual word is marked as "foreground word more probable" or "background word more probable"]


Illustration
A linear SVM trained from positive and negative window descriptors

A few of the highest weighted descriptor vector dimensions (= 'PAS + tile')

+ lie on object boundary (= local shape structures common to many training exemplars)
Bag-of-features for image classification
• Excellent results in the presence of background clutter

[Figure: example images for the classes bikes, books, building, cars, people, phones, trees]


Examples for misclassified images

Books- misclassified into faces, faces, buildings

Buildings- misclassified into faces, trees, trees

Cars- misclassified into buildings, phones, phones


Bag of visual words summary

• Advantages:
– largely unaffected by position and orientation of object in image
– fixed length vector irrespective of number of detections
– very successful in classifying images according to the objects they contain

• Disadvantages:
– no explicit use of configuration of visual word positions
– no model of the object location
Evaluation of image classification
• PASCAL VOC [05-12] datasets

• PASCAL VOC 2007
– Training and test dataset available
– Used to report state-of-the-art results
– Collected January 2007 from Flickr
– 500 000 images downloaded and random subset selected
– 20 classes
– Class labels per image + bounding boxes
– 5011 training images, 4952 test images

• Evaluation measure: average precision


PASCAL 2007 dataset
Evaluation
Precision/Recall

• Ranked list for category A :

A, C, B, A, B, C, C, A ; in total four images with category A
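
For the ranked list above, precision and recall at each correct result, and a non-interpolated average precision, can be computed as in this sketch. The function name is hypothetical, and the non-interpolated definition is an assumption: the official PASCAL VOC 2007 measure is an 11-point interpolated variant, which is not reproduced here.

```python
def average_precision(ranked_labels, target, n_positive):
    """Mean of the precision values at each rank where `target` appears.

    Positives that never appear in the ranked list contribute zero precision.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == target:
            hits += 1
            precision = hits / rank
            recall = hits / n_positive
            print(f"rank {rank}: precision {precision:.3f}, recall {recall:.2f}")
            precision_sum += precision
    return precision_sum / n_positive

ranked = ["A", "C", "B", "A", "B", "C", "C", "A"]
ap = average_precision(ranked, "A", n_positive=4)
print(ap)  # (1/1 + 2/4 + 3/8) / 4 = 0.46875
```

Only three of the four images of category A are retrieved, so recall stops at 0.75 and the missing image lowers the average.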


Results for PASCAL 2007
• Winner of PASCAL 2007 [Marszalek et al.]: mAP 59.4
– Combination of several different channels (dense + interest points, SIFT + color descriptors, spatial grids)
– Non-linear SVM with Gaussian kernel

• Multiple kernel learning [Yang et al. 2009]: mAP 62.2
– Combination of several features
– Group-based MKL approach

• Combining object localization and classification [Harzallah et al.’09]: mAP 63.5
– Use detection results to improve classification

• Adding objectness boxes [Sanchez et al.’12]: mAP 66.3


Spatial pyramid matching
• Add spatial information to the bag-of-features

• Perform matching in 2D image space

[Lazebnik, Schmid & Ponce, CVPR 2006]


Related work
Similar approaches:
Subblock description [Szummer & Picard, 1997]
SIFT [Lowe, 1999]
GIST [Torralba et al., 2003]

[Figure: subblock description, Szummer & Picard (1997); SIFT, Lowe (1999, 2004); Gist, Torralba et al. (2003)]
Spatial pyramid representation

Locally orderless representation at several levels of spatial resolution

[Figure: pyramid levels 0, 1 and 2]


Spatial pyramid matching
• Combination of spatial levels with pyramid match kernel [Grauman & Darrell’05]
• Intersect histograms, more weight to finer grids
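
A sketch of the pyramid match over spatial levels follows. The level weights (1/2^L for level 0 and 1/2^(L−l+1) for level l ≥ 1) are those of the cited CVPR 2006 paper as best recalled, not values stated on the slide; the function names and the assumption that feature positions are normalized to [0, 1] are illustrative.

```python
import numpy as np

def pyramid_histograms(xy, words, vocab_size, levels):
    """One concatenated per-cell visual-word histogram for each level 0..levels.

    xy: (n, 2) feature positions normalized to [0, 1]; words: (n,) integer word ids.
    """
    hists = []
    for l in range(levels + 1):
        cells = 2 ** l  # the grid at level l has cells x cells regions
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        index = (cy * cells + cx) * vocab_size + words
        hists.append(np.bincount(index, minlength=cells * cells * vocab_size))
    return hists

def pyramid_match_kernel(hists_a, hists_b):
    """Weighted sum of histogram intersections, finer grids weighted more."""
    L = len(hists_a) - 1
    inter = [np.minimum(a, b).sum() for a, b in zip(hists_a, hists_b)]
    value = inter[0] / 2 ** L
    for l in range(1, L + 1):
        value += inter[l] / 2 ** (L - l + 1)
    return float(value)

# Same visual word, but in opposite corners of the two images.
ha = pyramid_histograms(np.array([[0.1, 0.1]]), np.array([0]), vocab_size=5, levels=2)
hb = pyramid_histograms(np.array([[0.9, 0.9]]), np.array([0]), vocab_size=5, levels=2)
k_ab = pyramid_match_kernel(ha, hb)
```

In this toy case the two features only match at level 0, so the kernel value is the level-0 weight alone.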
Scene dataset [Lazebnik et al.’06]
Coast, Forest, Mountain, Open country, Highway, Inside city, Tall building, Street, Suburb, Bedroom, Kitchen, Living room, Office, Store, Industrial

4385 images
15 categories
Scene classification

L         Single-level   Pyramid
0 (1x1)   72.2±0.6
1 (2x2)   77.9±0.6       79.0±0.5
2 (4x4)   79.4±0.3       81.1±0.3
3 (8x8)   77.2±0.4       80.7±0.3
Retrieval examples
Category classification – CalTech101

L         Single-level   Pyramid
0 (1x1)   41.2±1.2
1 (2x2)   55.9±0.9       57.0±0.8
2 (4x4)   63.6±0.9       64.6±0.8
3 (8x8)   60.3±0.9       64.6±0.7
Evaluation BoF – spatial
Image classification results on PASCAL’07 train/val set

(SH, Lap, MSD) x (SIFT, SIFTC)

spatial layout   AP
1                0.53
2x2              0.52
3x1              0.52
1,2x2,3x1        0.54

Spatial layout not dominant for PASCAL’07 dataset

Combination improves average results, i.e., it is appropriate for some classes
Evaluation BoF - spatial

Image classification results on PASCAL’07 train/val set for individual categories

              1       3x1
Sheep         0.339   0.256
Bird          0.539   0.484
DiningTable   0.455   0.502
Train         0.724   0.745

Results are category dependent!
→ Combination helps somewhat
Discussion

• Summary
– Spatial pyramid representation: appearance of local image patches + coarse global position information
– Substantial improvement over bag of features
– Depends on the similarity of image layout

• Recent extensions
– Flexible, object-centered grid
• Shape masks [Marszalek’12] => additional annotations
– Weakly supervised localization of objects
• [Russakovsky et al.’12]
Recent extensions

• Efficient Additive Kernels via Explicit Feature Maps [Perronnin et al.’10, Maji and Berg’09, A. Vedaldi and Zisserman’10]

• Recently improved aggregation schemes


– Fisher vector [Perronnin & Dance ‘07]
– VLAD descriptor [Jegou, Douze, Schmid, Perez ‘10]
– Supervector [Zhou et al. ‘10]
– Sparse coding [Wang et al. ’10, Boureau et al.’10]

• Improved performance + linear SVM


Fisher vector

• Use a Gaussian Mixture Model as vocabulary
• Statistical measure of the descriptors of the image w.r.t. the GMM
• Derivative of likelihood w.r.t. GMM parameters

GMM parameters:
weight
mean
co-variance (diagonal)

Translated cluster → large derivative for this component

[Perronnin & Dance 07]


Fisher vector

For image retrieval in our experiments:
- only deviation w.r.t. mean, dim: K*D [K number of Gaussians, D dim of descriptor]
- variance does not improve for comparable vector length
Image classification with Fisher vector
• Dense SIFT
• Fisher vector (k=32 to 1024, total dimension from approx.
5000 to 160000)
• Normalization
– square-rooting
– L2 normalization
– [Perronnin’10], [Image categorization using Fisher kernels of non-iid
image models, Cinbis, Verbeek, Schmid, CVPR’12]

• Classification approach
– Linear classifiers
– One versus rest classifier
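
A sketch of a Fisher vector restricted to the derivative with respect to the GMM means (dimension K*D), followed by the square-rooting and L2 normalization listed above. The scaling by 1/(N·√weight) and by the standard deviation follows the usual formulation in the Fisher vector literature and is an assumption here, as are all function and parameter names; the GMM is taken as already fitted, with diagonal covariances.

```python
import numpy as np

def fisher_vector_means(desc, weights, means, variances):
    """Gradient of the GMM log-likelihood w.r.t. the means, flattened to K*D."""
    n = len(desc)
    diff = desc[:, None, :] - means[None, :, :]                 # (N, K, D)
    log_p = (np.log(weights)[None, :]
             - 0.5 * np.log(2 * np.pi * variances).sum(axis=1)[None, :]
             - 0.5 * (diff ** 2 / variances[None, :, :]).sum(axis=2))
    log_p -= log_p.max(axis=1, keepdims=True)                   # numerical stability
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                   # soft assignments (N, K)
    grad = (gamma[:, :, None] * diff / np.sqrt(variances)[None, :, :]).sum(axis=0)
    grad /= n * np.sqrt(weights)[:, None]
    return grad.ravel()

def normalize(fv):
    """Signed square-rooting followed by L2 normalization."""
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv

# Toy GMM with K=2 components in D=3 dimensions; descriptors sit near component 0
# but are shifted by +1 from its mean (a "translated cluster").
weights = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0]])
variances = np.ones((2, 3))
fv = fisher_vector_means(np.ones((5, 3)), weights, means, variances)
fv_normalized = normalize(fv)
```

The shifted descriptors produce a large gradient for component 0 and an almost zero gradient for component 1.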
Image classification with Fisher vector
• Evaluation on PASCAL VOC’07 linear classifiers with
– Fisher vector
– Sqrt transformation of Fisher vector
– Latent GMM of Fisher vector

• Sqrt transform + latent MOG models lead to improvement

• State-of-the-art performance obtained with linear classifier
Evaluation image description
Fisher versus BOF vector + linear classifier on Pascal Voc’07

• Fisher improves over BOF
• Fisher comparable to BOF + non-linear classifier
• Limited gain due to SPM on PASCAL
• Sqrt helps for Fisher and BOF
• [Chatfield et al. 2011]
Large-scale image classification
ImageNet has 14M images from 22k classes

Standard Subsets
– ImageNet Large Scale Visual Recognition Challenge 2010 (ILSVRC)
• 1000 classes and 1.4M images
– ImageNet10K dataset
• 10184 classes and ~ 9 M images
Large-scale image classification
• Classification approach
– One-versus-rest classifiers
– Stochastic gradient descent (SGD)
– At each step choose a sample at random and update the parameters using a sample-wise estimate of the regularized risk

• Data reweighting
– When some classes are significantly more populated than others, rebalancing positive and negative examples
– Empirical risk with reweighting

Natural rebalancing, same weight to positive and negatives
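
One way to sketch SGD training of a single one-versus-rest classifier with rebalancing is shown below. Several choices are assumptions not given on the slides: the hinge loss with L2 regularization, the step-size schedule, and implementing the reweighting by sampling a positive with probability 1/(1+β), so that on average β negatives are drawn per positive. The function and parameter names are hypothetical.

```python
import numpy as np

def sgd_one_vs_rest(X, y, beta=1.0, lam=1e-3, eta0=0.1, n_steps=5000, seed=0):
    """Linear SVM for one class versus the rest; y is +1 for the class, -1 otherwise."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y > 0)
    neg = np.flatnonzero(y < 0)
    w = np.zeros(X.shape[1])
    b = 0.0
    p_pos = 1.0 / (1.0 + beta)  # beta = 1: positives and negatives drawn equally often
    for t in range(n_steps):
        # Choose one sample at random, rebalanced between positives and negatives.
        pool = pos if rng.random() < p_pos else neg
        i = rng.choice(pool)
        eta = eta0 / (1.0 + lam * eta0 * t)
        margin = y[i] * (X[i] @ w + b)
        # Sample-wise (sub)gradient step on the L2-regularized hinge loss.
        w *= 1.0 - eta * lam
        if margin < 1.0:
            w += eta * y[i] * X[i]
            b += eta * y[i]
    return w, b

# Imbalanced toy problem: 20 positives against 400 negatives.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (400, 2))])
y = np.concatenate([np.ones(20), -np.ones(400)])
w, b = sgd_one_vs_rest(X, y, beta=1.0)
accuracy = float((np.sign(X @ w + b) == y).mean())
```

For a multi-class problem, one such classifier would be trained per class and the class with the highest score selected at test time.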


Importance of re-weighting

• Plain lines correspond to w-OVR, dashed ones to u-OVR

• β is the number of negative samples for each positive, β=1 natural rebalancing

• Results for ILSVRC 2010

• Significant impact on accuracy

• For very high dimensions little impact
Impact of the image signature size
• Fisher vector (no SP) for varying number of Gaussians +
different classification methods, ILSVRC 2010

• Performance improves for higher dimensional vectors
Experimental results
• Features: dense SIFT, reduced to 64 dim with PCA

• Fisher vectors
– 256 Gaussians, using mean and variance
– Spatial pyramid with 4 regions
– Approx. 130K dimensions (4x [2x64x256])
– Normalization: square-rooting and L2 norm

• BOF: dim 1024 + R=4
– 4096 dimensions (4 x 1024)
– Normalization: square-rooting and L2 norm
Experimental results for ILSVRC 2010

• Features: dense SIFT, reduced to 64 dim with PCA
• 256 Gaussian Fisher vector using mean and variance + SP (3x1) (4x [2x64x256] ~ 130k dim), square-root + L2 norm
• BOF dim=1024 + SP (3x1) (dim 4000), square-root + L2 norm
• Different classification methods
Large-scale experiment on ImageNet10k

[Figure: top-1 accuracy; value shown: 16.7]

• Significant gain by data re-weighting, even for high-dimensional Fisher vectors
• w-OVR > u-OVR
• Improves over state of the art: 6.4% [Deng et al.] and WAR [Weston et al.]
Large-scale experiment on ImageNet10k
• Illustration of results obtained with w-OVR and 130K-dim Fisher vectors, ImageNet10K top-1 accuracy
Conclusion

• Stochastic training: learning with SGD is well-suited for large-scale datasets

• One-versus-rest: a flexible option for large-scale image classification

• Class imbalance: optimizing the imbalance parameter in the one-versus-rest strategy is a must for competitive performance
Conclusion

• State-of-the-art performance for large-scale image classification

• Code on-line available at http://lear.inrialpes.fr/software

• Future work
– Beyond a single representation of the entire image
– Take into account the hierarchical structure
