Information Theory
in Computer Vision
and Pattern Recognition
Foreword by
Alan Yuille
Francisco Escolano
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante
Spain
[email protected]

Boyán Bonev
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante
Spain
[email protected]

Pablo Suau
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante
Spain
[email protected]
Springer
© Springer-Verlag London Limited 2009
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the publish-
ers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the
Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to
the publishers.
The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore free
for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information
contained in this book and cannot accept any legal responsibility or liability for any errors or omissions
that may be made.
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Measures, Principles, Theories, and More . . . . . . . . . . . . . . . . . . . 1
1.2 Detailed Organization of the Book . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The ITinCVPR Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
1 Redundancy reduction revisited, Network: Comput. Neural Syst. 12 (2001) 241–253.
2 Principal Component Analysis, Independent Component Analysis.
pattern classes are too close). For the zero value, there is a phase transition. These results come from the analysis of other tasks, like detecting a path (edge) among clutter, and from the existence of order parameters, which allow one to quantify the degree of success of related tasks like contour linking.
In addition to measures, principles, and theories, there is a fourth dimension, orthogonal to the other three, to explore: the problem of estimating entropy, the fundamental quantity in IT. Many other measures (mutual information, Kullback–Leibler divergence, and so on) are derived from entropy. Consequently, in many cases, the practical application of the latter elements relies on a consistent estimation of entropy. The two extreme cases are the plug-in methods, in which the estimation of the probability density precedes the computation of entropy, and the bypass methods, where entropy is estimated directly. For example, since the Gaussian distribution has maximum entropy among all distributions with the same variance, entropy estimation is key in Gaussian mixture models, typically used as classifiers, and also in ICA methods, which usually rely on the departure from Gaussianity. Thus, entropy estimation will be a recurrent topic throughout the book.
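To make the contrast concrete, the sketch below compares a plug-in estimator (a histogram density is built first and its entropy computed afterwards) with a bypass estimator (a simple one-dimensional Kozachenko–Leonenko estimate obtained directly from nearest-neighbor spacings). It is only an illustrative sketch under our own choices (32 bins, first nearest neighbor), not the estimators studied later in the book.

```python
import numpy as np
from scipy.special import digamma

def plugin_entropy(samples, bins=32):
    """Plug-in estimate: fit a histogram density first, then compute the
    differential entropy of that density (in nats)."""
    counts, edges = np.histogram(samples, bins=bins)
    width = edges[1] - edges[0]
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p)) + np.log(width)

def knn_entropy(samples):
    """Bypass estimate (1-D Kozachenko-Leonenko flavor): entropy is obtained
    directly from nearest-neighbour spacings, with no explicit density."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    gaps = np.diff(x)
    eps = np.empty(n)
    eps[0], eps[-1] = gaps[0], gaps[-1]
    eps[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    eps = np.maximum(eps, 1e-12)          # guard against repeated samples
    # psi(n) - psi(1) + log(c_1) + mean(log eps), with c_1 = 2 for d = 1
    return digamma(n) - digamma(1) + np.log(2.0) + np.mean(np.log(eps))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=5000)
    exact = 0.5 * np.log(2 * np.pi * np.e)   # entropy of N(0, 1) in nats
    print(plugin_entropy(data), knn_entropy(data), exact)
```

Both estimates should land close to the closed-form Gaussian entropy for a well-behaved sample; the interesting differences appear for small samples or high dimensions, which is precisely where bypass methods become attractive.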
3 Data driven Markov Chain Monte Carlo.
starting with a unique kernel and decomposing it when needed. This is the
entropy-based EM (EBEM) approach, and it consists of splitting a Gaussian
kernel when bi-modality in the underlying data is suspected. As Gaussian dis-
tributions are the maximum-entropy ones among all distributions with equal
variance, it seems reasonable to measure non-Gaussianity as the ratio between
the entropy of the underlying distribution and the theoretical entropy of the
kernel, when it is assumed to be Gaussian. As the latter criterion implies
measuring entropy, we may use either a plug-in method or a bypass one (e.g.
entropic graphs). Another trend in this chapter is the Information Bottleneck
(IB) method, whose general idea has already been introduced. Here, we start by introducing the measure of distortion between examples and prototypes. Exploiting Rate Distortion Theory in order to constrain the distortion (otherwise information about the examples is lost – maximum compression), a variational formulation yields an iterative algorithm (Blahut–Arimoto) for finding the optimal partition of the data given the prototypes, but not for obtaining the prototypes themselves. IB arises from another fundamental question: which distortion measure to use. It turns out that trying to preserve the relevant information in the prototype about another variable, which is easier than finding a good distortion measure, leads to a new variational problem, and the Kullback–Leibler divergence emerges as the natural distortion measure. The new algorithm relies on deterministic annealing. It starts with one cluster and progresses by splitting. An interesting variation of the basic algorithm is its agglomerative version (start from as many clusters as patterns and build a tree through different levels of abstraction). There is also recent work (RIC4) that addresses the problem of model-order selection through learning theory (VC-dimension) and eliminates outliers by following a channel-capacity criterion. The next item is to analyze how the now-classic and efficient mean-shift clustering may be posed in IT terms. More precisely, the key is to minimize Rényi's quadratic entropy. The last topic of the chapter, clustering ensembles, seeks to obtain combined clusterings/partitions in an unsupervised manner, so that the resulting clustering has better quality than the individual ones. Here, combination means some kind of consensus. There are several definitions of consensus, and one of them, median consensus, can be found by maximizing the information sharing between several partitions.
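As a rough illustration of that last idea, the mutual information between two partitions of the same data can be computed from their contingency table; a median-consensus partition is then the one maximizing the summed information sharing with all partitions in the ensemble. The function below is an illustrative stand-in, not the algorithm reviewed in Chapter 5.

```python
import numpy as np

def partition_mutual_information(labels_a, labels_b):
    """I(A; B) in nats between two clusterings (integer label arrays) of the
    same points, computed from their joint contingency table."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        joint[i, j] += 1.0
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)     # marginal of partition A
    pb = joint.sum(axis=0, keepdims=True)     # marginal of partition B
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz]))

# Two partitions that agree up to a relabeling share maximal information.
print(partition_mutual_information([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2]))
```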
After reviewing IT solutions to the clustering problem, Chapter 6 deals with a fundamental question, feature selection, which has a deep impact both on clustering and on classifier design (which will be tackled in Chapter 7). We review filters and wrappers for feature selection from the IT perspective. Considering wrappers, where the selection of a group of features conforms to the performance induced in a supervised classifier (good generalization for unseen patterns), mutual information plays an interesting role. For instance, maximizing mutual information between the unknown true labels associated with a subset of features and the labels predicted by the classifier seems a good
4 Robust Information Clustering.
criterion. However, the feature selection process (a local search) is complex, and moving beyond a greedy solution quickly becomes impractical (exponential cost), though it can be tackled through genetic algorithms. On the other hand, filters rely on statistical tests for predicting the goodness of the future classifier for a given subset of features. Recently, mutual information has emerged as the source of more complex filters. However, the curse of dimensionality precludes an extended use of such a criterion (maximal dependence between features and classes), unless a fast (bypass) method for entropy estimation is used. Alternatively, it is possible to formulate first-order approximations via the combination of simple criteria like maximal relevance and minimal redundancy. Maximal relevance consists of maximizing mutual information between isolated features and target classes. However, when this is done in a greedy manner it may yield redundant features that should be removed from the quasi-optimal subset (minimal redundancy). The combination is dubbed mRMR.
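A minimal sketch of the greedy selection just described may help fix the idea. It assumes the mutual information values have already been estimated (mi_fc[i] = I(x_i; c) for relevance, mi_ff[i, j] = I(x_i; x_j) for redundancy); the names and interface are ours, not those of the original mRMR implementation.

```python
import numpy as np

def mrmr_select(mi_fc, mi_ff, n_select):
    """Greedy mRMR: at each step pick the feature maximizing relevance minus
    the mean redundancy with respect to the already selected set."""
    mi_fc = np.asarray(mi_fc, float)
    remaining = list(range(len(mi_fc)))
    selected = [int(np.argmax(mi_fc))]        # most relevant feature first
    remaining.remove(selected[0])
    while len(selected) < n_select and remaining:
        scores = [mi_fc[f] - np.mean([mi_ff[f, s] for s in selected])
                  for f in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
```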
A good theoretical issue is the connection between incremental feature selection and the maximum dependency criterion. It is also interesting to combine these criteria with wrappers, and also to explore their impact on classification errors. The next step in Chapter 6 is to tackle feature selection for generative models. More precisely, the texture models presented in Chapter 3 (segmentation) may be learned through the application of the minimax principle. Maximum entropy has been introduced and discussed in Chapter 3 as a basis for model learning. Here we present how to specialize this principle to the case of learning textures from examples. This may be accomplished by associating features with filter-response histograms and exploiting Markov random fields (the FRAME approach: Filters, Random Fields and Maximum Entropy). Maximum entropy imposes matching between the filter statistics of the texture samples and those of the textures generated through Gibbs sampling. On the other hand, filter selection, attending to minimum entropy, should minimize the Kullback–Leibler divergence between the obtained density and the unknown density. Such minimization may be implemented by a greedy process focused on selecting the next feature inducing the maximum decrease of Kullback–Leibler divergence with respect to the existing feature set. However, as the latter divergence is complex to compute, the L1 norm between the observed statistics and those of the synthesized texture, both evaluated for the new feature, is maximized instead. Finally, in Chapter 6,
we cover the essential elements necessary for finding an adequate projection basis for vectorial data and, especially, images. The main concern with respect to IT is the choice of the measures for quantifying the interest of a given projection direction. As we noted at the beginning of this introduction, projection bases whose components are as independent as possible seem more interesting, in terms of pattern recognition, than those whose components are simply decorrelated (PCA). Independence may be maximized by maximizing the departure from Gaussianity, and this is what many ICA algorithms do. Thus, the concept of neg-entropy (the difference between the entropy of a Gaussian and that of the current output/component distribution) arises. For instance, the well-known FastICA algorithm is a fixed-point method driven by
Fig. 1.1. The ITinCVPR tube/underground (lines) communicating several problems (quarters) and stopping at several stations. See Color Plates.
We finish this chapter, and the book, with an in-depth review of maximum entropy classification, the exponential family of distributions and their links with information projection, and, finally, the recent results on the implications of Bregman divergences in classification.
1.3 The ITinCVPR Roadmap
The main idea of this section is to describe the book graphically as the map of a tube/underground communicating several CVPR problems (quarters). The central quarter is recognition and matching, which is adjacent to all the others. The rest of the adjacency relations (↔) are: interest points and edges ↔ segmentation ↔ clustering ↔ classifier design ↔ feature selection and transformation. Tube lines are associated with measures, principles, or theories. Stations are associated with the significant concepts of each line. The case of transfer stations, where one can change from one line to another carrying the concept acquired in the former line, and that of the special stations associated with entropy estimation, are especially interesting. This idea is illustrated in Fig. 1.1, and it is a good figure to revisit as the following chapters are understood.
2
Interest Points, Edges, and Contour Grouping
2.1 Introduction
This chapter introduces the application of information theory to the field of
feature extraction in computer vision. Feature extraction is a low-level step in
many computer vision applications. Its aim is to detect visual clues in images that help to improve the results or the speed of such algorithms. The first part of the chapter is devoted to the Kadir and Brady scale saliency algorithm. This algorithm searches for the most informative regions of an image, that is, the regions that are salient with respect to their local neighborhood. The algorithm is based on the concept of Shannon's entropy. This section ends with a modification of the original algorithm, based on two information-theoretic divergence measures: Chernoff information and mutual information. The next section is devoted to the statistical edge detection work by Konishi et al. In this work, Chernoff information and mutual information, two additional measures that will be applied several times throughout this book, are used to evaluate classifier performance. Alternative uses of information theory include the theoretical study of some properties of algorithms. Specifically, Sanov's theorem and the theory of types show the validity of the road tracking among clutter approach of Coughlan et al. Finally, the present chapter introduces an algorithm by Cazorla et al., which is aimed at detecting another type of image feature: junctions. The algorithm is based on some of the principles explained here.
images) in order to produce a result. Repeating the same operations for all
pixels in a whole set of images may produce an extremely high computational
burden, and as a result, these applications may not be prepared to operate
in real time. Feature extraction in image analysis may be understood as a
preprocessing step, in which the objective is to provide a set of regions of the
image that are informative enough to successfully complete the previously
mentioned tasks. In order to be valid, the extracted regions should be invari-
ant to common transformations, such as translation and scale, and also to
more complex transformations. This is useful, for instance, when trying to
recognize an object from different views of the scene, or when a robot must
compare the image it has just taken with the ones in its database in order to
determine its localization on a map.
An extensive number of different feature extraction algorithms have been developed during the last few years, the best known being the multiscale generalization of the Harris corner detector and its multiscale modification [113], or the recent Maximally Stable Extremal Regions algorithm [110], a fast, elegant and accurate approach. However, if we must design an algorithm to search for informative regions in an image, then it is clear that an information theory-based solution may be considered. In this section we explain how Gilles's first attempt based on entropy [68] was first generalized by Kadir and Brady [91] to be invariant to scale transformations, and then generalized again to be invariant to affine transformations. We will also discuss the computational cost of this algorithm, its main weakness, and how it can be reduced by means of an analysis of the entropy measure.
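Before turning to the figures, a hedged sketch of the core computation may help: for one pixel, the entropy of the local grey-level histogram is tracked across scales and weighted by the inter-scale self-dissimilarity, in the spirit of Kadir and Brady. The square windows, 8-bit grey levels and the s²/(2s − 1) weighting are choices of this illustration, not a faithful reimplementation.

```python
import numpy as np

def scale_saliency(img, x, y, s_min=3, s_max=20, bins=16):
    """Entropy of the local grey-level histogram of pixel (x, y) across
    scales, weighted by inter-scale self-dissimilarity (Kadir-Brady style).
    Returns an array indexed by scale; peaks mark candidate salient scales."""
    hists, H = [], []
    for s in range(s_min, s_max + 1):
        patch = img[max(0, y - s):y + s + 1, max(0, x - s):x + s + 1]
        h, _ = np.histogram(patch, bins=bins, range=(0, 256))
        h = h / h.sum()
        hists.append(h)
        nz = h > 0
        H.append(-np.sum(h[nz] * np.log2(h[nz])))    # local entropy H_D(s, x)
    saliency = np.zeros(len(H))
    for i in range(1, len(H)):
        s = s_min + i
        # inter-scale self-dissimilarity weighting W_D
        W = (s * s / (2.0 * s - 1.0)) * np.abs(hists[i] - hists[i - 1]).sum()
        saliency[i] = H[i] * W
    return saliency
```

Scanning this quantity over all pixels and keeping the (position, scale) pairs where it peaks gives, roughly, the kind of output illustrated in the figures that follow.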
Fig. 2.1. Comparison of Gilles and scale saliency algorithms. Left: original synthetic
image. Center: results for Gilles algorithm, using |Rx | = 9, and showing only one
extracted region for each black circle. Right: scale saliency output, without clustering
of overlapping results.
The main drawback of this method is its time complexity, due to the fact that every pixel must be processed at every scale. As a consequence, the Kadir and Brady scale saliency feature extractor is the slowest of the modern feature extraction algorithms, as recent surveys suggest. However, complexity may be remarkably decreased if we consider a simple idea: given a pixel x and two different neighborhoods Rx and R'x, if Rx is homogeneous and |Rx| > |R'x|,
Fig. 2.2. Left: original synthetic image, formed by anisotropic regions. Center:
output of the scale saliency algorithm. Right: output of the affine invariant scale
saliency algorithm.
Fig. 2.3. Evolution of saliency through scale space. The graph on the right shows how the entropy value (z axis) of all pixels in the row highlighted on the left image (x axis) varies from smin = 3 to smax = 20 (y axis). As can be seen, there are no abrupt changes for any pixel in the scale range.
then the probability of R'x also being homogeneous is high. Figure 2.3 shows that this idea is supported by the evolution of entropy through scale space; the entropy value of a pixel varies smoothly through the different scales.
Therefore, a preprocessing step may be added to the original algorithm in order to discard several pixels from the image, based on the detection of homogeneous regions at smax. This preliminary stage is summarized as follows:
1. Calculate the local entropy HD for each pixel at scale smax .
2. Select an entropy threshold σ ∈ [0, 1].
3. X = {x | H_D(x, s_max) / max_x{H_D(x, s_max)} > σ}.
4. Apply scale saliency algorithm only to those pixels x ∈ X.
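Assuming a per-pixel entropy map at s_max has already been computed (for example with a routine like the one sketched above), step 3 reduces to a one-line relative-entropy test; the snippet below is only an illustrative rendering of that step, with names of our own choosing.

```python
import numpy as np

def prefilter(entropy_smax, sigma):
    """Keep only pixels whose local entropy at s_max is at least a fraction
    sigma of the image-wide maximum (step 3 of the listing above)."""
    rel = entropy_smax / entropy_smax.max()
    ys, xs = np.nonzero(rel > sigma)
    return set(zip(xs.tolist(), ys.tolist()))   # the set X of retained pixels
```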
It must be noted that the algorithm considers the relative entropy with re-
spect to the maximum entropy value at smax for the image in step 3. This way,
Fig. 2.4. pon(θ) (solid plot) and poff(θ) (dashed plot) distributions estimated from all pixels of a set of images belonging to the same image category. Threshold θ1 ensures that the filter does not remove, from the training images, any pixel belonging to the most salient regions; however, in the case of new input images, we may find salient regions whose relative entropy is lower than this threshold. Choosing threshold θ2 increases the amount of filtered points in exchange for increasing the probability of filtering salient regions of the image. Finally, threshold θ3 assumes a higher risk, since more salient and nonsalient regions of the image will be filtered. However, the probability of a nonsalient pixel being removed from the image is still higher than that of a salient one.
Information. The expected error rate of a likelihood test based on pon (φ)
and poff (φ) decreases exponentially with respect to C(pon (φ), poff (φ)), where
C(p, q) is the Chernoff Information between two probability distributions p
and q, and is defined by
C(p, q) = -\min_{0 \le \lambda \le 1} \log \sum_{j=1}^{J} p^{\lambda}(y_j)\, q^{1-\lambda}(y_j)   (2.3)
where {y_j : j = 1, . . . , J} are the values over which the distributions are defined (in this case, the probability of each relative entropy value in [0, 1]). Chernoff Information quantifies how easy it is to tell from which of the two distributions a set of values came. This measure may be used as a homogeneity estimator for an image class during training. If the Chernoff Information of a training set is low, then the images in that image class are not homogeneous enough and it must be split into two or more classes. A related measure is the Bhattacharyya distance, a particular case of Chernoff Information in which λ = 1/2:
BC(p, q) = -\log \sum_{j=1}^{J} p^{1/2}(y_j)\, q^{1/2}(y_j)   (2.4)
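To make Eqs. 2.3 and 2.4 concrete, the snippet below evaluates both quantities for two discretized distributions (histograms over the same bins). The one-dimensional minimization over λ uses SciPy's bounded scalar optimizer, which is our choice for this sketch rather than anything prescribed by the text.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def chernoff_information(p, q):
    """Chernoff Information of Eq. 2.3 between two discrete distributions."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    keep = (p > 0) | (q > 0)          # drop bins empty in both histograms
    p, q = p[keep], q[keep]
    log_sum = lambda lam: np.log(np.sum(p ** lam * q ** (1.0 - lam)))
    res = minimize_scalar(log_sum, bounds=(0.0, 1.0), method="bounded")
    return -res.fun

def bhattacharyya_distance(p, q):
    """Special case lambda = 1/2 of Eq. 2.4."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    return -np.log(np.sum(np.sqrt(p * q)))
```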
2. Evaluate C(pon(θ), poff(θ)). A low value means that the image class is not homogeneous enough, and it is not possible to learn a good threshold. In this case, split the image class into new subclasses and repeat the process for each of them.
3. Estimate the range limits −D(poff(θ)||pon(θ)) and D(pon(θ)||poff(θ)).
4. Select a threshold in the range given by the Kullback–Leibler divergences, −D(poff(θ)||pon(θ)) < T < D(pon(θ)||poff(θ)). The minimum T value in this range is a conservative but good trade-off between efficiency and low error rate. Higher T values will increase the error rate according to C(pon(θ), poff(θ)).
Then, new images belonging to the same image category can be filtered
before applying the scale saliency algorithm, discarding points that probably
are not part of the most salient features:
1. Calculate the local relative entropy θx = HDx /Hmax at smax for each pixel
x, where Hmax is the maximum entropy value for any pixel at smax .
2. X = {x | log(pon(θx)/poff(θx)) > T}, where T is the learned threshold for the image class that the input image belongs to.
3. Apply the scale saliency algorithm only to pixels x ∈ X.
In order to demonstrate the validity of this method, Table 2.1 shows some experimental results based on the well-known Object Categories dataset from the Visual Geometry Group, freely available on the web. This dataset is composed of several sets of images representing different image categories. These results were extracted following these steps: first, the training process was applied to each image category, selecting a random 10% of the images as the training set. The result of this first step is a range of valid thresholds for each image category. Chernoff Information was also estimated. Then, using the rest of the images from each category as the test set, we applied the filtering algorithm, using two different thresholds in each case: the minimum valid value for each category and T = 0. Table 2.1 shows the results for each image category, including the mean percentage of points (% points) filtered and the mean percentage of time (% time) saved, depending on the threshold used. The last column shows the mean localization error of the extracted features (ε):

ε = \frac{1}{N} \sum_{i=1}^{N} \frac{d(A_i, B_i) + d(B_i, A_i)}{2}   (2.7)
where N is the number of images in the test set, Ai represents the clustered
most salient regions obtained after applying the original scale saliency algo-
rithm to image i, Bi represents the clustered most salient regions obtained
from the filtered scale saliency algorithm applied to image i, and
d(A, B) = \sum_{a \in A} \min_{b \in B} \| a - b \|   (2.8)
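A small sketch of the localization error of Eqs. 2.7–2.8, reading d(A, B) as the sum, over the regions of A, of the distance to the closest region of B; region centres are assumed to be 2-D coordinates. This is our reading of the notation, kept deliberately simple.

```python
import numpy as np

def asymmetric_distance(A, B):
    """d(A, B) of Eq. 2.8: for each centre in A, distance to its closest
    centre in B, summed over A."""
    A = np.asarray(A, float)
    B = np.asarray(B, float)
    return sum(np.min(np.linalg.norm(B - a, axis=1)) for a in A)

def localization_error(pairs):
    """Eq. 2.7: average symmetrized distance over the N test images;
    `pairs` is a list of (A_i, B_i) arrays of region centres."""
    return np.mean([(asymmetric_distance(A, B) + asymmetric_distance(B, A)) / 2.0
                    for A, B in pairs])
```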
Table 2.1. Application of the filtered scale saliency to the Visual Geometry Group image categories database.

Test set          Chernoff      T   % Points   % Time        ε
Airplanes side       0.415  −4.98      30.79    42.12   0.0943
                                0      60.11    72.61   2.9271
Background           0.208  −2.33      15.89    24.00   0.6438
                                0      43.91    54.39   5.0290
Bottles              0.184  −2.80       9.50    20.50   0.4447
                                0      23.56    35.47   1.9482
Camel                0.138  −2.06      10.06    20.94   0.2556
                                0      40.10    52.43   4.2110
Cars brad            0.236  −2.63      24.84    36.57   0.4293
                                0      48.26    61.14   3.4547
Cars brad bg         0.327  −3.24      22.90    34.06   0.2091
                                0      57.18    70.02   4.1999
Faces                0.278  −3.37      25.31    37.21   0.9057
                                0      54.76    67.92   8.3791
Google things        0.160  −2.15      14.58    25.48   0.7444
                                0      40.49    52.81   5.7128
Guitars              0.252  −3.11      15.34    26.35   0.2339
                                0      37.94    50.11   2.3745
Houses               0.218  −2.62      16.09    27.16   0.2511
                                0      44.51    56.88   3.4209
Leaves               0.470  −6.08      29.43    41.44   0.8699
                                0      46.60    59.28   3.0674
Motorbikes side      0.181  −2.34      15.63    27.64   0.2947
                                0      38.62    51.64   3.7305
Fig. 2.6. Left: an example of natural scene containing textured regions. Center:
output of the Canny algorithm applied to the left image. A high amount of edges
appear as a consequence of texture and clutter; these edges do not correspond to
object boundaries. Right: an example of ideal edge detection, in which most of the
edges are part of actual object boundaries.
But more interesting is that they use Chernoff Information and conditional
entropy to evaluate the effect on edge extraction performance of several as-
pects of their approach: the use of a set of scales rather than only one scale,
the quantization of histograms used to represent probability densities, and the
effect of representing the scale space as an image pyramid, for instance. Thus,
information theory is introduced as a valid tool for obtaining information
about statistical learning processes.
Conditional entropy H(Y |X) is a measure of the remaining entropy or
uncertainty of a random variable Y , given another random variable X, the
value of which is known. A low value of H(Y |X) means that the variable X
yields a high amount of information about variable Y, making it easier to predict
its value. In the discrete case, it can be estimated as
H(Y|X) = \sum_{i=1}^{N} p(x_i)\, H(Y|X = x_i) = -\sum_{i=1}^{N} p(x_i) \sum_{j=1}^{M} p(y_j|x_i) \log p(y_j|x_i)   (2.9)
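A direct transcription of Eq. 2.9 for the discrete case, starting from a joint count table of (x_i, y_j) occurrences; the table layout (rows = values of X, columns = values of Y) is an assumption of this sketch. The same routine can serve to compare two filters via conditional entropy, as asked later in Problem 2.6.

```python
import numpy as np

def conditional_entropy(joint_counts):
    """H(Y|X) in bits from a joint count table (rows = x values,
    columns = y values), following Eq. 2.9."""
    joint = np.asarray(joint_counts, float)
    pxy = joint / joint.sum()                      # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)            # marginal p(x)
    p_y_given_x = np.divide(pxy, px, out=np.zeros_like(pxy), where=px > 0)
    nz = p_y_given_x > 0
    return -np.sum(pxy[nz] * np.log2(p_y_given_x[nz]))
```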
We will explain below how this measure may be applied to the context of
classification evaluation.
The Kadir and Brady feature extractor filter described above is inspired by statistical edge detection. It implies learning two probability distributions, named pon(φ) and poff(φ), giving the probability of a pixel being part of an edge or not, depending on the response of a certain filter φ. In the work by Konishi et al., the set of filters used for edge detection was a simple one (gradient magnitude, Nitzberg, Laplacian) applied to different data obtained from the images (gray-scale intensity, full color intensities and chrominance).
Then, the log-likelihood ratio may be used to categorize any pixel I(x) on
an image I, classifying it as an edge pixel if this log-likelihood ratio is over a
given threshold:
\log \frac{p_{on}(\phi(I(x)))}{p_{off}(\phi(I(x)))} > T   (2.10)
These distributions may be learnt from an image dataset that includes a
groundtruth edge segmentation indicating the real boundaries of the objects
in a sequence of images (like, for instance, the right image in Fig. 2.6). Two
well-known examples are the South Florida and the Sowerby datasets, which,
unfortunately, are not freely available. The presence of poff (φ) is an improve-
ment over traditional edge detection methods. The only information used by
edge detection algorithms based on convolution masks is local information
close to the edges; poff (φ) may help to remove edges produced by background
clutter and textures.
Fig. 2.7. pon(φ) (solid line) and poff(φ) (dashed line) for two example filter-based classifiers. The overlap of the distributions on the left is lower than in the case on the right. In the first case, the Chernoff Information value is 0.508, while in the second case the value is 0.193. It can be clearly seen that the first classifier will distinguish better between on-edge and off-edge pixels.
The edge localization problem can be posed as the classification of all pixels on
an image as being an edge or not depending on their distance to their nearest
edge. This problem may be seen as a binary or a multiclass classification problem. Binary classification means that pixels are split into two categories: pixels
whose nearest edge is below or over a certain distance. In the case of multiple
class classification, each pixel is assigned a category depending on the distance
to its nearest edge.
Let us first deal with binary classification. Given the function w(x), which assigns to each pixel the distance to its nearest edge, and a threshold w, pixels can be split into two groups or classes:

\alpha_1 = \{x : w(x) \le w\}, \qquad \alpha_2 = \{x : w(x) > w\}   (2.11)
Fig. 2.8. Evolution of Chernoff Information for an edge classifier based on different
filters when applied to full color, gray scale and chrominance information extracted
from the images of the South Florida dataset, when using scale σ = 1, two scales
σ = {1, 2}, and three scales σ = {1, 2, 4}. Chernoff Information increases as more
scales are included; the conclusion is that a multiscale edge detection approach will
perform better than a monoscale one. Furthermore, color information yields even
higher Chernoff Information values, so this information is useful for this task. Further
experiments of Chernoff Information evolution depending on scale in [98] show that
if only one scale can be used, it would be better to choose an intermediate one.
(Figure by Konishi et al. [99], © 2003 IEEE.)
Using the training groundtruth edge information, the p(φ|α1 ) and p(φ|α2 )
conditional distributions, as well as prior distributions p(α1 ) and p(α2 ) must
be estimated. The classification task now is simple: given a pixel x and the response of the filter φ for that pixel, φ(x) = y, Bayes' rule yields p(α1|φ(x) = y) and p(α2|φ(x) = y), allowing the classification algorithm to decide on the class of pixel x. Chernoff Information may also be used to evaluate the performance of the binary edge localization. A summary of several experiments of Konishi et al. [98] supports the coarse-to-fine edge localization idea: the coarse step consists of looking for the approximate localization of the edges in the image using only an optimal scale σ∗. In this case, w = σ∗. Then, in the fine step, filters
based on lower scales are applied in order to refine the search. However, the
parameter σ ∗ depends on the dataset.
Multiclass classification differs from the previous task in the number of
classes considered, which is now greater than two. For instance, we
could split the pixels into five different classes:
\alpha_1 = \{x : w(x) = 0\}
\alpha_2 = \{x : w(x) = 1\}
\alpha_3 = \{x : w(x) = 2\}   (2.12)
\alpha_4 = \{x : 2 < w(x) \le 4\}
\alpha_5 = \{x : w(x) > 4\}
Once again, and using the training groundtruth, we can estimate p(φ|αi )
and p(αi ) for i = 1 . . . C, C being the number of classes (5 in our example).
Then, any pixel x is assigned class α∗ , being
H(\alpha|\phi) = -\sum_{y} \sum_{i=1}^{C} p(\alpha_i|\phi = y)\, p(y) \log p(\alpha_i|\phi = y)   (2.14)
Fig. 2.9. Left: two examples of how Chernoff Information evolves depending on the number of cuts of the decision tree, which defines the probability distribution quantization. An asymptote is reached soon; most of the information can be obtained using a low number of histogram bins. Right: example of overlearning. As the number of cuts increases an asymptote is reached, but, at a certain point, the value starts to increase again. This is an evident sign of overlearning. (Figure by Konishi et al. [99], © 2003 IEEE.)
reach an asymptote, that is, only a relatively low number of bins is needed to obtain a good quantization. On the other hand, in Fig. 2.9 (right), we can see the overlearning effect. If the number of cuts is high enough, an abrupt increase of the Chernoff Information value appears. This abrupt increase is produced by the fact that the chosen quantization is overfitted to the training data, for which the classification task will be remarkably easier; however, the obtained classifier will not be valid for examples different from those in the training dataset.
\frac{1}{(N+1)^J}\, 2^{-N D(\phi^* \| P_s)} \le p(\phi \in E) \le (N+1)^J\, 2^{-N D(\phi^* \| P_s)}   (2.16)
2.4 Finding Contours Among Clutter 27
Fig. 2.10. The triangle represents the set of probability distributions, E being a subset within this set. Ps is the distribution which generates the samples. Sanov's theorem states that the probability that a type lies within E is determined by the distribution P∗ in E which is closest to Ps.
We base our discussion on the road tracking work of Coughlan et al. [42], which, in turn, is based on a previous work by Geman and Jedynak. The underlying problem to solve is not the key question here, since the main objective of their work is to propose a Bayesian framework and, from this framework, to analyze when this kind of problem may be solved. They also focus on studying the probability that the A∗ algorithm yields an incorrect solution.
Geman and Jedymack’s road tracking is a Bayesian Inference problem
in which only a road must be detected in presence of clutter. Rather than
detecting the whole road, it is splitted into a set of equally long segments.
From an initial point and direction, and if road’s length is N , all the possible
QN routes are represented as a search tree (see Fig. 2.11), where Q is the
tree’s branching factor. There is only a target route, and thus in the worst
case, the search complexity is exponential. The rest of paths in the tree are
considered distractor paths.
We now briefly introduce the problem's notation. Each route in the tree is represented by a set of movements {t_i}, where t_i ∈ {b_v}. The set {b_v} forms an alphabet of size Q corresponding to the Q possible alternatives at each segment's
Fig. 2.11. Left: search tree with Q = 3 and N = 3. This search tree represents all
the possible routes from initial point (at the bottom of the tree) when three types of
movement can be chosen at the end of each segment: turning 15◦ to the left, turning
15◦ to the right and going straight. Right: Search tree divided into different sets: the
target path (in bold) and N subsets F1 , . . . , FN . Paths in F1 do not overlap with the
target path, paths in F2 overlap with one segment, and so on.
end (turn 15◦ left, turn 15◦ right, and so on). Each route has an associated
prior probability given by
p(\{t_i\}) = \prod_{i=1}^{N} p_G(t_i)   (2.17)
where p_G is the probability of each transition. From now on, we assume that all transitions are equiprobable. A set of movements {t_i} is represented by a set of tree segments X = {x_1, . . . , x_N}. Considering 𝒳, the set of all Q^N tree segments, it is clear that X ∈ 𝒳. Moreover, an observation y_x is made for each x ∈ 𝒳, giving Y = {y_x : x ∈ 𝒳}. In road tracking systems, observation values are obtained from a filter previously trained with road and non-road segments. Consequently, the distributions p_on(y_x) and p_off(y_x) give the probability of y_x being obtained from a road or a non-road segment. Each route {t_i} with segments {x_i} is associated with a set of observations {y_{x_i}} that take values from an alphabet {a_μ} of size J.
Geman and Jedynak formulate road tracking in Bayesian Maximum a Posteriori (MAP) terms:

p(X|Y) = \frac{p(Y|X)\, p(X)}{p(Y)}   (2.18)
where the prior is given by

p(X) = \prod_{i=1}^{N} p_G(t_i)   (2.19)
and

p(Y|X) = \prod_{x \in X} p_{on}(y_x) \prod_{x \in \mathcal{X} \setminus X} p_{off}(y_x)
       = \prod_{i=1..N} \frac{p_{on}(y_{x_i})}{p_{off}(y_{x_i})} \prod_{x \in \mathcal{X}} p_{off}(y_x)
       = F(Y) \prod_{i=1..N} \frac{p_{on}(y_{x_i})}{p_{off}(y_{x_i})}
In the latter equation, F(Y) = \prod_{x \in \mathcal{X}} p_{off}(y_x) is independent of X and can be ignored. In order to find the target route, p(Y|X)p(X) must be maximized.
r(\{t_i\}, \{y_i\}) = \sum_{i=1}^{N} \log \frac{p_{on}(y_i)}{p_{off}(y_i)} + \sum_{i=1}^{N} \log \frac{p_G(t_i)}{U(t_i)}   (2.20)

\sum_{i=1}^{N} \log U(t_i) = -N \log Q   (2.21)

\alpha_\mu = \log \frac{p_{on}(a_\mu)}{p_{off}(a_\mu)}, \quad \mu = 1, \ldots, J   (2.23)

\beta_v = \log \frac{p_G(b_v)}{U(b_v)}, \quad v = 1, \ldots, Q   (2.24)

\phi_\mu = \frac{1}{N} \sum_{i=1}^{N} \delta_{y_i, a_\mu}, \quad \mu = 1, \ldots, J   (2.25)

\psi_v = \frac{1}{N} \sum_{i=1}^{N} \delta_{t_i, b_v}, \quad v = 1, \ldots, Q   (2.26)
It must be noted that the Kronecker delta function δi,j is present in the
latter equation.
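As an informal illustration of the reward in Eq. 2.20 (with the uniform prior U(t) = 1/Q of Eq. 2.21), the function below scores one candidate path from its observations and moves; the dictionary-based interface is an assumption made for the sketch, not part of the original formulation.

```python
import numpy as np

def path_reward(observations, moves, p_on, p_off, p_g, Q):
    """Reward r({t_i}, {y_i}) of Eq. 2.20: a data term (log-likelihood ratio
    of each segment observation) plus a geometric term (prior of each move
    against the uniform distribution U = 1/Q)."""
    data = sum(np.log(p_on[y] / p_off[y]) for y in observations)
    geometry = sum(np.log(p_g[t] * Q) for t in moves)   # log(p_G / (1/Q))
    return data + geometry

# Example with Q = 3 moves (left, straight, right) and binary filter outputs.
p_on, p_off = {0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}
p_g = {"L": 0.25, "S": 0.5, "R": 0.25}
print(path_reward([1, 1, 0, 1], ["S", "S", "L", "S"], p_on, p_off, p_g, Q=3))
```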
Coughlan et al. address this question: given a specific road tracking problem, may the target road be found by means of MAP? Answering this question amounts to estimating the probability that the reward of some distractor path is higher than that of the target path; if this is likely, the problem cannot be solved by any algorithm. For instance, let us consider the probability distribution p1,max(rmax/N) of the maximum normalized reward of all paths in F1
Fig. 2.12. Three cases of p1,max (rmax /N ) (solid line) vs. p̂T (rT /N ) (dashed line).
In the first case, the reward of the target path is higher than the largest distractor
reward, and as a consequence the task is straightforward. In the second case, the
task is more difficult due to the overlapping of both distributions. In the last case,
the reward of the target path is lower than the largest distractor rewards. Thus, it
is impossible to find the target path.
Fig. 2.13. Road tracking (black pixels) from the initial point (at the top of each graph) in the presence of increasing clutter. The value of the parameter K decreases from left to right, the first example being a simple problem and the last one an almost impossible one.
(see Fig. 2.11) and the probability distribution p̂T(rT/N) of the normalized reward of the target path. Figure 2.12 compares both. Using a method similar to Sanov's theorem, it is possible to obtain an order parameter K, given by

K = D(p_{on} \| p_{off}) + D(p_G \| U) - \log Q,
that clarifies if the task can be solved. When K > 0, p̂T (rT /N ) lies at the
right of p1,max (rmax /N ), thus finding the target path is straightforward. If
K ≈ 0, both distributions overlap and detecting the target path is not as
simple. Finally, when K < 0, p̂T (rT /N ) lies at the left of p1,max (rmax /N ), and
it is impossible to find the target path. An example is shown in Fig. 2.13.
The parameter K intuitively describes the difficulty of the task. For instance, the term D(pon||poff) is directly related to the filter quality: the higher this divergence is, the easier it is to discriminate road and non-road segments, and clearly a better filter facilitates the road tracking task. The term D(pG||U), on the other hand, refers to the a priori information that is known about the shape of the road. If pG = U, there is no prior information. Finally, Q measures the number of distractor paths present. Therefore, K increases when we have a low-error road detection filter, strong a priori information, and few distractor paths.
where

\phi_{Bh}(y) = \frac{\{p_{on}(y)\}^{1/2} \{p_{off}(y)\}^{1/2}}{Z_\phi}, \qquad \psi_{Bh}(t) = \frac{\{p_G(t)\}^{1/2} \{U(t)\}^{1/2}}{Z_\psi}   (2.32)
is based on the fact that the A∗ algorithm searches for the segment with the highest reward. This first theorem states that the probability that A∗ searches the last segment of a particular subpath A_{n,i} is less than or equal to P{∃m : S_off(n) ≥ S_on(m)}. The second theorem bounds this probability by something that can be evaluated, and states that:

P\{\exists m : S_{off}(n) \ge S_{on}(m)\} \le \sum_{m=0}^{\infty} P\{S_{off} \ge S_{on}(m)\}   (2.33)
+ \gamma\big\{ m\{\phi^{on}\cdot\alpha - H_L^* + \psi^{on}\cdot\beta - H_P^*\}   (2.41)
 - n\{\phi^{off}\cdot\alpha - H_L^* + \psi^{off}\cdot\beta - H_P^*\}\big\}   (2.42)
where the τ's and γ are Lagrange multipliers. The function f is convex, and so it has a unique minimum. It must be noted that f can be split into four terms of the form n D(\phi^{off} \| P_{off}) + \tau_1 \{\sum_y \phi^{off}(y) - 1\} - n\gamma\, \phi^{off}\cdot\alpha, coupled by shared constants. These terms can be minimized separately:
subject to the constraint given by Eq. 2.36. The value γ = 1/2 yields the unique solution. Moreover, γ = 1/2 yields

\le (n + 1)^{J^2 Q^2}\, C_2(\Psi_2)\, 2^{-n\Psi_1}   (2.46)

where

C_2(\Psi_2) = \sum_{m=0}^{\infty} (m + 1)^{J^2 Q^2}\, 2^{-m\Psi_2}   (2.47)
\tilde{I}_\theta = \frac{1}{r} \sum_{i=1}^{N} l_i I_i   (2.48)
Fig. 2.14. Left: junction parametric model. Right: radius along direction θi discretized as a set of segments li. (Figure by Cazorla and Escolano, © 2003 IEEE.)
Fig. 2.15. Top left: an example image. Top right: value of the log-likelihood ratio
log(pon /poff ) for all pixels. Bottom: magnitude and orientation of the gradient. In
the case of orientation, gray is value 0, white is π and black is −π. (Courtesy of
Cazorla.)
orientation θ∗ in which the pixel lies. Now, the orientations for which Eq. 2.50 is peaked and over a given threshold are selected as wedge limits:

\tilde{I}_\theta = \frac{1}{r} \sum_{i=1}^{N} l_i \log \frac{p_{on}(I_i|\theta^*)}{p_{off}(I_i)}   (2.50)
where U is the uniform distribution and pang(θi − θ∗) is the probability that θi is the correct orientation. Although pang may be empirically estimated, in this case its maximum is assumed to be at a difference of 0 or π. Finally, junctions given by Eq. 2.50 are pruned if M < 2, or if M = 2 and the relative orientation of the two wedges is close to 0 or π. Figure 2.16 shows some examples of application.
E(\{p_j, \alpha_j\}) = \sum_{j=1}^{L} \log \frac{p_{on}(p_j)}{p_{off}(p_j)} + \sum_{j=1}^{L-1} \log \frac{p_G(\alpha_{j+1} - \alpha_j)}{U(\alpha_{j+1} - \alpha_j)}   (2.53)
the first term being the intensity reward and the second term the geometric reward. The log-likelihood in the first term, for a segment of fixed length F, is given by

\log \frac{p_{on}(p_j)}{p_{off}(p_j)} = \frac{1}{F} \sum_{i=1}^{N} l_i \log \frac{p_{on}(I_i|\theta^*)}{p_{off}(I_i)}   (2.54)
Regarding the second term, p_G(\alpha_{j+1} - \alpha_j) models a first-order Markov chain of orientation variables \alpha_j:

p_G(\alpha_{j+1} - \alpha_j) \propto \exp\left(-\frac{C}{2A} |\alpha_{j+1} - \alpha_j|\right)   (2.55)
\frac{1}{L_0} \sum_{j=zL_0}^{(z+1)L_0 - 1} \log \frac{p_{on}(p_j)}{p_{off}(p_j)} < T   (2.56)

\frac{1}{L_0} \sum_{j=zL_0}^{(z+1)L_0 - 1} \log \frac{p_G(\alpha_{j+1} - \alpha_j)}{U(\alpha_{j+1} - \alpha_j)} < \hat{T}   (2.57)
Fig. 2.17. Examples of application of the connecting path search algorithm. (Courtesy of Cazorla.)
Problems
2.1 Understanding entropy
The algorithm by Kadir and Brady (Alg. 1) relies on a self-dissimilarity measure between scales. This measure is aimed at making possible the direct comparison of entropy values for different image region sizes. In general, given a pixel of an image, how does its entropy vary with respect to the pixel neighborhood size? In spite of this general trend, think of an example in which the entropy remains almost constant in the range of
scales between smin and smax . In this last case, would the Kadir and Brady
algorithm select this pixel to be part of the most salient points on the image?
2.2 Symmetry property
Symmetry is one of the properties of entropy. This property states that entropy remains unchanged if the ordering of the data is modified. This property strongly affects the Kadir and Brady feature extractor, due to the fact that it may assign the same saliency value to two visually different regions if they share the same intensity distribution. Give an example of two visually different and equally sized isotropic regions for which p(0) = p(255) = 0.5; for instance, a totally noisy region and a region split into two homogeneous regions. Although Eq. 2.2 assigns the same saliency to both regions, which one may be considered more visually informative? Think of a modification of the Kadir and Brady algorithm that takes the symmetry property into account.
2.3 Entropy limits
Given Eq. 2.2, show why homogeneous image regions have minimum saliency.
In which cases will the entropy value reach its maximum?
2.4 Color saliency
Modify Alg. 1 in order to adapt it to color images. How does this modification affect the algorithm's complexity? Several entropy estimation methods are presented in the next chapters. In some of these methods, entropy may be estimated without any knowledge about the underlying data distribution (see Chapter 5). Is it possible to base Alg. 1 on any of these methods? And how does this modification affect complexity?
2.5 Saliency numeric example
The following table shows the pixel intensities in a region extracted from an image. Apply Eq. 2.2 to estimate the entropy of the three square-shaped regions (diameters 3, 5 and 7) centered on the highlighted pixel. Did you find any entropy peak? If so, apply self-dissimilarity to weight this value (see Alg. 1).
table shows the real label of six pixels, and the output of both filters for each of them. Evaluate φ1 and φ2 by means of conditional entropy. Which filter discriminates better between on-edge and off-edge pixels?
α φ1 φ2
0 0 0
0 1 2
0 1 2
1 1 1
1 2 0
1 2 1
3
Contour and Region-Based Image Segmentation
3.1 Introduction
One of the most complex tasks in computer vision is segmentation. Seg-
mentation can be roughly defined as optimally segregating the foreground
from the background, or by finding the optimal partition of the image into
its constituent parts. Here optimal segregation means that pixels (or blocks
in the case of textures) in the foreground region share common statistics.
These statistics should be significantly different from those corresponding to
the background. In this context, active polygon models provide a discriminative mechanism for the segregation task. We will show that the Jensen–Shannon (JS) divergence can efficiently drive such a mechanism. Also, the maximum entropy (ME) principle is involved in the estimation of the intensity distribution of the foreground.
It is desirable that the segmentation process achieve good results (comparable to the ones obtained by humans) without any supervision. However, such a lack of supervision only works in limited settings. For instance, in medical image segmentation, it is possible to find the contour that separates a given organ in which the physician is interested. This can be done with a low degree of supervision if one exploits the IT principle of minimum description length (MDL): it is then possible to find the best contour both in terms of organ fitting (segregation) and minimal contour complexity.
There is a consensus in the computer vision community that the maximum degree of unsupervision of a segmentation algorithm is limited in the purely discriminative approach. To overcome these limitations,
some researchers have adopted a mixed discriminative–generative approach to
segmentation. The generative aspect of the approach makes hypotheses about
intensity or texture models, but such hypotheses are contrasted with discrimi-
native (bottom-up) processes. Such approaches are also extended to integrate
where ds is the Euclidean arc-length element, and the circle in the right-
hand integral indicates that the curve is closed (the usual assumption), that
is, if we define p as a parameterization of the curve, then p ∈ [a, b] and
Γ(a) = Γ(b). With t = (dx, dy)^T a vector pointing tangentially along the curve, and assuming that the curve is positively oriented (counterclockwise), we have ds = \sqrt{dx^2 + dy^2}. As the unit normal vector is orthogonal to the
However, we need an expression for the gradient flow of E(.), that is, the partial derivatives indicating how the contour is changing. In order to do so, we need to compute the derivative of E(.), denoted by E_t(.), with respect to the contour. This can usually be done by exploiting Green's theorem (see Prob. 3.2). Setting a = 0, b = 1:

E_t(\Gamma) = \frac{1}{2} \int_0^1 (F_t \cdot J\Gamma_p + F \cdot J\Gamma_{pt})\, dp   (3.6)
where Γ_{pt} is the derivative of Γ_p. Integrating the second term by parts and applying the chain rule to the derivatives of F, we remove Γ_{pt}:

E_t(\Gamma) = \frac{1}{2} \int_0^1 \big((DF)\Gamma_t \cdot J\Gamma_p - (DF)\Gamma_p \cdot J\Gamma_t\big)\, dp   (3.7)
and exploiting the fact that the matrix [.] is antisymmetric (A is antisym-
metric when A = −AT ), its form must be ωJ due to the antisymmetric
operator J:
E_t(\Gamma) = \frac{1}{2} \int_0^1 (\Gamma_t \cdot \omega J\Gamma_p)\, dp = \frac{1}{2} \oint_\Gamma (\Gamma_t \cdot \omega J\Gamma_p)\, ds   (3.9)
which implies that the gradient flow (contour motion equation) is given by
Γt = f n (3.11)
n_{k,k−1} and n_{k+1,k} being the outward normals to the edges ⟨V_{k−1}, V_k⟩ and ⟨V_k, V_{k+1}⟩, respectively. These equations allow one to move the polygon as shown in Fig. 3.1 (top). What is quite interesting is that the choice of a speed function f adds flexibility to the model, as we will see in the following sections.
The key element in the vertex dynamics described above is the definition of
the speed function f (.). Here, it is important to stress that such dynamics
are not influenced by the degree of visibility of an attractive potential like
the gradient in the classical snakes. On the contrary, to avoid such myopic
behavior, region-based active contours are usually driven by statistical forces.
More precisely, the contour Γ encloses a region RI (inside) of image I whose
complement is RO = I \ RI (outside), and the optimal placement Γ ∗ is the
one yielding homogeneous/coherent intensity statistics for both regions. For
instance, assume that we are able to measure m statistics Gj (.) for each region,
and let uj and vj , j = 1, . . . , m be respectively the expectations E(Gj (.))
of such statistics for the inside and outside regions. Then, the well-known
functional of Chan and Vese [37] quantifies this rationale:
E(\Gamma) = \frac{1}{2|I|} \sum_{j=1}^{m} \left( \int_{R_I} (G_j(I(x, y)) - u_j)^2\, dx\, dy + \int_{R_O} (G_j(I(x, y)) - v_j)^2\, dx\, dy \right)   (3.14)
Fig. 3.1. Segmentation with active polygons. Top: simple, un-textured object. Bottom: complex textured object. (Figure by G. Unal, A. Yezzi and H. Krim, © 2005 Springer.)
which is only zero when both terms are zero. The first term is nonzero when
the background (outer) intensity/texture dominates the interior of the con-
tour; the second term is nonzero when the foreground (inner) texture domi-
nates the interior of the contour; and both terms are nonzero when there is
an intermediate domination. Thus, the latter functional is zero only in the
optimal placement of Γ (no-domination).
From the information-theory point of view, there is a similar way of for-
mulating the problem. Given N data populations (in our case N = 2 inten-
sity/texture populations: in and out), their disparity may be quantified by the
generalized Jensen–Shannon divergence [104]:
JS = H\left(\sum_{i=1}^{N} a_i\, p_i(\xi)\right) - \sum_{i=1}^{N} a_i\, H(p_i(\xi))   (3.15)

H(.) being the entropy, a_i the prior probabilities of each class with \sum_{i=1}^{N} a_i = 1, and p_i(\xi) the ith pdf (corresponding to the ith region) of the random variable ξ (in this case, the pixel intensity I). Considering Jensen's inequality
\gamma\!\left(\frac{\sum_i a_i x_i}{\sum_i a_i}\right) \le \frac{\sum_i a_i\, \gamma(x_i)}{\sum_i a_i}   (3.16)
γ(.) being a convex function and the a_i positive weights, it turns out that, if X is a random variable, γ(E(X)) ≤ E(γ(X)). In the two latter inequalities, changing ≤ to ≥ turns convexity into concavity. Let then P be the mixture of distributions P = \sum_i a_i p_i. Then, we have
H(P) = -\int P \log P\, d\xi
      = -\int \left(\sum_i a_i p_i\right) \log \left(\sum_i a_i p_i\right) d\xi
      = \sum_i a_i \int p_i \log \frac{1}{P}\, d\xi
      = \sum_i a_i \left( \int p_i \log \frac{1}{p_i}\, d\xi + \int p_i \log \frac{p_i}{P}\, d\xi \right)
      = \sum_i a_i \big( H(p_i) + D(p_i \| P) \big)   (3.17)
and exploiting Jensen’s inequality (Eq. 3.16), we obtain the concavity of the
entropy. In addition, the KL-based definition of JS provides an interpreta-
tion of the Jensen–Shannon divergence as a weighted sum of KL divergences
between individual distributions and the mixture of them.
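A direct numerical rendering of Eq. 3.15 for discrete (histogram) approximations of the region densities is given below; it is a sketch for intuition, not the efficient computation derived later via the maximum entropy approximation.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = np.asarray(p, float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def jensen_shannon(dists, weights):
    """Generalized JS divergence of Eq. 3.15: entropy of the mixture minus
    the weighted sum of the individual entropies."""
    dists = [np.asarray(p, float) for p in dists]
    mixture = sum(a * p for a, p in zip(weights, dists))
    return entropy(mixture) - sum(a * entropy(p) for a, p in zip(weights, dists))

# Two-region example (N = 2): identical histograms give JS = 0.
print(jensen_shannon([[0.5, 0.5], [0.1, 0.9]], [0.5, 0.5]))
```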
For our particular case with N = 2 (foreground and background distributions), it is straightforward to see that JS has many similarities with Eq. 3.14 in terms of providing a useful divergence for contour-driving purposes. Furthermore, JS goes theoretically beyond the Chan and Vese functional, because the entropy quantifies higher-order statistical interactions. The main problem here is that the densities p_i(ξ) must be estimated or, at least, properly
approximated. Thus, the theoretical bridge between JS and the Chan and Vese functional is provided by the following optimization problem:

p^*(\xi) = \arg\max_{p(\xi)} \left\{ -\int p(\xi) \log p(\xi)\, d\xi \right\}
\text{s.t. } \int p(\xi) G_j(\xi)\, d\xi = E(G_j(\xi)), \quad j = 1, \ldots, m
\qquad \int p(\xi)\, d\xi = 1   (3.20)
where the density estimate p^*(\xi) is the maximum-entropy one constrained to satisfy the expectation equations for each of the statistics. Thus, this is the first application in the book of the principle of maximum entropy [83]: given some information about the expectations of a set of statistics characterizing a distribution (this is ensured by the satisfaction of the first constraint), our choice of the pdf corresponding to such a distribution (the second constraint ensures that the choice is a pdf) must be the most neutral one among all the p(ξ) satisfying the constraints, that is, the most uninformed one, which is equivalent to the maximum entropy distribution. Otherwise, our choice would be a pdf with more information than we have actually specified in the expectation constraints, which indicate all that is available for making the inference: our measurements E(G_j(\xi)).
The shape of the optimal pdf is obtained by constructing and differentiating the Lagrangian (assuming the natural logarithm):

L(p, \Lambda) = -\int p(\xi) \log p(\xi)\, d\xi + \lambda_0 \left( \int p(\xi)\, d\xi - 1 \right) + \sum_{j=1}^{m} \lambda_j \left( \int p(\xi) G_j(\xi)\, d\xi - E(G_j(\xi)) \right)
\frac{\partial L}{\partial p} = -\log p(\xi) - 1 + \lambda_0 + \sum_{j=1}^{m} \lambda_j G_j(\xi)   (3.21)
The orthogonalization assumption yields the linearization of the constraints. For instance, for \int p^*(\xi)\, d\xi = 1, we have

1 = \tilde{Z}^{-1}\Big( \underbrace{\int \varphi(\xi)\, d\xi}_{1} + \lambda_{m+1}\underbrace{\int \varphi(\xi)\,\xi\, d\xi}_{0} + \big(\lambda_{m+2} + \tfrac{1}{2}\big)\underbrace{\int \varphi(\xi)\,\xi^2\, d\xi}_{1} + \sum_{j=1}^{m} \lambda_j \underbrace{\int \varphi(\xi) G_j(\xi)\, d\xi}_{0} \Big)   (3.28)
therefore

1 = \tilde{Z}^{-1}\left(1 + \lambda_{m+2} + \tfrac{1}{2}\right)   (3.29)
and, similarly, we also obtain

\int p^*(\xi)\,\xi\, d\xi = \tilde{Z}^{-1}\lambda_{m+1} = 0
\int p^*(\xi)\,\xi^2\, d\xi = \tilde{Z}^{-1}\big(1 + 3(\lambda_{m+2} + 1/2)\big) = 1
\int p^*(\xi)\,G_j(\xi)\, d\xi = \tilde{Z}^{-1}\lambda_j = E(G_j(\xi)), \quad j = 1, \ldots, m   (3.30)
where ν = ϕ(ξ). At this point of the section, we have established the math-
ematical basis for understanding how to compute efficiently the Jensen–
Shannon (JS) divergence between two regions (N = 2 in Eq. 3.15):
JS = H(a1 p∗1 (ξ) + a2 p∗2 (ξ)) − a1 H(p∗1 (ξ)) − a2 H(p∗2 (ξ)) (3.33)
H(P) \approx H(\nu) - \frac{1}{2}\sum_{j=1}^{m}(a_1 u_j + a_2 v_j)^2   (3.35)
\;=\; \frac{1}{2}\, a_1 a_2 \sum_{j=1}^{m}(u_j - v_j)^2   (3.36)
\widehat{\nabla JS} = \frac{\partial \Gamma}{\partial t} = \frac{1}{2}\sum_{j=1}^{m}\frac{|R_I||R_O|}{|I|^2}\big(2(u_j - v_j)(\nabla u_j - \nabla v_j)\big)\mathbf{n}
 + \underbrace{\frac{1}{2}\sum_{j=1}^{m}\frac{|R_I|}{|I|^2}(u_j - v_j)^2\,\mathbf{n} - \frac{1}{2}\sum_{j=1}^{m}\frac{|R_O|}{|I|^2}(u_j - v_j)^2\,\mathbf{n}}_{\frac{1}{2}\sum_{j=1}^{m}\frac{|R_I| - |R_O|}{|I|^2}(u_j - v_j)^2\,\mathbf{n}}   (3.37)
\nabla u_j = \frac{G_j(I(x, y)) - u_j}{|R_I|}\,\mathbf{n}, \qquad \nabla v_j = -\frac{G_j(I(x, y)) - v_j}{|R_O|}\,\mathbf{n},   (3.38)
and n the outward unit normal. Then, replacing the latter partial derivatives
in Eq. 3.37, taking into account that |I| = |RI | + |RO |, and rearranging terms,
we finally obtain
\widehat{\nabla JS} = \frac{\partial \Gamma}{\partial t} = f\,\mathbf{n}   (3.39)
where
f = \frac{1}{2|I|}\sum_{j=1}^{m}(u_j - v_j)\big((G_j(I(x, y)) - u_j) + (G_j(I(x, y)) - v_j)\big)   (3.40)
f being the gradient flow of the Chan and Vese functional (Eq. 3.14). Thus f is defined in the terms described above, and the connection between JS and the contour dynamics is established. This f is the one used in Fig. 3.1 (bottom) when we use generator functions such as G_1(\xi) = \xi e^{-\xi^2/2} and G_2(\xi) = e^{-\xi^2/2}.
B_k^M(t) = \frac{t - t_k}{t_{k+M-1} - t_k} B_k^{M-1}(t) + \frac{t_{k+M} - t}{t_{k+M} - t_{k+1}} B_{k+1}^{M-1}(t)
B_k^1(t) = \begin{cases} 1 & \text{if } t_k \le t \le t_{k+1} \\ 0 & \text{otherwise} \end{cases}   (3.41)
whose support is [t_k, t_{k+M}] and which are smoothly joined at the knots, that is, M − 2 continuous derivatives exist at the joints, and the (M − 2)th derivative must be equal on both sides of each joint (C^{M−2} continuity). B-splines satisfy B_k^M(t) ≥ 0 (non-negativity), \sum_k B_k^M(t) = 1 for t ∈ [t_M, t_{N_B}] (partition of unity), and periodicity. Finally, the N_B B-splines define a nonorthogonal basis for a linear space.
With regard to the latter, one-dimensional functions are uniquely defined as

y(t) = \sum_{k=0}^{N_B - 1} c_k B_k^M(t), \quad t \in [t_{M-1}, t_{N_B}]   (3.42)

where c_k ∈ R are the so-called control points; the kth control point influences the function only for t_k < t < t_{k+M}. If we have N_B = K − M control points and K knots, then the order of the polynomials of the basis is exactly K − N_B = M, that is, we have B^M (for instance, M = 4 yields cubic B-splines).
For instance, in Fig. 3.2 (left), we have a B-spline composed of N_B = 7 cubic (M = 4) basis functions and K = 11 knots (K − N_B = M), where t_0 = t_1 = t_2 = t_3 = 0, t_4 = 1.5, t_5 = 2.3, t_6 = 4, and t_7 = t_8 = t_9 = t_{10} = 5. All the control points are 0 except c_3 = 1. Therefore, we have a plot of B_3^4(t). Furthermore, as both 0 and 5 have multiplicity 4, in practice we have
Fig. 3.2. Left: a degenerate one-dimensional example y(t) = B_3^4(t), with the four cubic polynomials P_1 . . . P_4 describing it. Right: a 2D B-spline basis with M = 4 (cubic) for describing a contour (in bold). Such a continuous contour is given by the interpolation of N = 11 samples (first and last coincide), indicated by ∗ markers. There are N_B = 11 control points, two of which are (0, 0)^T; control points are indicated by circular markers, and we also draw the control polygon.
only NB = 4 basis functions. If we do not use the B-form, we can also see
the four polynomials P1 . . . P4 used to build the function and what part of
them is considered within each interval. As stated above, the value of y(t) is
0 for t < t0 = 0 and t > t10 = 5, and the function is defined in the interval
[tM −1 = t3 = 0, tNB = t7 = 5].
Following [58], we make the cubic assumption and drop M for the sake of clarity. Thus, given a sequence of N 2D points Γ = {(x_i, y_i)^T : i = 0, 1, . . . , N − 1}, and imposing several conditions, such as the periodicity cited above, the contour Γ(t) = (x(t), y(t))^T is inferred through cubic-spline interpolation, which yields both the knots and the 2D control points c_k = (c_k^x, c_k^y)^T. Consequently, the basis functions can be obtained by applying Eq. 3.41. Therefore, we have

\Gamma(t) = \begin{pmatrix} x(t) \\ y(t) \end{pmatrix} = \sum_{k=0}^{N_B - 1} c_k B_k(t) = \sum_{k=0}^{N_B - 1} \begin{pmatrix} c_k^x \\ c_k^y \end{pmatrix} B_k(t), \quad t \in [t_{M-1}, t_{N_B}]   (3.43)
In Fig. 3.2 (right), we show the continuous 2D contour obtained by interpolat-
ing N = 11 samples with a cubic B-spline. The control polygon defined by the
control points is also shown. The curve follows the control polygon closely.
Actually, the convex hull of the control points contains the contour. In this
case, we obtain 15 nonuniformly spaced knots, where t0 = . . . = t3 = 0 and
t11 = . . . = t15 = 25.4043.
In the latter example of 2D contour, we have simulated the period-
icity of the (closed) contour by adding a last point equal to the first.
Periodicity in B-splines is ensured by defining bases satisfying BM k (t) =
+∞
B M
j=−∞ k+j(tK −t0 ) (t) : j ∈ Z, being t K − t 0 the period. If we have K+1
knots, we may build a B-splines basis of K functions B0 , . . . , BK−1 simply by
constructing B0 and shifting this function assuming a periodic knot sequence,
that is, tj becomes tjmodK . Therefore, tK = t0 , so we have K distinct knots.
Here, we follow the simplistic assumption that knots are uniformly spaced.
The periodic basis functions for M = 2 and M = 4 are showed in Fig. 3.3. In
the periodic case, if we set K, a closed contour can be expressed as
K−1
x(t)
Γ (t) = = ck Bk (t) t ∈ R (3.44)
y(t)
k=0
0.6
0.4
B(t)
B(t)
0.5
0.4 0.3
0.3 0.2
0.2
0.1
0.1
0 0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
tk tk
Fig. 3.3. Periodic basic functions for M = 2 (left) and for M = 4 (right). In both
cases, the K = 5, distinct knots are t0 = 1 . . . t4 = 4, and we show the K − 1
functions; those with a nonzero periodic fragment are represented in dashed lines.
B†(K) being the pseudo-inverse, and the inverse (BT(K) B(K) )−1 always exists.
Furthermore, an approximation of the original x can be obtained by
x x
x̂(K) = B(K) θ̂ (K) = B(K) B†(K) θ̂ (K) (3.48)
B⊥
(K)
where B⊥(K) is the so-called projection matrix because x̂(K) is the projection
of x onto the range space of K dimensions of B(K) : R(B(K) ). Such space is
3.3 MDL in Contour-Based Segmentation 57
1
0.2
0.5
0
0 −0.2
9 9
8 8
7 7
6 6
5 5
4 4
3 3 0
0 40 20
2
40 20 2 80 60
80 60 120 100
1 140 120 100 1 140
180 160 180 160
0.8
0.8
0.7 K=N−1
K=19
0.7 K=9
0.6
0.6
0.5
0.5
yi
y(t)
0.4
0.4
0.3
0.3
0.2 0.2
0.1 0.1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
xi x(t)
Fig. 3.4. Top: structure of B(K) (left) for N = 174 samples and K − 1 = 9 basis,
and structure of its orthogonal matrix spanning the range space. In both cases, each
row is represented as a strip. Bottom: contours inferred for different values of K(see
text).
the one expanded by an orthonormal basis with the same range of B(K) . The
structure of both B(K) and the corresponding orthonormal matrix is showed
in Fig. 3.4 (top-left and top-right), respectively.
Applying to y and θ x(K) the above rationale and fusing the resulting equa-
tions, we have that
x y
θ̂ (K) = (θ̂ (K) θ̂ (K) ) = (B†(K) x B†(K) y) = B†(K) Γ
Γ̂ = (x̂ ŷ) = (B⊥ ⊥ ⊥
(K) x B(K) y) = B(K) θ̂ (K) (3.49)
smoothed, with K = 9. Thus, two key questions remain: What is the optimal
value of K?, and What is the meaning of optimal in the latter question? As
we will see next, information theory will shed light on these questions.
and conversely, given a set of codewords satisfying this inequality, there exists
a prefix code with these lengths. Moreover, when there is a probability dis-
tribution associated to the alphabet P (a), the codewords’ lengths
satisfying
the Kraft inequality and minimizing the expected length a∈A (a)LC (a)
P
are LC (a) = − logD P (a) = logD P (a)
1
, and the expected length is exactly
the entropy under logD if we drop the rounding up to achieve integer lengths.
Such code, the Shannon–Fano one (see the allocation of codewords in [43] –
pp. 101–103), allows to relate lengths and probabilities, which, in turn, is
key to understand MDL in probabilistic terms. Therefore, Eq. 3.50 can be
rewritten in the following terms:
2
This is, in the Grünwald terminology [67], the crude two-part version of MDL.
3.3 MDL in Contour-Based Segmentation 59
(3.55)
although we may rewrite L(θ (K) , σx2 , σy2 ) = L(θ (K) ) if we assume that the
variances have constant description lengths. The MDL model order selection
problem consists in estimating
! "
K ∗ = arg min L(θ (K) ) + min
2
min
x
(− log P (x|θ x(K) , σx2 ))
K σx θ (K)
! "
+ min min (− log P (y|θ y(K) , σy2 )) (3.56)
2σy θy
(K)
where σ̂z2 (K) = arg minσz2 f (σz2 ) ≡ ||z − ẑ(K) ||2 /N , which is consistent with
being the optimal variance dependant on the approximation error. Therefore,
defining σ̂x2 (K) and σ̂y2 (K) in the way described below, we have
N
K ∗ = arg min L(θ (K) ) + (log(2πσ̂x2 (K)e) + log(2πσ̂y2 (K)e))
K 2
N
= arg min L(θ (K) ) + (log(σ̂x2 (K)) + log(σ̂y2 (K)))
K 2
% & '(
= arg min L(θ (K) ) + N log σ̂x2 (K)σ̂y2 (K) (3.58)
K
with ξ = 1 for discrete curves, reflects the fact that an increment of image
dimensions is translated into a smaller fitting precision: the same data in a
larger image need less precision. Anyway, the latter definition of λ imposes a
logarithmically smoothed penalization to the increment of model order.
Given the latter B-spline adaptive (MDL-based) model for describing con-
tours, the next step is how to find their ideal placement within an image,
for instance, in the ill-defined border of a ultrasound image. Given an image
I of Wx × Wy pixels containing an unknown contour Γ = B(K) θ (K) , pixel
intensities are the observed data, and, thus, their likelihood, given a contour
(hypothesis or model), is defined as usual P (I|θ (K) , Φ), Φ being the (also un-
known) parameters characterizing the intensity distribution of the image. For
3.3 MDL in Contour-Based Segmentation 61
where Ip is the intensity at pixel p = (i, j), and I(Γ ) and O(Γ ) denote,
respectively, the regions inside and outside the closed contour Γ . Therefore,
the segmentation problem can be posed in terms of finding
) *
(θ̂ K ∗ , Φ̂) = arg min − log P (I|θ (K) , Φ) + K log(Wx W y) (3.61)
K,θ (K) ,Φ
Algorithm 2: GPContour-fitting
Input: I, K, Φ, a valid contour Γ̂ (0) ∈ R(B(K) ), and a stepsize
Initialization Build B(K) , compute B⊥ (K) , and set t = 0.
while ¬ Convergence(Γ̂ (t) ) do
Compute the gradient: δΓ ← ∇ log P (I|θ (K) , Φ)|Γ =Γ̂ ( t)
Project the gradient onto R(B(K) ): (δΓ )⊥ ← B⊥ (K) δΓ
Update the contour (gradient ascent): Γ̂ (t+1) ← Γ̂ (t) + (δΓ )⊥
t←t+1
end
Output: Γ̂ ← Γ̂ (t)
62 3 Contour and Region-Based Image Segmentation
Algorithm 3: MLIntensity-inference
Input: I, K, and a valid contour Γ̂ (0) ∈ R(B(K) )
Initialization Set t = 0.
while ¬ Convergence(Φ̂(t) , Γ̂ (t) ) do
Compute the ML estimation
% Φ̂(t) given Γ̂ (t) : (
(t)
Φ̂in = arg maxΦin p∈I(Γ̂ (t) ) P (Ip |θ (K) , Φin )
% (
(t)
Φ̂out = arg maxΦout p∈O(Γ̂ (t) ) P (Ip |θ (K) , Φout )
Contour-fitting Algorithm
that is, we must maximize the likelihood, but the solutions must be con-
strained to those contours of the form Γ = B(K) θ (K) , and thus, belong to
the range space of B(K) . Such constrained optimization method can be solved
with a gradient projection method (GPM) [20]. GPMs consist basically in
projecting successively the partial solutions, obtained in the direction of the
gradient, onto the feasible region. In this case, we must compute in the t−th
iteration the gradient of the likelihood δΓ = ∇ log P (I|θ (K) , Φ)|Γ =Γ̂ ( t) . Such
gradient has a direction perpendicular to the contour at each point of it (this
is basically the search direction of each contour point, and the size of this
window one-pixel wide defines the short-sightness of the contour). Depend-
ing on the contour initialization, and also on , some points in the contour
may return a zero gradient, whereas others, closer to the border between the
foreground and the background, say p pixels, may return a gradient of mag-
nitude p (remember that we are trying to maximize the likelihood along the
contour). However, as this is a local computation for each contour point, the
global result may not satisfy the constraints of the problem. This is why δΓ is
projected onto the range space through (δΓ )⊥ = B⊥ (K) δΓ , and then we apply
the rules of the usual gradient ascent. The resulting procedure is in Alg. 2.
Intensity-inference Algorithm
This second algorithm must estimate a contour Γ̂ in addition to the region pa-
rameters Φ, all for a fixed K. Therefore, it will be called Alg. 3. The estimation
of Φ depends on the image model assumed. For instance, if it is Gaussian, then
3.4 Model Order Selection in Region-Based Segmentation 63
2 2
we should infer Φin = (μin , σin ) and Φout = (μout , σout ); this is easy to do if
we compute these parameters from the samples. However, if the Rayleigh dis-
tribution is assumed, that is, P (Ip |θ (K) , Φ = σ 2 ) = (Ip /σ 2 ) · exp{−Ip /(2σ 2 )}
(typically for modelling speckle noise in ultrasound images), then we should
2 2
infer Φin = (σin ) and Φout = (σout ) (also from the samples). Given Alg. 3, we
are able to obtain both θ̂ (K) and Φ̂ for a fixed K. Then, we may obtain the
second term of Eq. 3.62:
) *
max log P (I|θ (K) , Φ) = log P (I|θ̂ (K) , Φ̂)
θ (K) ,Φ
|I(Γ̂ )| |O(Γ̂ )|
∝− 2
log(σ̂in (K)) − 2
log(σ̂out (K))
2 2
1 % (
=− 2
log(σ̂in 2
(K)σ̂out (K))|I|
2
1
= 2 2 (K))Wx Wy
(3.64)
log(σ̂in (K)σ̂out
Once we have an algorithm for solving a fixed K, the MDL solution may be
arranged as running this algorithm for a given range of K and then selecting
the K ∗ . Thus, it is important to exploit the knowledge not only about the type
of images to be processed (intensity), but also about the approximate com-
plexity of their contours (in order to reduce the range of exploration for the
optimal K). In Fig. 3.5, we show some summarizing results of the technique
described above.
Fig. 3.5. Results of the MDL segmentation process. Top: synthetic image with
same variances and different means between the foreground and background, and
the optimal K (Courtesy of Figueiredo). Bottom: experiments with real medical
images and their optimal K. In both cases, initial contours are showed in dashed
lines. Figure by M.A.T. Figueiredo, J.M.N. Leitao and A.K. Jain (2000
c IEEE).
3.4 Model Order Selection in Region-Based Segmentation 65
0.5
0.4
0.3
0.2
0.5
0.4
0.3
0.2
0.5
0.4
0.3
0.2
0.5
0.4
0.3
0.2
The global model I0 (x; li , θi ) of the signal I(x), or “world scene” is therefore
described by the vector of random variables:
Assuming that
• In this 1D example the individual likelihoods for each region I0 (x, li , θi ),
xi − 1 ≤ x < xi decay exponentially with the squared error between the
model and the actual signal:
& xi '
− 2σ12 xi −1 (I(x)−I0 (x;li ,θi ))
2
dx
P (I|li , θi ) = e (3.66)
The maximum a posteriori (MAP) solution comes from maximizing the pos-
terior probability (Eq. 3.67). In energy minimization terms, the exponent of
the posterior is used to define an energy function:
k
1 xi k
E(W ) = (I(x) − I0 (x; l i , θ i ))2
dx + λ 0 k + λ |θi | (3.68)
2σ 2 i=1 xi−1 i=1
Reversible jumps
Jump of ’split’ type, iteration 459 Jump of ’split’ type, iteration 460
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Fig. 3.8. A jump of “split” type. Left: plot of W and I. Right: plot of W and I.
68 3 Contour and Region-Based Image Segmentation
Table 3.1. Energies E(W ) and probabilities P (W |I) of the destination states of
the jumps from state W (Eq. 3.71 and Fig. 3.8, left). The jumps considered are
splitting region 1, splitting region 2, merging regions 1 and 2, and remaining in the
same state.
W Wactual Wsplit1 Wsplit2 Wmerge12
E(W ) 0.1514 0.1987 0.0808 13.6823 14.1132
P (W |I) 0.2741 0.2089 0.5139 0.0030 1.0000
For the signal I(x) of the example, we obtain the energy values E(W ) = 0.0151
and E(W ) = 0.0082. The probabilities p(W |I) and p(W |I) are given by the
normalization of the inverse of the energy. There are several possible moves
from the state W : splitting region 1, splitting region 2, merging regions 1
and 2, or remaining in the same state. In Table 3.1, the energies and prob-
abilities of the considered destination states are shown. It can be seen that,
considering uniform densities q(ψ|n) and q(φ|m), the most probable jump is
to the state in which region 2 is split.
Stochastic diffusions
dE(W )
dψ(t) = − dt + 2T (t)N (0, (dt)2 ) (3.74)
dψ
Given this definition, let us obtain the motion equations for the variables of
the toy-example, which are the change points xi and the parameters θi = (a, b)
of the linear region models. We have to obtain the expression of the derivative
of the energy E(W ) with respect to the time t for each one of these variables.
Then, the motion equation for a change point xi is calculated as
null. On the other hand, the summation adds k terms and only two of them
contain xi , the rest are independent, so they are also null in the derivative.
For compactness, let us denote the error between model and signal as f and
its indefinite integral as F :
fi (x) = (I(x) − I0 (x; li , θi ))2
(3.76)
Fi (x) = (I(x) − I0 (x; li , θi ))2 dx
Then the derivative of the energy is calculated as
k
dE(W ) d 1 xi
= fi (x)dx + c
dxi dxi 2σ 2 i=1 xi−1
xi xi+1
1 d
= ··· + fi (x)dx + fi (x)dx + · · ·
2σ 2 dxi xi−1 xi
1 d (3.77)
= 2
(Fi (xi ) − Fi (xi+1 ) + Fi+1 (xi+1 ) − Fi+1 (xi ))
2σ dxi
1 d
= (Fi (xi ) − Fi+1 (xi ))
2σ 2 dxi
1 + ,
= 2
I(xi ) − I0 (xi ; li , θi ))2 − I(xi ) − I0 (xi ; li−1, θi−1 ))2
2σ
Finally, the expression obtained for dE(W
dxi
)
is substituted in the xi motion
equation, Eq. 3.75. The resulting equation is a 1D case of the region compe-
tition equation [184], which also moves the limits of a region according to the
adjacent region models’ fitness to the data.
The motion equations for the θi parameters have an easy derivation in the
linear model case. The motion equation for the slope parameter ai results in
dai (t) dE(W )
= + 2T (t)N (0, 1)
dt dai
k
1 d xi
= 2
(I(x) − ai x + bi )2 dx + 2T (t)N (0, 1) (3.78)
2σ dai i=1 xi−1
1 + ,
xi
= 2
2 ax2 + bx − xI(x) + 2T (t)N (0, 1)
2σ x=x
i−1
In Fig. 3.9 the result of applying the motion equations over the time t,
or number of iterations is represented. It can be seen that motion equations
3.4 Model Order Selection in Region-Based Segmentation 71
Diffusion of ’b’ parameter, iteration 10 Diffusion of ’a’ parameter, iteration 30
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Diffusion of both ’a’ and ’b’ parameters, iteration 50 Diffusion of the limits, iteration 50
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Fig. 3.9. Diffusion of ‘b’, ‘a’, both parameters together and the limits.
modify the parameters in the direction that approximates the model closer to
the data. The parameters increments get larger when the model is far from
the data and smaller when the model is close to the data.
As we already explained, diffusions only work in a definite subspace Ωn
and jumps have to be performed in some adequate moment, so that the pro-
cess does not get stuck in that subspace. In the jump-diffusion algorithm,
the jumps are performed periodically over time with some probability. This
probability, referred to the waiting time λ between two consecutive jumps,
follows a Poisson distribution, being κ the expected number of jumps during
the given interval of time:
λκ e−λ
fP oisson (κ; λ) = (3.80)
κ!
A simulation of the random jumps with Poisson probability distribution in
time is shown in Fig. 3.10.
0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
time (iterations)
Data-driven techniques
In [159], the data-driven Markov chain Monte Carlo scheme is explained and
exploited. The idea is to estimate the parameters of the region models, as well
as their limits or changepoints, according to the data that these models have
to describe. These estimations consist of bottom-up strategies, for example,
a simple strategy for estimating the changepoints xi is to place them where
edges are detected. Actually, taking edges as those points with a gradient
above some threshold is not a good strategy. A better way is to take the
edgeness measure as probabilities, and sample this distribution for obtaining
the changepoints. For example, in Fig. 3.11, we show the edgeness of the
toy-example data. If a split has to be performed of a region in the interval
[0.2, 0.4) to two new region models for the intervals [0.2, x) and [x, 0.4), the
new changepoint x would be placed with a high probability on 0.3 because the
edgeness corresponding to this interval defines such probability distribution.
The estimation of the parameters θi of the linear model of the toy-example
can be performed by taking the slope and the intersect of the most voted line
of the Hough transformation of the underlying data. However, a probabilis-
tic approach is more suitable for the jump-diffusion process. It consists of
computing the importance proposal probability by Parzen windows centered
at the lines of the Hough transformation. When a new model is proposed in
the interval [xi−1 , xi ), its important proposal probability is
N
q(θi |li , [xi−1 , xi ) = ωj G(θi − θj ) (3.83)
j=1
Jump−diffusion convergence
Pure generative
Data−driven (edges)
Data−driven (Hough transform)
Data−driven (Hough and edges)
Energy
Algorithm 4: K-adventurers
Input: I, successive solutions (ωK+1 , WK+1 ) generated by a jump-diffusion
process
Initialize S ∗ˆ with one initial (ω1 , W1 ) solution K times
while ∃ (ωK+1 , WK+1 ) ← jump-diffusion do
S+ ← S ∗ˆ ∪ {(ωK+1 , WK+1 )}
for i = 1, 2, . . . , K + 1 do
S−i ← S+ /{(ωK+1 , WK+1 )}
p̂ ← S−i
di = D̂(p||p̂)
end
i∗ = arg min di
i
S ∗ˆ ← S−i∗
end
Output: S ∗ˆ
over time and there will be a convergence to the p distribution. The num-
ber of iterations necessary for a good approximation of S ∗ depends on the
complexity of the search space. A good stop criterion is to observe whether
S ∗ undergoes important changes, or on the contrary, remains similar after a
significant number of iterations.
In each while iteration, a for sentence iterates the estimation of the KL-
divergence D̂(p||p̂−i ) between p and each possible set of K solutions, consid-
ering the new one, and subtracting one of the former. The interesting point
is the estimation of the divergence, provided that p consists of a solutions set
whose size increases with each new iteration. The idea which Tu and Zhu pro-
pose in [159] is to represent p(W |I) by a mixture of N Gaussians, where N is
the number of solutions returned by the jump-diffusion process, which is the
same as the number of iterations of the while sentence. These N solutions are
partitioned into K disjoint groups. Each one of these groups is represented by
a dominating solution, which is the closest one to some solution from the S ∗
set of selected solutions. The name of the algorithm is inspired by this basic
idea: metaphorically, K adventurers want to occupy the K largest islands in
an ocean, while keeping apart from each other’s territories.
As shown in Eq. 3.85, the probability distributions are modelled with a sum
of Gaussians centered at the collected solutions. These solutions are largely
separated because of the high dimensionality of the solutions space. This is the
reason for forming groups with dominating solutions and ignoring the rest of
the solutions. More formally, for selecting K << N solutions from the initial
set of solutions S0 , a mapping function from the indexes of S ∗ˆ to the indexes
of S0 is defined τ : {1, 2, . . . , K} → {1, 2, . . . , N }, so that
In the experiments performed in [159], the same variance is assumed for all
Gaussians.
Given the former density model, the approximation of D(p||p̂) can be
defined. From the definition of the KL-divergence, we have
N
p(W )
D(p||p̂) = p(W ) log dW
n=1 Dn p̂(W )
N
N
= ωi G(W − Wi ; σ 2 )
n=1 Dn i=1
N
i=1 ωi G(W − Wi ; σ )
2
× log N dW (3.90)
j=1 ωτ (j) G(W − Wτ (j) ; σ )
N 1 2
j=1 ωτ (j)
Provided that the energy of each mode p(Wi |I) is defined as E(Wi ) =
− log p(Wi ), the approximation of D(p||p̂) is formulated as
N
D̂(p||p̂) = ωn G(W − Wn ; σ 2 )
#
n=1 Dn
$
N ωn G(W − Wn ; σ 2 )
· log j=1 ωτ (j) + log dW
⎡ ωτ (c(n)) G(W − Wτ (c(n)) ; σ 2 ) ⎤
N N
ωn (Wn − Wτ (c(n)) )2
= ⎣log ωτ (j) + log + ⎦
ω 2σ 2
n=1 j=1 τ (c(n))
N N # $
(Wn − Wτ (c(n)) )2
= log ωτ (j) + ωn E(Wτ (c(n)) − E(Wn ) +
j=1 n=1
2σ 2
(3.92)
actual data. When several region models are considered, the model type has
to be considered in the distance measure too. Also, the number of regions is
another important term to be considered in the measure.
In Fig. 3.13, we illustrate a solution space of the 1D toy-example. The
K = 6 selected solutions are marked with red rectangles. These solutions are
not necessarily the best ones, but they represent the probability distribution of
the solutions yielded by the jump-diffusion process. In Fig. 3.14, the energies
of these solutions are represented. Finally, in Fig. 3.15 the models of the
← solution 100
← solution 86
←solution 76
← solution 58
← solution 36
← solution 14
180
160
140
120
100
80
60
40
# solution 20
0
Fig. 3.13. The solution space in which the K-adventurers algorithm has selected
K = 6 representative solutions, which are marked with rectangles. See Color Plates.
↑ ↑ ↑ ↑ ↑ ↑
14 36 58 76 86 100
Fig. 3.14. The energies of the different solutions. The representative solutions se-
lected by the K-adventurers algorithm are marked with arrows.
3.5 Model-Based Segmentation Exploiting The Maximum Entropy Principle 79
Solution #14 Solution #36
Fig. 3.15. Result of the K-adventurers algorithm for K = 6: the six most represen-
tative solutions of the solution space generated by jump-diffusion.
In Section 3.2, we have introduced the maximum entropy (ME) principle and
how it is used to find the approximated shape, through the less biased pdf,
given the statistics of the sample. Here, we present how to use it for segmenting
parts of a given image whose colors are compatible with a given ME model.
Therefore, ME is the driving force of learning the model from the samples
and their statistics. Such model is later used in classification tasks, such as
the labeling of skinness (skin-color) of pixels/regions in the image [84]. This
is a task of high practical importance like blocking adult images in webpages
(see for instance [180]). A keypoint in such learning is the trade-off between
the complexity of the model to learn and its effectiveness in ROC terms. Let
X = {(Ii , yi ) : i = 1, . . . , M } be the training set, where Ii is a color image
that is labeled as yi = 1 if it contains skin, and yi = 0 otherwise. In [84], the
80 3 Contour and Region-Based Image Segmentation
q(xs , ys ) being the proportion of pixels in the training set with color xs and
skinness ys (two tridimensional histograms, one for each skinness, typically
quantized to 32 bins, or a four dimensional histogram in the strict sense).
Using a modified version of the usual expectation constraints in ME, we have
that
p(xs , ys ) = Ep [δxs (xs )δys (ys )] (3.94)
where δa (b) = 1 if a = b and 0 otherwise. Then, the shape of the ME distri-
bution is
p(x, y) = eλ0 + s∈S λ(s,xs ,ys ) . (3.95)
Thus, assuming q(xs , ys ) > 0, we find the following values for the multipliers:
λ0 = 0 and λ(s, xs , ys ) = log q(xs , ys ). Consequently, the ME distribution is
p(x, y) = q(xs , ys ) (3.96)
s∈S
Fig. 3.16. Skin detection results. Comparison between the baseline model (top-
right), the tree approximation of MRFs with BP (bottom-left), and the tree
approximation of the first-order model with BP instead of Alg. 5. Figure by B.
Jedynak, H. Zheng and M. Daoudi (2003
c IEEE). See Color Plates.
the four quantities: q(ys = 0, yt = 0), q(ys = 0, yt = 1), q(ys = 1, yt = 0), and
q(ys = 1, yt = 1). Here, the aggregation of horizontal and vertical quantities
yields an implicit assumption of isotropy, which is not an excessive unrealistic
simplification. Let D be the following model:
D : ∀ < s, t >∈ S × S p(ys = 0, yt = 0) = q(0, 0), p(ys = 1, yt = 1) = q(1, 1)
(3.98)
82 3 Contour and Region-Based Image Segmentation
∀ys ∈ {0, 1} , ∀yt ∈ {0, 1} , p(ys , yt ) = Ep [δys (ys )δyt (yt )] (3.99)
thus, the solution has the yet familiar exponential shape, but depending on
more multipliers enforcing the constraints, labels of neighbors < s, t > are
equal:
p(xs , ys )
p(xs |ys ) =
p(ys )
eλ0 +λ1 (s,xs ,ys ) g(s, ys )
=
eλ0 g(s, ys ) xs eλ1 (s,xs ,ys )
eλ1 (s,xs ,ys ) q(xs |ys )
= = (3.103)
xs q(xs |ys )
e λ1 (s,xs ,ys )
xs
because p(xs |ys ) = q(xs |ys ), and λ1 (s, xs , ys ) = log q(xs |ys ) when positivity
is assumed. Consequently, the resulting model is
p(x, y) ≈ q(xs |ys )e <s,t> a0 (1−ys )(1−yt )+a1 ys yt (3.104)
s∈S
1
p(y) = e[ <s,t> a0 (1−ys )(1−yt )+a1 ys yt ] (3.106)
Z(a0 , a1 )
being Z(a0 , a1 ) the normalization (partition) function
% (
Z(a0 , a1 ) = e[ <s,t> a0 (1−ys )(1−yt )+a1 ys yt ] (3.107)
y
Thus, the prior model p(y) enforces that two neighboring pixels have the same
skinness, which discards isolated points and, thus, smooths the result of the
classification. Actually, such model is a version of the well-known Potts model.
An interesting property of the latter model is that for any < s, t >, we
have p(ys = 1, yt = 0) = p(ys = 0, yt = 1).
q(xs , xt , ys , yt ) being the expected proportion of times in the training set that
the two 4-neighboring pixels have the realization (xs , xt , ys , yt ) independently
on their orientation. Therefore, the ME pdf must satisfy
p(xs , xt , ys , yt ) = Ep [δxs (xs )δxt (xt )δys (ys )δyt (yt )] (3.109)
whose evaluation requires six histograms of three dimensions. The above sim-
plifications result in the following model:
p∗ (xs , xt , ys , yt ) = Pλ ∩ C ∗ (3.116)
that is, by the pdfs satisfying the above constrains in C ∗ and having the form
1 [λ(xs ,ys )+λ(xs ,ys )+xt ,yt )+λ(xs −xt ,ys ,yt )]
Pλ = e (3.117)
Zλ
[λ(xs ,ys )+λ(xs ,ys )+xt ,yt )+λ(xs −xt ,ys ,yt )]
being Zλ = xs ,xt ,ys ,ys e the partition.
In the latter formulation, we have reduced significantly the number of mul-
tipliers to estimate, but estimation cannot be analytic. The typical solution
is to adapt to this context the iterative scaling method [47] (we will see an
advanced version in the last section on the book, where we will explain how
to build ME classifiers). The algorithm is summarized in Alg. 5.
Finally, there is an alternative mechanism based on belief propagation [177]
using Bethe trees. Such a tree is rooted at each pixel whose color we want
to infer, and we have trees Tk of different depths where k denotes the depth.
Here, we assume a 4 neighborhood so that the root generates a child for each
of its four neighbors. Then the children generate a node for each of its four
3.5 Model-Based Segmentation Exploiting The Maximum Entropy Principle 85
neighbors that is not yet assigned to a node, and so on. We may have the
following general description for a pairwise model:
p(y|x) = ψ(xs , xt , ys , yt ) φ(xs , ys ) (3.118)
<s,t> s∈S
that is, messages mab is sent from a to b by informing about the consistency of
a given label. These messages are initialized with value 1. There is one message
86 3 Contour and Region-Based Image Segmentation
per value of the label at a given node of the tree. Then, the probability of
labeling a pixel as skin at the root of the tree is given by
p(ys = 1|xs , s ∈ Tk ) ≈ φ(xs , ys ) mts (ys ) (3.120)
t∈N (s)
This method is quite faster than the Gibss sampler. In Fig. 3.16 we show
different skin detection results for the methods explored in this section.
What is image parsing? It is more than segmentation and more than recogni-
tion [154]. It deals with their unification in order to parse or decompose the
input image I into its constituent patterns, say texture, person and text, and
more (Fig. 3.17). Parsing is performed by constructing a parsing graph W. The
graph is hierarchical (tree-like) in the sense that the root node represents the
complete scene and each sibling represents a pattern which that be, in turn,
decomposed. There are also horizontal edges between nodes in the same level
of hierarchy. Such edges define spatial relationships between patterns. Hierar-
chical edges represent generative (top-down) processes. More precisely, a graph
W is composed of the root node representing the entire scene and a set of K
siblings (one per pattern). Each of these siblings i = 1, . . . , K (intermediate
nodes) is a triplet of attributes (Li , ζi , Θi ) consisting of the shape descriptor Li
determining the region R(Li ) = Ri (all regions corresponding to the patterns
must be disjoint and their union must be the scene); the type (family) of visual
pattern ζi (faces, text characters, and so on); and the model parameters Θi (see
Fig. 3.21). Therefore, the tree is given by W = (K, {(Li , ζi , Θi ) i = 1, . . . , K})
where K is, of course, unknown (model order selection). Thus, the posterior
of a candidate generative solution W is quantified by
where it is reasonable to assume that p(K) and p(Θi |ζi ) are uniform and the
term p(ζi |Li ) allows to penalize high model complexity and may be estimated
from the training samples (learn the best parameters identifying samples of
a given model type, as in the example of texture generation described in
Chapter 5). In addition, the model p(Li ), being Li = ∂R(Li ) the contour, is
assumed to decay exponentially with its length and the enclosed area, when
3.6 Integrating Segmentation, Detection and Recognition 87
point process
Fig. 3.17. A complex image parsing graph with many levels and types of patterns.
(Figure by Tu et al. 2005
c Springer.) See Color Plates.
G
n
p2 (IR(L) |L, ζ = 2, Θ) = hj j , Θ = (h1 , . . . , hG )
j=0
p3 (IR(L) |L, ζ ∈ {3, C}, Θ) = G(Ip − Jp ; σ 2 ) , Θ = (a, . . . , f, σ)
p∈R(L)
a binary-valued strong classifier h(T (I)) for test T (I) = (h1 (I), . . . , hn (I))
composed of n binary-valued weak classifiers (their performance is slightly
better than chance). Regarding shape-affinity cues and region-affinity ones,
they propose matches between shape boundaries and templates, and estimate
the likelihood that two regions were generated by the same pattern family
and model parameters. Model parameter and pattern family cues are based
on clustering algorithms (mean-shift, for instance see Chapter 5), which de-
pend on the model types. For the case of boosting, the hard classifier T (I)
can be learnt as a linear combination of the weak ones hi (I):
n
hf (T (I)) = sign hi (I) = sign(α · T (I)) (3.123)
i=1
M
(α∗ , T ∗ ) = arg min e−i (α·T (I)) (3.124)
α,T ⊂D
i=1
that is, an exponential loss is assumed when the proposed hard classifier works
incorrectly and the magnitude of the decay is the dot product. The connection
between boosting and the learning of discriminative probabilities q(wj |Tj (I))
is the Friedman’s theorem, which states that with enough training samples
X = M selected features n, Adaboost selects the weights and tests satisfying
eζ(α·T (I))
q( = ζ|I) = (3.125)
e(α·T (I)) + e−(α·T (I))
and the strong classifier converges asymptotically to the ratio test
q( = +1|I)
hf (T (I)) = sign(α · T (I)) = sign (3.126)
q( = −1|I)
Given this theorem, it is then reasonable to think that q(|T (I)) converge
to approximations of the marginals p(|I) because a limited number of tests
(features) are used. Regarding the training process for Adaboost, it is effec-
tive for faces and text. For learning texts, features of different computational
complexities are used [38]. The simplest ones are means and variances of in-
tensity and vertical or horizontal gradients, or of gradient magnitudes. More
complex features are histograms of intensity, gradient direction, and intensity
gradient. The latter two types of features may be assimilated to the statistical
edge detection framework (Chapter 2) so that it is straightforward to design a
90 3 Contour and Region-Based Image Segmentation
Fig. 3.18. Results of face and text detection. False positive and negative appear.
(Figure by Tu et al. 2005
c Springer.)
weak classifier as whether the log-likelihood ratio between the text and non-
text distributions is above a given threshold or not (here, it is important to
consider Chernoff information and obtain peaked on empirical distributions
as we remark in Prob. 3.12). More complex features correspond, for instance,
to edge detection and linkage. When a high number of features are considered,
weak classifiers rely either on individual log-likelihood ratios or on ratios over
pairwise histograms. Anyway, it is impossible to set the thresholds used in
the weak classifiers to eliminate all false positives and negatives at the same
time (see Fig. 3.18). In a DDMCMC approach, such errors must be corrected
by generative processes, as occurs in the detection of number 9, which will be
detected as a shading region and latter recognized as a letter. Furthermore,
in order to discard rapidly many parts of the image that do not contain a
text/face, a cascaded classifier is built. A cascade is a degenerated tree with
a classifier at each level; if a candidate region succeeds at a given level, it is
passed to the following (deeper) one, and otherwise, is discarded. Considering
the number of levels unknown beforehand, such number may be found if one
sets the maximum acceptable false positive rate per layer, the minimum ac-
ceptance rate per layer, and the false overall positive rate [170]. Furthermore,
as the classifiers in each layer may have different computational complexities,
it is desirable to allocate the classifiers to the levels by following also this cri-
terion. Although the problem of finding the optimal cascade, given the false
positive rate, zero false negatives in the training set, and the average com-
plexity of each classifier, is NP-complete, a greedy (incremental) solution has
been proposed in [39]. The rationale of such approach is that, given a maxi-
mum time to classify, it is desirable to choose for a given layer the classifier
maximizing the expected remaining time normalized by the expected number
of regions remaining to be rejected. If the fixed time is not enough to succeed
for a given maximum rate of false positives, additional time is used. Anyway,
this strategy favors the positioning of simple classifiers at the first levels and
speeds up 2.5 times the uniform-time cascade (see results in Fig. 3.19).
Once we have presented both the generative models and the discrimina-
tive methods, it is time to present the overall structure of the bidirectional
3.6 Integrating Segmentation, Detection and Recognition 91
Fig. 3.19. Results of text detection in a supermarket. Good application for the
visually impaired. (Figure by Chen et al. 2005
c IEEE.)
algorithm (see Fig. 3.20). Starting bottom-up, there are four types of com-
putations of q(w|T (I) (one for each type of discriminative task associated to
a node w in the tree). The key insight of the DDMCMC is that these com-
putations are exploited by the top-down processes, that is, these generative
processes are not fully stochastic. More precisely, the top-down flow, that is
the state transitions W → W , is controlled by a Markov chain K(W, W ),
which is the core of the Metropolis–Hastings dynamics. In this case, such ker-
nel is decomposed into four subkernels Ka : a = 1, . . . , 4, each one activated
with a given probability ρ(a, I). In turn, each of the subkernels that alters
the structure of the parsing tree (all except the model switching kernel and
the region-competition kernel moving the borders, which is not included in the
figure) is subdivided into two moves Kar and Kal (the first one for node cre-
ation, and the second for node deletion). The corresponding probabilities ρar
and ρal are also defined.
The computational purpose of the main kernel K with respect to the parse tree
is to generate moves W → W of three types (node creation, node deletion,
92 3 Contour and Region-Based Image Segmentation
K(W,W⬘)
Markov Kernel
K1 K2 K3 K4
text face generic
sub-kernel sub-kernel sub-kernel
r 1l r 1r r 2l r 2r r 3l r 3r model
switching
K1l K1r K2l K2r K3l K3r sub-kernel
birth death birth death split merge
generative
inference
discriminative
inference
I
input image
Fig. 3.20. Bidirectional algorithm for image parsing. (Figure by Tu et al. 2005
c
Springer.)
and change of node attributes) and drive the search toward sampling the
posterior p(W|I). The main kernel is defined in the following terms:
K(W |W : I) = ρ(a : I)Ka (W |W : I) where ρ(a : I) = 1 (3.127)
a a
and ρ(a : I) > 0. In the latter definition, the key fact is that both the activa-
tion probabilities and the subkernels depend on the information in the image
I. Furthermore, the subkernels must be reversible (see Fig. 3.21) so that the
main kernel satisfies such property. Reversibility is important to ensure that
the posterior p(W|I) is the stationary (equilibrium) distribution. Thus, keep-
ing in mind that Ka (W |W : I) is a transitionmatrix representing the prob-
ability of the transition W → W (obviously W Ka (W |W : I) = 1 ∀ W),
kernels with creation/deletion moves must be grouped into reversible pairs
(creation with deletion): Ka = ρar Kar (W |W : I) + ρal Kal (W |W : I), being
ρar + ρal = 1. With this pairing, it is ensured that Ka (W |W : I) = 0 ⇔
Ka (W|W : I) = 1 ∀ W, W ∈ Ω, and after that pairing Ka is built in order
to satisfy
p(W|I)Ka (W |W : I) = p(W |I)Ka (W|W : I) (3.128)
which is the so-called detailed balance equation [175] whose fulfillment ensures
reversibility. If all subkernels are reversible, the main one is reversible too.
3.6 Integrating Segmentation, Detection and Recognition 93
Another key property to fulfill is ergodicity (it is possible to go from one state
to every other state, that is, it is possible to escape from local optima). In
this case, ergodicity is ensured provided that enough moves are performed.
Reversibility and ergodicity ensure that the posterior p(W|I) is the invariant
probability of the Markov chain. Being μt (W) the Markov chain probability
of state W, we have
μt+1 (W) = Ka(t) (W |W)μt (W) (3.129)
W
and δ(Ka ) > 0, being only zero when the Markov chain becomes stationary,
that is, when p = μ. Denoting by μt (Wt ) the state probability at time t, the
one at time t + 1 is given by
μt+1 (Wt+1 ) = μt (Wt )Ka (Wt+1 |Wt ) (3.132)
Wt
μ(Wt , Wt+1 ) = μt (Wt )Ka (Wt+1 |Wt ) = μt+1 (Wt+1 )pM C (Wt |Wt+1 )
(3.133)
94 3 Contour and Region-Based Image Segmentation
= D(p(Wt+1 )||μ(Wt+1 ))
Ka (Wt |Wt+1 )
+ p(Wt+1 ) Ka (Wt |Wt+1 ) log
pM C (Wt |Wt+1 )
Wt+1 Wt
= D(p(Wt+1 )||μ(Wt+1 ))
+Ep(Wt+1 ) [D(Ka (Wt |Wt+1 )||pM C (Wt |Wt+1 ))] (3.136)
Therefore, we have that
δ(Ka ) ≡ D(p(Wt )||μ(Wt )) − D(p(Wt+1 )||μ(Wt+1 ))
= Ep(Wt+1 ) [D(Ka (Wt |Wt+1 )||pM C (Wt |Wt+1 ))] ≥ 0 (3.137)
that is, δ(Ka ) measures the amount of decrease of divergence for the kernel Ka ,
that is, the convergence power of such kernel. It would be interesting to con-
sider this information to speed up the algorithm sketched in Fig. 3.20, where
3.6 Integrating Segmentation, Detection and Recognition 95
each kernel is activated with probability ρ(.). What is done for the moment
is to make the activation probability dependent on the bottom-up processes.
For texts and faces, ρ(a ∈ 1, 2 : I) = {ρ(a : I) + kg(N (I))/Z, being N (I)
the number of text/faces proposals above a threshold ta , g(x) = x, x ≤ Tb ,
g(x) = Tb , x ≥ Tb , and Z = 1 + 2k (normalization). For the rest of kernels,
there is a fixed value that is normalized accordingly with the evolution of ac-
tivation probabilities of the two first kernels ρ(a ∈ 3, 4 : I) = ρ(a ∈ 3, 4 : I)/Z.
Once we have specified the design requirements of the sub-kernels and
characterized them in terms of convergence power, the next step is to design
them according to the Metropolis–Hastings dynamics:
The key elements in the latter definition are the proposal probabilities
Qa , which consist of a factorization of several discriminative probabilities
q(wj |Tj (I)) for the elements wj changed in the proposed transition W → W .
Thus, we are assuming implicitly that Qa are fast to compute because many
of them rely on discriminative process. For the sake of additional global effi-
ciency, it is desirable that Qa proposes transitions where the posterior p(W |I)
is very likely to be high, and at the same time, that moves are as larger as pos-
sible. Therefore, let Ωa (W) = {W ∈ Ω : Ka (W |W : I) > 0} be the scope,
that is, the set of reachable states from W in one step using Ka . However,
not only large scopes are desired, but also scopes containing states with high
posteriors. Under this latter consideration, the proposals should be designed
as follows:
p(W |I)
Qa (W |W : Ta (I)) ∼ if W ∈ Ωa (W) (3.140)
W ∈ Ωa (W)p(W |I)
and should be zero otherwise. For that reason, the proposals for creat-
ing/deleting texts/faces can consist of a set of weighted particles (Parzen
windows were used in DDMCMC for segmentation in Section 3.4). More pre-
cisely, Adaboost, assisted by a binarization process for detecting character
boundaries, yields a list of candidate text shapes. Each particle z (shape) is
weighted by ω. We have two sets, one for creating and the other for deleting
text characters:
where a = 1 for text characters. Weights ωar for creating new characters
are given by a similarity measure between the computed border and the de-
formable template. Weights ωal for deleting characters are given by their pos-
teriors. The idea behind creating and deleting weights is to approximate the
ratios p(W |I)/p(W|I) and p(W|I)/p(W |I), respectively. Anyway, the pro-
posal probabilities given by weighted particles are defined by
being EI the expectation with respect to p(I) and ET,T+ the one with respect
to the probability of test responses (T, T+ ) induced by p(I). The key insight
behind the equalities above is twofold. Firstly, the fact that the divergence
of q(.) with respect to the marginal p(.) decreases as new tests are added.
Secondly, the degree of decreasing of the average divergence yields a useful
measure for quantifying the effectiveness of a test with respect to other choices.
Regarding the proof of the equalities in Eq. 3.143, let us start by finding a
more compact expression for the difference of expectations EI . For the sake of
clarity, in the following we are going to drop the dependency of the tests on I,
while keeping in mind that such dependency actually exists, that is, T = T (I)
and T+ = T+ (I):
I(w; T, T+ ) − I(w; T )
! " ! "
q(w, T, T+ ) q(w, T )
= q(w, T, T+ ) log − q(w, T ) log
w
q(T, T+ )q(w) w
q(T )q(w)
T,T+ T
! "
q(w, T, T+ ) q(w, T )
= q(w, T, T+ ) log − q(w, T ) log
w
q(T, T+ )q(w) q(T )q(w)
T,T+
! "
= {q(w, T, T+ ) log q(w|T, T+ ) − q(w, T ) log q(w|T )}
T,T+ w
! "
p(w, I) p(w, I)
= log q(w|T, T+ ) − log q(w|T )
w
q(x|T, T+ ) q(x|T )
T,T+
! "
p(w, I) p(w, I)
= log q(w|T, T+ ) − log q(w|T )
w
q(x) q(x)
T,T+
! "
1
= {p(w, I) log q(w|T, T+ ) − p(w, I) log q(w|T )}
q(Ix ) w
I
! "
q(w|T, T+ )
= p(w, I) log (3.146)
w
q(w|T )
I
As we have seen, discriminative processes are fast but prone to error, whereas
generative ones are optimal but too slow. Image parsing enables competitive
and cooperative processes for patterns in an efficient manner. However, the
algorithm takes 10–20 min to process images with results similar to those pre-
sented in Fig. 3.22, where the advantage is having generative models for syn-
thesizing possible solutions. However, additional improvements, like a better
management of the segmentation graph, may reduce the overall computation
time under a minute.
Bottom-up/top-down integration is not new either in computer vision or
in biological vision. For instance, in the classical of Ullman [161], the counter-
stream structure is a computational model applied to pictorial face recognition,
where the role of integrating bottom-up/top-down processes is compensating
for image-to-model differences in both directions: from pictures to models
(deal with variations of position and scale) and from models to pictures (solve
differences of viewing direction, expression and illumination). However, it is
very difficult and computational intensive to build a generative model for
faces, which takes into account all these sources of variability, unless sev-
eral subcategories of the same face (with different expressions) are stored and
queried as current hypothesis. Other key aspects related to the combination of
the recognition paradigms (pure pictorial recognition vs from parts to objects)
and the capability of generalization from a reduced number of views are clas-
sical topics both in computer and biological vision (see for instance [152]).
Recently emerged schemas like bag of words (BoW) are in the beginning
of incorporating information-theoretic elements (see for instance [181] where
100 3 Contour and Region-Based Image Segmentation
Fig. 3.22. Results of image parsing (center column) showing the synthesized images.
(Figure by Tu et al. 2005
c Springer.)
Problems
3.1 Implicit MDL and region competition
The region competition approach by Zhu and Yuille [184] is a good example
of implicit MDL. The criterion to minimize, when independent probability
models are assumed for each region, is the following:
K
μ
E(Γ, {Θi }) = ds − log p(I(x, y)|Θi )dxdy + λ
i=1
2 ∂Ri Ri
where Γ = ∪K i=1 ∂Ri , the first term is the length of curve defining the boundary
∂Ri , mu is the code length for a unit arc (is divided by 2 because each edge
fragment is shared by two regions). The second term is the cost of coding each
pixel inside Ri the distribution specified by Θi . The minimization attending
MDL is done in two steps. In the first one, we estimate
the optimal parameters
by solving Θi∗ as the parameters are maximizing (x,y)∈Ri p(I(x, y)|Θi ). In the
3.6 Integrating Segmentation, Detection and Recognition 101
second phase, we re-estimate each contour following the motion equation for
the common contour Γij between two adjacent regions Ri and Rj :
dΓij p(I(x, y)|Θi )
= −μκi ni + log ni
dt p(I(x, y)|Θj )
where κi is the curvature, and ni = −nj is the normal (κi ni = κj nj ). Find an
analytical expression for the two steps assuming that the regions are character-
ized by Gaussian distributions. Then, think about the role of the log-likelihood
ratio in the second term of the motion equation. For the Gaussian case, give
examples of how decisive is the log-ratio when the distributions have the same
variance, but closer and closer averages. Hint: evaluate the ratio by computing
the Chernoff information.
3.2 Green’s theorem and flow
The derivation of energy functions of the form E(Γ ) = R
f (x, y)dxdy is
usually done by exploiting the Green’s theorem, which states that for any
vector (P (x, y), Q(x, y)) we have
l
∂Q ∂P
− dxdy = (P dx + Qdy) = (P ẋ + Qẏ)ds
R ∂x ∂y ∂R 0
We must choose P and Q so that ∂Q∂x − ∂y = f (x, y). For instance, setting
∂P
1 x 1 y
Q(x, y) = f (t, y)dt P (x, y) = − f (x, t)dt
2 0 2 0
Using L(x, ẋ, y, ẏ) = Q(x, y)ẏ + P (x, y)ẋ, show that we can write E(Γ ) =
l
0
L(x, ẋ, y, ẏ)ds. Prove also that using the Euler–Lagrange equations we
finally find that
EΓ (Γ ) = f (x, y)n
being n = (ẏ, −ẋ) the normal, and (P, Q) = (F1 , F2 )
3.3 The active square
Given an image with a square foreground whose intensities follow a Gaussian
distribution with (μ1 , σ1 ) over a background also Gaussian (μ2 , σ2 ). Initialize a
square active polygon close to the convergence point and make some iterations
to observe the behavior of the polygon. Test two different cases: (i) quite
different Gaussians, and (ii) very similar Gaussians.
3.4 Active polygons and maximum entropy
What is the role of the maximum entropy principle in the active polygons
configuration? Why maximum entropy estimation is key in this context? What
is the main computational advantage of active polygons vs active contours?
3.5 Jensen–Shannon divergence
Why is the Jensen–Shannon divergence used for discriminating the foreground
from the background?
102 3 Contour and Region-Based Image Segmentation
the latter types (Chapter 1). Therefore, the weak classifiers are of the form
ptext
hi (I) = log pnon−text > ti . As we will see in Chapter 7, the αi are computed
sequentially (greedily) within Adaboost (choosing a predefined order). What
is more important is that such values depend on the effectiveness of the clas-
sifier hi (the higher the number of misclassified examples, the lower becomes
αi ). This virtually means that a classifier with αi = 0 is not selected (has no
impact in the strong classifier). In this regard, what types of features will be
probably excluded for text detection?
Considering now the thresholds ti , incrementing them usually leads to a
reduction of false positives, but also to an increment of false negatives. A more
precise threshold setting would depend on the amount of overlap of the on and
off distributions, that is, Chernoff information. How would you incorporate
this insight into the current threshold setting?
3.13 Proposal probabilities for splitting and merging regions
Estimate the weights ω3r and ω3l for splitting and merging regions in the image
parsing approach. Consider that these weights approximate (efficiently) the
ratios p(W|I)/p(W |I) and p(W |I)/p(W|I), respectively. Consider that in a
splitting move, region Rk existing in state W is decomposed into Ri and Rj
in W, whereas two adjacent regions Ri and Rj existing in W are fused into
Rk . Remember that in the approximation, a key ratio is the one between the
compatibility measure of Ri and Rj and the likelihood of the Rk region for
the split case, and the one of the two regions (assumed independent) for the
fusion case.
3.14 Markov model and skin detection
What is the main motivation of using a Markov random field for modeling the
skin color in images? How is the estimated extra computational cost of using
this kind of models?
3.15 Maximum Entropy for detection
The maximum entropy principle is introduced in the context of skin color de-
tection in order to introduce pairwise dependencies independently of the ori-
entation of two 4-neighbors. Explain how the original ME formulation evolves
to the consideration of color gradients and how the ME algorithm estimates
the Lagrange multipliers.
4.1 Introduction
This chapter mainly deals with the way in which images and patterns are
compared. Such comparison can be posed in terms of registration (or align-
ment). Registration is defined as the task of finding the minimal amount of
transformation needed to transform one pattern into another (as much as pos-
sible). The computational solution to this problem must be adapted to the
kind of patterns and to the kind of transformations allowed, depending on
the domain. This also happens with the metric or similarity measure used for
quantifying the amount of transformation. In both cases, information theory
plays a fundamental role, as we will describe along the present chapter.
In the case of images, it is reasonable to exploit the statistics of inten-
sities. The concept of mutual information (MI), which quantifies statistical
dependencies between variables, is a cornerstone here because such variables
are instantiated to intensity distributions. Therefore, image registration can
be posed as finding the (constrained) transformation that holds the maximal
dependency between distributions. This rationale opens the door to the quest
for new measures rooted in mutual information.
In the case of contours, registration stands for finding minimal deforma-
tions between the input and the target contours. Here, the space of defor-
mations is less constrained, but these should be as smooth as possible. It is
possible to represent a shape as a mixture or a distribution of keypoints. Such
representation enables the use of other types of information theoretic mea-
sures, like the Jensen–Shannon (JS) divergence. Furthermore, it is possible to
exploit interesting information geometry concepts, like the Fisher–Rao metric
tensor.
Finally, in the case of structural patterns, like graphs, information theory
is a key for driving the unsupervised learning of structural prototypes. For
instance, binary shapes can be described by a tree-like variant of the skeleton,
known as shock-tree. Then, shape information is collapsed into the tree, and
shape registration can be formulated as tree registration using a proper tree
F. Escolano et al., Information Theory in Computer Vision and Pattern Recognition, 105
c Springer-Verlag London Limited 2009
106 4 Registration, Matching, and Recognition
distance. What is more important is that the MDL principle can be applied
to find a prototypical tree for each class of shape.
However, conditional entropy also has the constancy problem: it would be low
for an image v(T (x)), which is predictable from the model u(x), and it would
also be low for an image that is predictable by itself, which is the case of
constant images.
Adding a penalization to simple images solves the constancy problem. The
first term of the following minimization objective is the conditional entropy.
The second term awards images with a higher entropy than images with low
entropy: & + , + ,'
arg min H v(T (x))|u(x) − H v(T (x)) (4.4)
T ∈Ω
Given that H(v(T (x))|u(x)) = H(v(T (x)), u(x))−H(u(x)), the latter formula
(Eq. 4.4) corresponds with mutual information maximization:
& + , + , + ,'
arg min H v(T (x)), u(x) − H u(x) − H v(T (x)) (4.5)
T ∈Ω
& + , + , + , '
= arg max H u(x) + H v(T (x)) − H v(T (x)), u(x) (4.6)
T ∈Ω
+ ,
= arg max I u(x), v(T (x)) (4.7)
T ∈Ω
The third term in Eq. 4.6, H(v(T (x)), u(x)), is the one that awards transfor-
mations for which u better explains v. The term H(v(T (x))) contributes to
select transformations that make the model u correspond with more complex
parts of the image v. The term H(u(x)) remains constant for any T , and so
it does not condition the alignment.
An alignment example illustrating the meaning of conditional entropy can
be seen in Fig. 4.1. In this simple example, the images were obtained from
the same sensor, from two slightly different positions. Therefore, due to the
differences of perspective, alignment cannot be perfect. Two cases are repre-
sented: a misalignment and a correct alignment. For the first case, the joint
histogram is much more homogeneous than for the correct alignment, where
the joint histogram is tightly concentrated along the diagonal. For more com-
plex cases, involving different sensors, it is worth noting that even though the
representation of the model and the actual image could be very different, they
will have a higher mutual information when the alignment is the correct one.
A major problem of the latter approach is the estimation of entropy.
Viola and Wells III [171] explain in detail the use of mutual information
for alignment. They also present a method for evaluating entropy and mutual
information called Empirical entropy manipulation and analysis (EMMA),
which optimizes entropy, based on a stochastic approximation. EMMA uses
the Parzen window method, which is presented in the next section.
108 4 Registration, Matching, and Recognition
Fig. 4.1. Alignment problem example: Top: images obtained from the same sensor,
from two slightly different positions. Center-left: a misalignment. Center-right: joint
histogram of the misalignment. Bottom-left: a correct alignment. Bottom-right: joint
histogram of the alignment.
The Parzen’s windows approach [51, 122] is a nonparametric method for es-
timating probability distribution functions (pdfs) for a finite set of patterns.
The general form of these pdfs is
1
P ∗ (Y, a) ≡ Kψ (y − ya ) (4.8)
Na y ∈a
a
where a is a sample of the variable Y , Na is the size of the sample, and K(.) is
a kernel of width ψ and centered in ya . This kernel has to be a differentiable
4.2 Image Alignment and Mutual Information 109
function, so that the entropy H can be derived for performing a gradient de-
scent over it. A Gaussian kernel is appropriate for this purpose. Also, let us as-
sume that the covariance matrix ψ is diagonal, that is, ψ = Diag(σ12 , ..., σN2
a
).
This matrix indicates the widths of the kernel, and by forcing ψ to be diag-
onal, we are assuming independence among the different dimensions of the
kernel, which simplifies the estimation of these widths. Then, the kernel is
expressed as the following product:
2
1 d
1 y j − yaj
Kψ (y − ya ) = d exp − (4.9)
i=1 σi (2π)
d/2
j=1
2 σj
where y j represents the jth component of y, and yaj represents the jth com-
ponent of kernel ya . The kernel widths ψ are parameters that have to be
estimated. In [171], a method is proposed for adjusting ψ using maximum
likelihood.
The entropy of a random variable Y is the expectation of the negative
logarithm of the pdf:
where b refers to the samples used for estimating entropy, and (b) is their
likelihood.
In the alignment problem, the random variable Y has to be expressed as
a function of a set of parameters T . The derivative of H with respect to T is
∂ ∗
H (Y (T ))
∂T
1 K (y − ya ) ∂
ψ b (yb − ya ) ψ −1
T
≈ (yb − ya ) (4.11)
Nb y ∈a
yb ∈b ya ∈a Kψ (yb − ya )
a
∂T
The first factor of the double summation is a weighting factor that takes a value close to one if y_a is much closer to y_b than the other elements of a are. Conversely, if some other element of a is closer to y_b than y_a is, the weighting factor approaches zero. Let us denote the weighting factor as W_Y, which assumes a diagonal matrix \psi_Y of kernel widths for the random variable Y:

W_Y(y_b, y_a) \equiv \frac{K_{\psi_Y}(y_b - y_a)}{\sum_{y_a' \in a} K_{\psi_Y}(y_b - y_a')}.   (4.12)
The weighting factor helps find a transformation T , which reduces the av-
erage squared distance between those elements that are close to each other,
forming clusters.
For the mutual information between the model image u and the transformed image v(T(\cdot)), an analogous approximation of the derivative is obtained:

\frac{\partial}{\partial T} I^*(T) \approx \frac{1}{N_b} \sum_{x_b \in b} \sum_{x_a \in a} (v_b - v_a)^T \left[ W_v(v_b, v_a)\, \psi_v^{-1} - W_w(w_b, w_a)\, \psi_{vv}^{-1} \right] \frac{\partial}{\partial T}(v_b - v_a).   (4.14)

Here, v_i denotes v(T(x_i)) and w_i denotes [u(x_i), v(T(x_i))]^T for the joint density. Also, the widths of the kernels for the joint density are block diagonal: \psi_{uv}^{-1} = \mathrm{Diag}(\psi_{uu}^{-1}, \psi_{vv}^{-1}). In the last factor, (v(T(x_b)) - v(T(x_a))) has to be differentiated with respect to T, so the expression of the derivative depends on the kind of transformation involved.
Given the derivatives, a gradient descent can be performed. The EMMA algorithm performs a stochastic analog of gradient descent, and the stochastic approximation helps to avoid falling into local minima. The resulting algorithm proved to find successful alignments for sample sizes N_a, N_b of about 50 samples.
Fig. 4.2. Normalized Cross Correlation (left) and Mutual Information (right) of image alignments produced by translations along both axes. It can be observed that Mutual Information has smaller local maxima, so the global maximum (in the center of the plot) is easier to find with a stochastic search. Figure by courtesy of J.M. Sáez.
Fig. 4.3. How a classical joint histogram is built. Each cell of the histogram counts the number of times a given combination of intensities occurs at pixels that have the same coordinates in the two images.
Fig. 4.4. Joint histograms of two different photographs which show the same view
of a room. The numbers of bins of the histograms are 10 (left), 50 (center ), and 255
(right).
Fig. 4.5. The joint histograms of the photographs of Fig. 4.4, after the addition of Gaussian noise to one of the images. The numbers of bins of the histograms are 10 (left), 50 (center), and 255 (right).
Figure 4.4 shows the joint histograms of two images which show the same scene from a very similar point of view. The first histogram has B = 10 bins, the second one B = 50, and the third one B = 255. We can observe that 10 bins do not provide enough information to the histogram. However, 255 bins produce a very sparse histogram (the smaller the image is, the sparser the histogram will be). Moreover, a large number of bins makes the histogram very vulnerable to noise in the image. We can see the histograms of the same images with Gaussian noise in Fig. 4.5.
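The joint-histogram construction of Fig. 4.3, and the mutual information computed from it, can be sketched in a few lines of NumPy. This assumes 8-bit images; the number of bins is the free parameter whose effect is discussed above, and the function names are ours.

import numpy as np

def joint_histogram(img1, img2, bins=50):
    """Count co-occurring intensities at pixels with the same coordinates."""
    h, _, _ = np.histogram2d(img1.ravel(), img2.ravel(),
                             bins=bins, range=[[0, 256], [0, 256]])
    return h

def mutual_information(img1, img2, bins=50):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from the joint histogram."""
    pxy = joint_histogram(img1, img2, bins)
    pxy = pxy / pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    def H(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))
    return H(px) + H(py) - H(pxy.ravel())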
To deal with the histogram-binning problem, Rajwade et al. have proposed a new method [133, 134], which considers the images to be continuous surfaces. In order to estimate the density of the distribution, a number of samples have to be taken from the surfaces; the amount of samples determines the precision of the approximation. For estimating the joint distribution of the two images, each point of the image is formulated as the intersection of two level curves, one level curve per image. In a continuous surface (see Fig. 4.6), there are infinitely many level curves. In order to make the method computable in finite time, a number Q of intensity levels has to be chosen, for example, Q = 256. A lower Q would proportionally decrease the precision of the estimation because the level curves of the surface would have a larger separation and, therefore, some small details of the image would not be taken into consideration.
In order to consider the image as a continuous surface, sub-pixel interpolation has to be performed. For the center of each pixel, four neighbor points at a distance of half a pixel are considered, as shown in Fig. 4.7a. Their intensities can be calculated by vertical and horizontal interpolation. These four points form a square that is divided into two triangles, both of which are processed in the same way. For a given image, for example I_1, a triangle is formed by three points p_i, i ∈ {1, 2, 3}, with known positions (x_{p_i}, y_{p_i}) and known intensities z_{p_i}. Assuming that the intensities within a triangle are represented as a planar patch, the patch is given by the equation z = A_1 x + B_1 y + C_1.
Fig. 4.6. Top: the continuous surface formed by the intensity values of the labora-
tory picture. Bottom: some of the level curves of the surface.
For each triangle, once we have its values of A1 , B1 , and C1 for the image I1
and A2 , B2 , and C2 for the image I2 , we can decide whether to add a vote
for a pair of intensities (α1 , α2 ). Each one of them is represented as a straight
Fig. 4.7. (a) Subpixel interpolation: the intensities of four points around a pixel are calculated, and the square they form is divided into two triangles. (b), (c) The iso-intensity lines of I1 at α1 (dashed line) and of I2 at α2 (continuous line) for a single triangle. In the first case (b), they intersect inside the triangle, so we vote for p(α1, α2). In cases (c) and (d), there is no vote because the lines do not intersect inside the triangle; (d) is the case of parallel gradients. In (e), the iso-surfaces are represented, and their intersection area within the triangle represents the amount of vote for the pair of intensity ranges.
line in the triangle, as shown in Fig. 4.7b and c. The equations of these lines are given by A, B, and C:

A_1 x + B_1 y + C_1 = \alpha_1, \qquad A_2 x + B_2 y + C_2 = \alpha_2.   (4.19)

These equations form a system that can be solved for the intersection point (x, y). If this point lies inside the triangle (p_1, p_2, p_3), a vote is added for the pair of intensities (α1, α2). All the triangles in the image have to be processed in the same way, for each pair of intensities. Some computational optimizations can be used in the implementation in order to avoid repeating calculations.
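A direct transcription of the point-counting vote for one triangle might look as follows. This is a sketch: A1, B1, C1 and A2, B2, C2 are the planar-patch coefficients of the triangle in I1 and I2, the triangle is given by its three vertices, and the helper names are ours.

import numpy as np

def point_in_triangle(p, p1, p2, p3, eps=1e-9):
    """Barycentric test for p lying inside the triangle (p1, p2, p3)."""
    T = np.array([[p1[0] - p3[0], p2[0] - p3[0]],
                  [p1[1] - p3[1], p2[1] - p3[1]]])
    try:
        l1, l2 = np.linalg.solve(T, np.asarray(p, float) - np.asarray(p3, float))
    except np.linalg.LinAlgError:
        return False                       # degenerate triangle
    return min(l1, l2, 1.0 - l1 - l2) >= -eps

def vote(A1, B1, C1, A2, B2, C2, alpha1, alpha2, p1, p2, p3):
    """True if the iso-lines of I1 at alpha1 and I2 at alpha2 (Eq. 4.19)
    intersect inside the triangle; parallel gradients produce no vote."""
    M = np.array([[A1, B1], [A2, B2]])
    if abs(np.linalg.det(M)) < 1e-12:
        return False                       # parallel iso-lines
    xy = np.linalg.solve(M, np.array([alpha1 - C1, alpha2 - C2]))
    return point_in_triangle(xy, p1, p2, p3)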
The results of this first approach can be observed in Figs. 4.8 and 4.9 (third row). This method, however, has the effect of assigning more votes to zones with higher gradient, and when the gradients are parallel there is no vote at all, see Fig. 4.7d. In [134], Rajwade et al. have presented an improved method that, instead of counting intersections of iso-contours, sums intersection areas of iso-surfaces. A similar approach to density estimation had been previously taken by Kadir and Brady [90] in the field of image segmentation. An example of intersecting iso-surfaces is shown in Fig. 4.7e. When the gradients are parallel, their intersection with the triangle is still considered. In the case in which one image has zero gradient, the considered area is the intersection of the iso-surface area of the other image with the triangle. This approach produces much more robust histograms. Examples can be observed in Figs. 4.8 and 4.9 (fourth row). The difference among the methods is more pronounced in the images of the laboratory because these images are poorly textured but have many edges.
Fig. 4.8. Left: the misaligned case; right: the aligned case. First row: the alignment of the images; second, third, and fourth rows: classical, point-counting, and area-based histograms, respectively.
Fig. 4.9. Left: the misaligned case; right: the aligned case. First row: the alignment of the images; second, third, and fourth rows: classical, point-counting, and area-based histograms, respectively.
[Figure: search surfaces of the alignment measure obtained with the point-counting histogram (panel labeled POINT COUNTING) and with the area-based histogram (panel labeled AREA BETWEEN ISOCONTOURS), plotted against horizontal and vertical translation.]
The search surfaces are two-dimensional because only horizontal and vertical translations were used as parameters for the alignment. It can be seen that the method based on iso-contours, or point counting, yields a much more abrupt search surface than the iso-surfaces, or area-based, method.
A metric d(·, ·) must satisfy the following properties:
1. Nonnegativity: d(x, y) ≥ 0.
2. Identity of indiscernibles: d(x, y) = 0 ⇔ x = y.
3. Symmetry: d(x, y) = d(y, x).
4. Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z).
A pseudometric relaxes the second property to d(x, y) = 0 ⇐ x = y; this implies that two different points can have a zero distance.
Mutual information can be normalized by one of the marginal entropies,

\frac{I(X; Y)}{H(Y)} \quad \text{or} \quad \frac{I(X; Y)}{H(X)}.
However, the entropies H(X) and H(Y ) are not necessarily the same.
A solution to this problem would be to normalize by the sum of both
entropies. The resulting measure is symmetric:
R(X, Y) = \frac{I(X; Y)}{H(X) + H(Y)} = \frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y)} = 1 - \frac{H(X, Y)}{H(X) + H(Y)}.
This measure actually quantifies redundancy and is zero when X and Y are independent. It is proportional to the entropy correlation coefficient (ECC), also known as symmetric uncertainty:

ECC(X, Y) = 2 - \frac{2}{NI_1(X, Y)} = 2 - \frac{2 H(X, Y)}{H(X) + H(Y)} = 2 R(X, Y).
In [179], Zhang and Rangarajan have proposed the use of a new information metric (actually a pseudometric) for image alignment, which is based on conditional entropies, \rho(X, Y) = H(X|Y) + H(Y|X), together with its normalized version

\tau(X, Y) = \frac{\rho(X, Y)}{H(X, Y)} = \frac{H(X|Y) + H(Y|X)}{H(X, Y)} = \frac{H(X, Y) - I(X; Y)}{H(X, Y)} = 1 - \frac{I(X; Y)}{H(X, Y)}.
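Given the marginal and joint entropies of two images (computed, for instance, from the joint-histogram code shown earlier), these normalized measures reduce to a few lines. A minimal sketch; the function name is ours:

def normalized_measures(Hx, Hy, Hxy):
    """R, ECC and the conditional-entropy (pseudo)metric of [179] from entropies."""
    I = Hx + Hy - Hxy            # mutual information
    R = I / (Hx + Hy)            # symmetric, redundancy-like measure
    ECC = 2.0 * R                # entropy correlation coefficient
    rho = Hxy - I                # H(X|Y) + H(Y|X)
    tau = 1.0 - I / Hxy          # rho normalized by the joint entropy
    return R, ECC, rho, tau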
Another important difference between the metric ρ(X, Y) and the mutual information measure is that the new information metric is easily extensible to the multimodal case (registration of more than two images). The mutual information (MI) of three variables is defined as

I(X; Y; Z) = I(X; Y) - I(X; Y|Z).   (4.24)

There is another definition that uses the same notation, I(X; Y; Z), called interaction information, also known as co-information [16]. It is commonly defined in terms of conditional mutual information, I(X; Y|Z) - I(X; Y), which is equivalent to the negative of the MI as defined in Eq. 4.24. The conditional MI is defined as

I(X; Y|Z) = \sum_{z} p(z) \sum_{x, y} p(x, y|z) \log \frac{p(x, y|z)}{p(x|z)\, p(y|z)}.
The normalized metric extends accordingly:

\tau(X, Y, Z) = \frac{\rho(X, Y, Z)}{H(X, Y, Z)}, \qquad \tau(X_1, X_2, \ldots, X_n) = \frac{\rho(X_1, X_2, \ldots, X_n)}{H(X_1, X_2, \ldots, X_n)}.
For a better understanding of ρ(X, Y, Z), τ (X, Y, Z), and I(X; Y ; Z), see
Fig. 4.11, where information is represented as areas. In [179], the compu-
tational complexity of estimating the joint probability of three variables is
taken into consideration. The authors propose the use of an upper bound of
the metric.
Fig. 4.11. Mutual Information I(X, Y, Z), joint entropy H(X, Y, Z), conditional
entropies H(· |· , · ), and marginal entropies H(· ) for three random variables X, Y ,
and Z, represented in a Venn diagram.
The optimal transformation will be the one that maximizes or minimizes some criterion. In the case of the ρ metric, we search for a T* that minimizes it:

T^* = \arg\min_{T} \rho(I_1, T(I_2)),

and, for the registration of n images,

\{T_1^*, \ldots, T_{n-1}^*\} = \arg\min_{\{T_1, \ldots, T_{n-1}\}} \rho(I_1, T_1(I_2), \ldots, T_{n-1}(I_n)).
Shearing is given by b:

\begin{pmatrix} x_I' \\ y_I' \\ 1 \end{pmatrix} = \begin{pmatrix} 1 & b & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_I \\ y_I \\ 1 \end{pmatrix} = \begin{pmatrix} x_I + b y_I \\ y_I \\ 1 \end{pmatrix}.

Finally, a rotation of θ radians about the origin along the z axis is given by a, b, c, and d all together:

\begin{pmatrix} x_I' \\ y_I' \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_I \\ y_I \\ 1 \end{pmatrix} = \begin{pmatrix} x_I \cos\theta - y_I \sin\theta \\ x_I \sin\theta + y_I \cos\theta \\ 1 \end{pmatrix}.
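In homogeneous coordinates these transformations are just 3×3 matrix products that can be chained. A minimal sketch (function names are ours; translation and scaling are handled analogously):

import numpy as np

def shear(b):
    return np.array([[1.0,   b, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c,  -s, 0.0],
                     [s,   c, 0.0],
                     [0.0, 0.0, 1.0]])

def transform(points, M):
    """Apply a 3x3 homogeneous transform to an (n, 2) array of points."""
    hom = np.hstack([points, np.ones((points.shape[0], 1))])
    return (hom @ M.T)[:, :2]

p = np.array([[1.0, 2.0]])
print(transform(p, rotation(np.pi / 2) @ shear(0.5)))   # shear first, then rotate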
Alfréd Rényi (20 March 1921 – 1 February 1970) was a Hungarian mathematician who contributed mainly to probability theory [137], combinatorics, and graph theory. One of his contributions to probability theory is a generalization of the Shannon entropy, known as the alpha-entropy (or α-entropy), or simply the Rényi entropy.
A random variable X with n possible values x_1, \ldots, x_n has some distribution P = (p_1, p_2, \ldots, p_n) with \sum_{k=1}^{n} p_k = 1. The probability of X taking the value x_i is usually denoted as P(X = x_i) and sometimes abbreviated as P(x_i). The Shannon entropy can be defined for the random variable or for its probability distribution. In the following definition, E denotes the expected value:

H(X) = E[-\log_b P(X)] = -\sum_{k=1}^{n} p_k \log_b p_k.
The base b of the logarithm logb will be omitted in many definitions in this
book because it is not significant for most pattern recognition problems.
Actually, the base of the logarithm logb determines the units of the entropy
measure:
Logarithm Units of H
log2 bit
loge nat
log10 dit/digit
Some pattern recognition approaches model the probability distributions
as a function, called probability density function (pdf). In that case, the
summation of elements of P becomes an integral. The definition of Shannon
entropy for a pdf f (z) is
H(f) = -\int_{z} f(z) \log f(z)\, dz.   (4.29)
The Rényi entropy of order α of the distribution P is defined as

H_\alpha(P) = \frac{1}{1 - \alpha} \log \sum_{i=1}^{n} p_i^\alpha, \qquad \alpha \in (0, 1) \cup (1, \infty).   (4.30)
In Fig. 4.12, it can be seen that for the interval α ∈ (0, 1) the entropy function H_α is concave, while for the interval α ∈ (1, ∞) it is neither concave nor convex. Also, for the first interval it is greater than or equal to the Shannon entropy, H_α(P) ≥ H(P), ∀α ∈ (0, 1), given that H_α is a nonincreasing function of α. The discontinuity at α = 1 is very significant. It can be shown, analytically and experimentally, that as α approaches 1, H_α tends to the value of the Shannon entropy (see the proof in Chapter 5). This relation between both entropies has been exploited in some pattern recognition problems. As shown in the following subsection, some entropy estimators provide an approximation to the value of the Rényi entropy. However, the Shannon entropy may be necessary in some problems. In Chapter 5, we present an efficient method to approximate the Shannon entropy from the Rényi entropy by finding a suitable α value.
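For a discrete distribution, Eq. 4.30 is a one-liner, and evaluating it for α close to 1 illustrates numerically the convergence to the Shannon entropy discussed above. A minimal sketch (function names and the toy distribution are ours):

import numpy as np

def renyi_entropy(p, alpha):
    """H_alpha(P) = log(sum_i p_i^alpha) / (1 - alpha), alpha != 1 (Eq. 4.30)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

P = np.array([0.5, 0.2, 0.2, 0.1])
for a in (0.5, 0.9, 0.99, 0.999):
    print(a, renyi_entropy(P, a))     # approaches the Shannon value as alpha -> 1
print('Shannon:', shannon_entropy(P))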
Fig. 4.12. Rényi and Shannon entropies of a Bernoulli distribution P = (p, 1 − p).
This estimator is asymptotically unbiased and consistent regardless of the pdf of the samples. Here, the function L_γ(X_n) is the length of the MST, and γ depends on the order α and on the dimensionality: α = (d − γ)/d. The bias correction β_{L_γ, d} depends on the graph minimization criterion, but it is independent of the pdf. There are some approximations which bound the bias by (i) Monte Carlo simulation
Fig. 4.13. Minimal spanning trees of samples with a Gaussian distribution (top)
and samples with a random distribution (bottom). The length Lγ of the first MST
is visibly shorter than the length of the second MST.
of uniform random samples on unit cube [0, 1]d and (ii) approximation for
large d: (γ/2) ln(d/(2πe)) in [21].
The length L_γ(X_n) of the MST is defined as the minimal total edge weight, with weights |e|^γ, over all acyclic spanning graphs of X_n:

L_\gamma^{MST}(X_n) = \min_{M(X_n)} \sum_{e \in M(X_n)} |e|^\gamma,   (4.33)
where γ ∈ (0, d). Here, M(X_n) denotes the possible sets of edges of a spanning tree graph, where X_n = {x_1, ..., x_n} is the set of vertices connected by the edges {e}. The weight of each edge e is the distance between its vertices raised to the power γ: |e|^γ. There are several algorithms for building an MST; for example, Prim's MST algorithm has a straightforward implementation. There are also estimators that use other kinds of entropic graphs, for example, the k-nearest neighbor graph [74].
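A minimal version of the MST-based Rényi entropy estimator (Eq. 4.32, discussed above) can be written with SciPy. The bias constant β_{L_γ,d} is simply dropped in this sketch, since it only shifts the estimate by a constant; the function name is ours.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_renyi_entropy(X, alpha=0.5):
    """Estimate H_alpha of the distribution of the rows of X (n x d) from the
    MST length L_gamma, with gamma = d * (1 - alpha).
    The bias constant beta_{L,d} is omitted (constant offset only)."""
    n, d = X.shape
    gamma = d * (1.0 - alpha)
    # Powering the Euclidean weights by gamma does not change the MST
    # (the power is monotone); it only changes the tree length.
    W = squareform(pdist(X)) ** gamma
    L = minimum_spanning_tree(W).sum()
    return np.log(L / n ** alpha) / (1.0 - alpha)

rng = np.random.default_rng(1)
print(mst_renyi_entropy(rng.normal(size=(500, 2)), alpha=0.5))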
Entropic spanning graphs are suitable for estimating the α-entropy for 0 ≤ α < 1, so the Shannon entropy cannot be directly estimated with this method. In [185], relations between the Shannon entropy and Rényi entropies of integer order are explored.
The Jensen–Rényi divergence of a set of distributions p_1, \ldots, p_n with weights ω_1, \ldots, ω_n is defined as

JR_\omega^\alpha(p_1, \ldots, p_n) = H_\alpha\!\left( \sum_{i=1}^{n} \omega_i p_i \right) - \sum_{i=1}^{n} \omega_i H_\alpha(p_i), \qquad \sum_{i=1}^{n} \omega_i = 1, \quad \omega_i \geq 0.
In [70], Hamza and Krim define this divergence and study some of its properties. They also present its application to image registration and segmentation. In the case of image registration in the presence of noise, the use of the Jensen–Rényi divergence outperforms the use of mutual information, as shown in Fig. 4.14.
For segmentation, one approach is to use a sliding window. The window is divided into two parts, W1 and W2, each one corresponding to a portion of the data (pixels, in the case of an image). The divergence between the distributions of the two subwindows is calculated. In the presence of an edge between two different regions A and B, the divergence is maximal when the window is centered at the edge, as illustrated in Fig. 4.15. In the case of two distributions, the divergence can be expressed as a function of the fraction λ of subwindow W2 that is included in region B.
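A sketch of the sliding-window detector on a 1D signal, using the Jensen–Rényi divergence between the normalized histograms of the two half-windows (equal weights ω1 = ω2 = 1/2; the parameter choices and function names are ours):

import numpy as np

def renyi(p, alpha):
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def jensen_renyi(p1, p2, alpha=0.7, w=(0.5, 0.5)):
    """JR = H_a(w1 p1 + w2 p2) - (w1 H_a(p1) + w2 H_a(p2))."""
    return renyi(w[0] * p1 + w[1] * p2, alpha) - (w[0] * renyi(p1, alpha)
                                                  + w[1] * renyi(p2, alpha))

def edge_profile(signal, half=20, bins=16, alpha=0.7):
    """JR divergence between the two halves of a window sliding over the signal."""
    lo, hi = signal.min(), signal.max()
    out = np.zeros(len(signal))
    for c in range(half, len(signal) - half):
        h1, _ = np.histogram(signal[c - half:c], bins=bins, range=(lo, hi))
        h2, _ = np.histogram(signal[c:c + half], bins=bins, range=(lo, hi))
        out[c] = jensen_renyi(h1 / h1.sum(), h2 / h2.sum(), alpha)
    return out

# two regions with different distributions; the profile peaks at the boundary
rng = np.random.default_rng(2)
s = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
profile = edge_profile(s)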
Fig. 4.14. Registration in the presence of noise. For the Jensen–Rényi divergence, α = 1 and the weights are ω_i = 1/n. Figure by A. Ben Hamza and H. Krim (2003 © Springer).
Fig. 4.15. Sliding window for edge detection. The divergence between the distribu-
tions of the subwindows W1 and W2 is higher when they correspond to two different
regions of the image.
Fig. 4.16. Edge detection results using the Jensen–Rényi divergence for various values of α. Figure by A. Ben Hamza and H. Krim (2003 © Springer).
and

D_B(f, g) = -\log BC(f, g)

is the Bhattacharyya distance, which is symmetric. The Hellinger dissimilarity is also related to the Bhattacharyya coefficient:

H(f, g) = \frac{1}{2}\left( 2 - 2\, BC(f, g) \right).

Its relation to the α-relative divergence of order α = 1/2 is then

H(f, g) = \frac{1}{2}\left( 2 - 2 \exp\left( -\tfrac{1}{2} D_{1/2}(f \| g) \right) \right).
The divergences with α = −1 and α = 2 are also worth mentioning:

D_{-1}(f \| g) = \frac{1}{2} \int \frac{(f(z) - g(z))^2}{f(z)}\, dz

and

D_2(f \| g) = \frac{1}{2} \int \frac{(f(z) - g(z))^2}{g(z)}\, dz.
From the definition of the α-divergence (Eq. 4.35), it is easy to define a
mutual information that depends on α. The mutual information is a measure
that depends on the marginals and the joint density f (x, y) of the variables
x and y. Defining g as the product of the marginals, g(x, y) = f (x)f (y), the
expression of the α-mutual information is
I_\alpha = D_\alpha(f \| g) = \frac{1}{\alpha - 1} \log \int\!\!\int f(x) f(y) \left[ \frac{f(x, y)}{f(x) f(y)} \right]^\alpha dx\, dy,
which, in the limit α → 1, converges to the Shannon mutual information.
Fig. 4.17. Ultrasound images of breast tumor, separated and rotated. Figure by H. Neemuchwala, A. Hero, S. Zabuawala and P. Carson (2007 © Wiley).
In [117], Neemuchwala et al. study the use of different measures for image registration. They also present experimental results, among which the one shown in Fig. 4.17 uses two images from a database of breast tumor ultrasound images; two different images are shown, one of them rotated. In Fig. 4.18, several alignment criteria are compared. For each one of them, different orders α are tested and represented.
Let us focus on shape as a set (list) of points, like a rough discretization of a contour or even a surface (Fig. 4.19). We have a collection of N shapes (data point sets), C = {X_c : c = 1, ..., N}, where X_c = {x_{ci} ∈ R^d : i = 1, ..., n_c} is a given set of n_c points in R^d, with d = 2 in the case of 2D shapes. A distributional model [9, 174] of a given shape can be obtained by modeling each point set as a Gaussian mixture. As we will see in more detail in the following chapter (clustering), Gaussian mixtures are good models for encoding a multimodal distribution: each mode (Gaussian cluster) is represented by a multidimensional Gaussian, with a given mean and covariance matrix, and the distribution is the convex combination (coefficients adding to one) of all the Gaussians. Gaussian centers may be placed at each point if the number of points is not too high. The problem of finding the optimal number of Gaussians and their placement and covariances will be covered in the first part of the next chapter. For the moment, let us assume that we have a collection of cluster-center point sets V = {V_c : c = 1, ..., N}, where each center point set consists of
Fig. 4.18. Normalized average profiles of image matching criteria for registration
of ultrasound breast tumor images. Plots are normalized with respect to the maxi-
mum variance in the sampled observations. Each column corresponds to a different
α order, and each row to a different matching criterion. (Row 1) k-NN graph-based
estimation of α-Jensen difference divergence, (row 2) MST graph-based estimation
of α-Jensen difference divergence, (row 3) Shannon Mutual Information, and (row 4)
α Mutual Information estimated using NN graphs. The features used in the exper-
iments are ICA features, except for the row 3, where the histograms were directly
calculated from the pixels. Figure by H. Neemuchwala, A. Hero, S. Zabuawala and
P. Carson (2007 © Wiley).
K^c points, α^c = (α_1^c, \ldots, α_{K^c}^c)^T being the corresponding mixing coefficients.
Let then Z denote the so called average atlas point set, that is, the point
set derived from registrating all the input point sets with respect to a com-
mon reference system. Registration means, in this context, the alignment of
all point sets Xc through general (deformable) transformations f c , each one
parameterized by μc . The parameters of all transformations must be recov-
ered in order to compute Z. To that end, the pdfs of each of the deformed
shapes may be expressed in the following terms:
p_c = p_c(x | V^c, \mu^c) = \sum_{a=1}^{K^c} \alpha_a^c\, p\!\left( x \mid f^c(v_a^c) \right),   (4.39)
Each deformed shape is modeled with a thin-plate-spline-like transformation f(x) = Ax + t + W\,U(x), where A_{d×d} and t_{d×1} denote the linear part, W_{d×n} encodes the nonrigid part, and U(x)_{n×1} encodes n basis functions (as many as control points) U_i(x) = U(||x − x_i||), each one centered at a point x_i, U(·) being a kernel function (for instance, U(r) = r^2 log(r^2) when d = 2). Thus, what f actually encodes is an estimation (interpolation) of the complete deformation field from the input points. The estimation process requires the proposed correspondences for each control point: x_1, ..., x_n. For instance, in d = 2 (2D shapes), we have for each x_i = (x_i, y_i)^T its corresponding point x_i' = (x_i', y_i')^T (known
The latter matrix is the building block of the matrix L_{(n+3)×(n+3)}, and the coordinates of the points x_i are placed in the columns of the matrix V_{2×n}:

L = \begin{pmatrix} K & P \\ P^T & 0_{3\times 3} \end{pmatrix}, \qquad V = \begin{pmatrix} x_1 & x_2 & \ldots & x_n \\ y_1 & y_2 & \ldots & y_n \end{pmatrix},   (4.42)

V_x being the first row of V. A similar rationale is applied for obtaining the coefficients for the Y dimension, and then we obtain the interpolated position for any vector x = (x, y)^T:

f(x) = \begin{pmatrix} f^x(x) \\ f^y(x) \end{pmatrix} = \begin{pmatrix} a_1^x & a_1^y \\ a_2^x & a_2^y \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t^x \\ t^y \end{pmatrix} + \begin{pmatrix} \sum_{i=1}^{n} w_i^x\, U(||x - x_i||) \\ \sum_{i=1}^{n} w_i^y\, U(||x - x_i||) \end{pmatrix}.   (4.44)
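A sketch of the evaluation step of Eq. 4.44 only; the solution of the linear system built from L, which yields the affine coefficients and the nonrigid weights, is not shown, and the function names are ours.

import numpy as np

def tps_kernel(r):
    """U(r) = r^2 log(r^2) for d = 2, with U(0) = 0."""
    r2 = r ** 2
    out = np.zeros_like(r2)
    mask = r2 > 0
    out[mask] = r2[mask] * np.log(r2[mask])
    return out

def tps_apply(x, ctrl, A, t, W):
    """Evaluate f(x) = A x + t + sum_i w_i U(||x - x_i||) at points x (m x 2).

    ctrl : (n, 2) control points x_i
    A    : (2, 2) linear part,  t : (2,) translation
    W    : (n, 2) nonrigid weights, one 2D weight per control point
    """
    x = np.atleast_2d(x)
    dists = np.linalg.norm(x[:, None, :] - ctrl[None, :, :], axis=2)   # (m, n)
    return x @ A.T + t + tps_kernel(dists) @ W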
where X_k is one of the i.i.d. samples from the pooled distribution, mapped to the set of pooled samples {X_{k_c}^c : k_c = 1, ..., n_c, c = 1, ..., N}. Then, taking the logarithm of the likelihood ratio, we have

\log \Lambda = \sum_{k=1}^{M} \log \sum_{c=1}^{N} \pi_c\, p_c(X_k) - \sum_{c=1}^{N} \sum_{k_c=1}^{n_c} \log p_c(X_{k_c}^c).   (4.48)
In information theory, the version of the weak law of large numbers (convergence in probability of \frac{1}{n}\sum_{i=1}^{n} X_i toward the average E(X) as the number of samples grows) is the Asymptotic Equipartition Property (AEP):

-\frac{1}{n} \log p(X_1, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^{n} \log p(X_i) \rightarrow -E(\log p(X)) = H(X).   (4.49)
that is, we are assuming that α_a^c = 1/K^c, K^c being the number of clusters of the cth distribution (uniform weighting) and, also for simplicity, a diagonal and identical covariance matrix Σ_a = σ^2 I. In addition, the deformed cluster centers are denoted by u_a^c = f^c(v_a^c).
Then, given again one of the i.i.d. samples X_k from the pooled distribution, mapped to one element of the pooled set {X_{k_c}^c : k_c = 1, ..., n_c, c = 1, ..., N}, where {X_{k_c}^c} is the set of n_c samples for the cth shape, the entropies of both the individual distributions and the pooled one may be estimated by applying again the AEP. For the individual distributions, we have

H(p_c) = -\frac{1}{n_c} \sum_{k_c=1}^{n_c} \log p_c(X_{k_c}^c) = -\frac{1}{n_c} \sum_{k_c=1}^{n_c} \log \left[ \frac{1}{K^c} \sum_{a=1}^{K^c} G(X_{k_c}^c - u_a^c, \sigma^2 I) \right].   (4.53)
Regarding the convex combination and choosing \pi_c = K^c / K, K = \sum_c K^c being the total number of cluster centers in the N point sets, we have

\sum_{c=1}^{N} \pi_c\, p_c = \sum_{c=1}^{N} \pi_c \frac{1}{K^c} \sum_{a=1}^{K^c} G(x - u_a^c, \sigma^2 I) = \frac{1}{K} \sum_{j=1}^{K} G(x - u_j, \sigma^2 I),   (4.54)
X_k being the pooled samples and u_j, j = 1, ..., K, the pooled centers. Consequently, the convex combination can be seen as a Gaussian mixture whose Gaussians are centered on the deformed cluster centers. This simplifies the estimation of the Shannon entropy, from the AEP, for the convex combination:

H\!\left( \sum_{c=1}^{N} \pi_c p_c \right) = H\!\left( \frac{1}{K} \sum_{j=1}^{K} G(X_k - u_j, \sigma^2 I) \right) = -\frac{1}{M} \sum_{j=1}^{M} \log \left[ \frac{1}{K} \sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I) \right],   (4.55)
M = \sum_c n_c being the sum of all samples. Therefore, the JS divergence is estimated by

JS_\pi(p_1, \ldots, p_N) = H\!\left( \sum_{c=1}^{N} \pi_c p_c \right) - \sum_{c=1}^{N} \pi_c H(p_c)
\approx -\frac{1}{M} \sum_{j=1}^{M} \log \left[ \frac{1}{K} \sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I) \right] + \sum_{c=1}^{N} \frac{K^c}{n_c K} \sum_{k_c=1}^{n_c} \log \left[ \frac{1}{K^c} \sum_{a=1}^{K^c} G(X_{k_c}^c - u_a^c, \sigma^2 I) \right].   (4.56)
Then, after the latter approximation, we may compute the partial derivatives needed to perform a gradient descent of the JS divergence with respect to the deformation parameters:

\nabla JS = \left[ \frac{\partial JS}{\partial \mu^1}, \frac{\partial JS}{\partial \mu^2}, \ldots, \frac{\partial JS}{\partial \mu^N} \right]^T,   (4.57)

and then, for the state vector (current parameters) \Theta = (\mu^{1\,T}, \ldots, \mu^{N\,T})^T and the weighting vector \Gamma = (\gamma^{1\,T}, \ldots, \gamma^{N\,T})^T, we have, for instance, the corresponding update rule, where ⊗ denotes the Kronecker product. For the sake of modularity, a key element is to compute the derivative of a Gaussian with respect to μ^c:
\frac{\partial G(X_{k_c}^c - u_a^c, \sigma^2 I)}{\partial \mu^c} = \frac{1}{\sigma^2}\, G(X_{k_c}^c - u_a^c, \sigma^2 I)\, (X_{k_c}^c - u_a^c)^T \frac{\partial u_a^c}{\partial \mu^c} = \frac{1}{\sigma^2}\, G(X_{k_c}^c - u_a^c, \sigma^2 I)\, (X_{k_c}^c - u_a^c)^T \frac{\partial f^c(v_a^c, \mu^c)}{\partial \mu^c}.   (4.59)
Therefore,

\frac{\partial JS}{\partial \mu^c} = -\frac{1}{M} \sum_{j=1}^{M} \frac{1}{\sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I)} \sum_{a=1}^{K} \frac{\partial G(X_j - u_a, \sigma^2 I)}{\partial \mu^c} + \frac{K^c}{n_c K} \sum_{k_c=1}^{n_c} \frac{1}{\sum_{a=1}^{K^c} G(X_{k_c}^c - u_a^c, \sigma^2 I)} \sum_{a=1}^{K^c} \frac{\partial G(X_{k_c}^c - u_a^c, \sigma^2 I)}{\partial \mu^c}
= -\frac{1}{M} \sum_{j=1}^{M} \frac{1}{\sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I)} \sum_{a=1}^{K^c} \frac{\partial G(X_j - u_a^c, \sigma^2 I)}{\partial \mu^c} + \frac{K^c}{n_c K} \sum_{k_c=1}^{n_c} \frac{1}{\sum_{a=1}^{K^c} G(X_{k_c}^c - u_a^c, \sigma^2 I)} \sum_{a=1}^{K^c} \frac{\partial G(X_{k_c}^c - u_a^c, \sigma^2 I)}{\partial \mu^c},   (4.60)

since only the K^c centers deformed by f^c depend on μ^c.
Starting from an estimation of the cluster centers V_c for each shape X_c using a clustering algorithm (see next chapter), the next step consists of performing gradient descent (or its conjugate version), setting Θ_0 to zeros. Either numerical or analytical gradient descent can be used, though the analytical version performs better with large deformations. Recall that the expressions for the gradient do not include the regularization term, which only depends on the nonrigid parameters. In any case, a re-sampling must be done every couple of iterations. After convergence, the deformed point sets are close to each other, and then it is possible to recover Z (the mean shape) from the optimal deformation parameters. For the sake of computational efficiency, two typical approaches may be taken (even together). The first one is working with optimization epochs, that is, finding the affine parameters before finding the nonrigid ones together with the re-estimation of the affine parameters. The second one is hierarchical optimization, especially appropriate for large sets of shapes, that is, large N. In this latter case, the data set is divided into m subsets, the algorithm is applied to each subset separately, and then the global atlas is found. It can be proved (see Prob. 4.9) that minimizing the JS divergence with respect to the complete set is equivalent to minimizing it with respect to the atlases obtained for each of the m subsets. The mathematical meaning is that the JS divergence is unbiased and, thus, the hierarchical approach is unbiased too. In general terms, the algorithm works pretty well for atlas construction (see Fig. 4.20), the regularization parameter λ being relatively stable in the range [0.0001, 0.0005].
Fig. 4.20. Registering deformations with respect to the atlas and the atlas itself. Point Set 1 to Point Set 7 show the point sets in their original state ('o') and after registration with the atlas ('+'). The final registration is shown at the bottom right. Figure by F. Wang, B.C. Vemuri, A. Rangarajan and S.J. Eisenschenk (2008 © IEEE).
The distributional shape model described above yields some degree of bypassing of the explicit computation of matching fields. Thus, when exploiting the JS divergence, the problem is posed in a parametric form (find the deformation parameters). Furthermore, a higher degree of bypassing can be reached if we exploit the achievements of information geometry. Information geometry deals with mathematical objects in statistics and information theory within the context of differential geometry [3, 5]. Differential geometry, in turn, studies structures like differentiable manifolds, that is, manifolds smooth enough to be differentiated on. Such manifolds are studied by Riemannian geometry and, thus, they are dubbed Riemannian manifolds. However, it may be a long shot to introduce them formally without any previous intuition. A concept closer to the reader is that of a statistical manifold, that is, a manifold of probability distributions, like the one-dimensional Gaussian distributions. Such manifolds are induced by the parameters characterizing a given family of distributions; in the case of one-dimensional Gaussians, the manifold is two-dimensional, with coordinates θ = (μ, σ). Then, if a point in the manifold is a given Gaussian p(x|μ, σ), we have
\ln p(x|\mu, \sigma) = -\ln \sqrt{2\pi} + \ln \frac{1}{\sigma} - \frac{(x - \mu)^2}{2\sigma^2},

\frac{\partial}{\partial \mu} \ln p(x|\theta) = \frac{x - \mu}{\sigma^2}, \qquad \frac{\partial}{\partial \sigma} \ln p(x|\theta) = -\frac{1}{\sigma} + \frac{(x - \mu)^2}{\sigma^3} = \frac{(x - \mu)^2 - \sigma^2}{\sigma^3}.   (4.62)

Therefore, we obtain

g(\mu, \sigma) = \begin{pmatrix} E\left[ \left( \frac{x - \mu}{\sigma^2} \right)^2 \right] & E\left[ \frac{(x - \mu)^3 - (x - \mu)\sigma^2}{\sigma^5} \right] \\ E\left[ \frac{(x - \mu)^3 - (x - \mu)\sigma^2}{\sigma^5} \right] & E\left[ \left( \frac{(x - \mu)^2 - \sigma^2}{\sigma^3} \right)^2 \right] \end{pmatrix}.   (4.63)
In the general case, g(θ), the so-called Fisher–Rao metric tensor, is symmetric and positive definite. But in what sense is g(θ) a metric? To answer this question we have to look at the differential geometry of the manifold, because this manifold is not necessarily flat as in Euclidean geometry. More precisely, we have to focus on the concept of tangent space. In differential geometry, the tangent space is defined at any point of the differentiable manifold and contains the directions (vectors) for traveling from that point. For instance, in the case of a sphere, the tangent space at a given point is the plane perpendicular to the sphere radius that touches this and only this point. Consider the sphere, with constant curvature (inverse to the radius), as a manifold (which typically has a smooth curvature). As we move from θ to θ + δθ, the tangent spaces change and so do the directions. Then, the infinitesimal curve length δs between the latter two points in the manifold is defined by the equation

\delta s^2 = \sum_{ij} g_{ij}(\theta)\, \delta\theta^i \delta\theta^j = (\delta\theta^T)\, g(\theta)\, \delta\theta,   (4.64)
which is exactly the Euclidean distance when g(·) is the identity matrix. This is why using the Euclidean distance between the parameters of two pdfs does not generally take into account the geometry of the subspace where these distributions are embedded. For the case of the Kullback–Leibler divergence, let us consider a one-dimensional variable x depending also on one parameter θ, and consider the Maclaurin expansion of D(p(θ + ε)||p(θ)) around ε = 0 (Eq. 4.65).
Therefore,

D''(\epsilon)\big|_{\epsilon=0} = \sum_x \left[ \frac{\partial p(x|\theta)}{\partial \theta} \frac{1}{p(x|\theta)} \frac{\partial p(x|\theta)}{\partial \theta} + \frac{1}{p(x|\theta)} \frac{\partial p(x|\theta)}{\partial \theta} \frac{\partial p(x|\theta)}{\partial \theta} \right]
= 2 \sum_x \left[ \frac{\partial \log p(x|\theta)}{\partial \theta} \right] \left[ \frac{\partial p(x|\theta)}{\partial \theta} \right]   (4.68)
= 2 \sum_x \left[ \frac{\partial \log p(x|\theta)}{\partial \theta} \right]^2 p(x|\theta) = 2\, \underbrace{E\left[ \frac{\partial \log p(x|\theta)}{\partial \theta} \right]^2}_{g(\theta)},
where g(θ) is the information that a random variable x carries about the parameter θ. More precisely, the quantity

V = \frac{\partial}{\partial \theta} \log p(x|\theta)   (4.69)

is called the score of a random variable. It is trivial to prove that its expectation EV is 0 and, thus, g = EV^2 = var(V) is the variance of the score. Formally, this is a way of quantifying the amount of information about θ in the data. A trivial example is the fact that the sample average \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i is an estimator T(x) of the mean θ of a Gaussian distribution. Moreover, \bar{x}
is also an unbiased estimator because the expected value of the error of the
estimator Eθ (T (x) − θ) is 0. Then, the Cramér–Rao inequality says that g(θ)
determines a lower bound of the mean-squared error var(T ) in estimating θ
from the data:
1
var(T ) ≥ (4.70)
g(θ)
The proof is trivial if one uses the Cauchy–Schwarz inequality over the product of the variances of V and T (see [43], p. 329). Thus, it is quite interesting that D''(ε)|_{ε=0} ∝ g(θ). Moreover, if we return to the Maclaurin expansion (Eq. 4.65), we have
D(p(\theta + \epsilon) \| p(\theta)) \approx \epsilon^2 g(\theta),   (4.71)

whereas in the multi-dimensional case we would obtain

D(p(\theta + \delta\theta) \| p(\theta)) \approx \frac{1}{2} (\delta\theta^T)\, g(\theta)\, \delta\theta,   (4.72)

which is pretty consistent with the definition of the squared infinitesimal arc length (see Eq. 4.64). Finally, the Cramér–Rao inequality for the multidimensional case is

\Sigma \geq g^{-1}(\theta),   (4.73)
Σ being the covariance matrix of a set of unbiased estimators for the param-
eter vector θ.
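The score/variance view of g(θ) is easy to check numerically. The sketch below estimates the Fisher information of the mean of a 1D Gaussian as the sample variance of the score and compares it with the analytic value 1/σ² (sample size and parameter values are ours):

import numpy as np

def score_mu(x, mu, sigma):
    """Score of a 1D Gaussian with respect to mu: d/dmu log p(x | mu, sigma)."""
    return (x - mu) / sigma ** 2

rng = np.random.default_rng(3)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, 200000)
V = score_mu(x, mu, sigma)
print('E[V]   ~', V.mean())     # close to 0
print('var(V) ~', V.var())      # close to g(mu) = 1 / sigma^2 = 0.25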
In the latter section, we have presented the Fisher–Rao metric tensor and its connection with the infinitesimal arc length in a Riemannian manifold. Now, as we want to travel from one probability distribution to another through the Riemannian manifold, we focus on the dynamics of the Fisher–Rao metric tensor [34]. The underlying motivation is to connect, as smoothly as possible, two consecutive tangent spaces on the way from the origin distribution toward the final distribution. Such a connection is the affine connection. In Euclidean space, as the minimal distance between two points is the straight line, there is no need to consider a change of tangents at each intermediate point, because this change of direction does not occur in Euclidean space. However, and quoting Prof. Amari in a recent videolecture [4], 'these two concepts (minimality of distance, and straightness) coincide in the Riemannian geodesic.' Thus, the geodesic is the curve sailing through the surface of the manifold whose tangent vectors remain parallel during such transportation, and it minimizes
E = \int_0^1 \sum_{ij} g_{ij}(\theta)\, \dot\theta^i \dot\theta^j\, dt = \int_0^1 \dot\theta^T g(\theta)\, \dot\theta\, dt,   (4.74)

\dot\theta^i = \frac{d\theta^i}{dt} being the components of the velocity and t ∈ [0, 1] the parameterization of the geodesic path θ(t). Of course, the geodesic also minimizes the path length s = \int_0^1 \sqrt{ \sum_{ij} g_{ij}(\theta)\, \dot\theta^i \dot\theta^j }\, dt. Geodesic computation is invariant to reparameterization and to coordinate transformations. Then, the geodesic can be obtained by minimizing E through the application of the Euler–Lagrange equations characterizing the affine connection,

\frac{\delta E}{\delta \theta^k} = -2 \sum_i g_{ki} \ddot\theta^i + \sum_{ij} \left( \frac{\partial g_{ij}}{\partial \theta^k} - \frac{\partial g_{ik}}{\partial \theta^j} - \frac{\partial g_{kj}}{\partial \theta^i} \right) \dot\theta^i \dot\theta^j = 0,   (4.75)

θ^k being the kth parameter. The above equations can be rewritten as a function of the Christoffel symbols of the affine connection,

\Gamma_{k,ij} = \frac{1}{2} \left( \frac{\partial g_{ik}}{\partial \theta^j} + \frac{\partial g_{kj}}{\partial \theta^i} - \frac{\partial g_{ij}}{\partial \theta^k} \right).   (4.76)
Example results are shown in Fig. 4.21. When the Fisher matrix is replaced by other metric tensors, like the α-order entropy metric tensors, for which closed-form solutions are available, the results are not as smooth as when the Fisher matrix is used. The α-order metric tensor, for α = 2, is defined as

g_{ij}^{\alpha}(\theta) = \int \frac{\partial p(x)}{\partial \theta_i} \frac{\partial p(x)}{\partial \theta_j}\, dx.   (4.80)
Fig. 4.21. Shape matching results with information geometry. Left: Fisher information (top) vs. α-entropy (bottom). Dashed lines are the initializations and solid ones are the final geodesics. Note that the Fisher geodesics are smoother. Right: table of distances comparing pairs of the shapes 1–9 presented in Fig. 4.19. Figure by A. Peter and A. Rangarajan (2006 © Springer).
E = \sum_{t=1}^{T} \dot\theta(t)^T g(\theta(t))\, \dot\theta(t),   (4.81)

\nabla E(\theta) = \frac{\delta E}{\delta \theta} \approx \sum_{j=1}^{P} \sum_{t=1}^{T} (\partial_{jt} E)\, V_j(t).   (4.83)
The gradient search is initialized with the linear path between the shapes and
proceeds by deforming the path until convergence.
Almost all the data sources used in this book, and particularly in the present chapter, have a vectorial origin. Extending the matching and learning techniques to the domain of graphs is a hard task, which is beginning to intersect with information theory. Let us consider here the simpler case of trees, where we have a hierarchy which, in addition, is herein assumed to be sampled correctly. Trees have been recognized as good representations for binary
Fig. 4.22. Examples of shock graphs extracted from similar shapes and the result of the matching algorithm. Figure by K. Siddiqi, A. Shokoufandeh, S.J. Dickinson and S.W. Zucker (1999 © Springer).
shapes, and tree versions of the skeleton concept lead to the shock tree [145]. A shock tree is rooted in the latest characteristic point (shock) found while computing the skeleton (see Fig. 4.22). Conceptually, the shock graph is built by reversing the grassfire transform: the deeper a point lies inside the shape, the higher its hierarchy in the tree. Roughly speaking, the tree is built as follows: (i) get the shock point and its branch in the skeleton; (ii) the children of the root are the shocks closest in time of formation and reachable by following a path along a branch; (iii) repeat, assuming that each child is a parent; (iv) the leaves of the tree are terminals without a specific shock label (that is, each earlier shock generates a unique child, a terminal node). Using a tree instead of a graph for representing a shape has clear computational advantages, because trees are easier to match and also to learn than graphs.
As we have noted above, for the sake of simplicity, sampling errors affect the nodes, but hierarchical relations are preserved. Each tree model T_c is defined by a set of nodes N_c, a tree order O_c, and the observation probabilities assigned to the nodes, grouped into the set Θ_c = {θ_{c,i} : i ∈ N_c}. Therefore, the probability of observing node i in the set of trees assigned to the model (class) c is denoted by θ_{c,i}. A simple assumption consists of considering that the nodes of t are independent Bernoulli samples from the model T_c. The independence assumption leads to the factorization of P(t|T_c). If t is an observation of T_c, we have

P(t|T_c) = \prod_{i \in N^t} \theta_{c,i} \prod_{j \in N_c \setminus N^t} (1 - \theta_{c,j}),   (4.85)
Assignment variables indicate which model generated each tree: z_{ct} = 1 if the tree t is observed from model T_c, and z_{ct} = 0 otherwise. This simplifies the cost of describing the data:

L(D|z, H, C) = -\sum_{t \in D} \sum_{c=1}^{k} z_{ct} \log P(t|T_c, C) = -\sum_{c=1}^{k} \sum_{t \in D_c} \log P(t|T_c, C),   (4.88)
being Dc = {t ∈ D : zct = 1}. On the other hand, the cost of describing the full
model H is built on three components. The first one comes from quantifying
the cost of encoding the observation probabilities Θ̂c :
L(\hat\Theta_c | z, H, C) = \sum_{c=1}^{k} \frac{n_c}{2} \log(m_c),   (4.89)
n_c being the number of nodes in the model T_c and m_c = \sum_{t \in D} z_{ct} the number of trees assigned to the model T_c. The second component comes from quantifying the cost of the mapping implemented by z:
L(z|H, C) = -\sum_{c=1}^{k} m_c \log \alpha_c.   (4.90)

The third component is the cost of describing the models themselves:

L(H|C) = g \sum_{c=1}^{k} n_c + \frac{k-1}{2} \log(m) + \text{const.},   (4.91)
where nc is the number of nodes in model Tc , and the second term with
m = |D| is the cost of describing the mixing parameters α. Dropping the
constant, the MDL in this context is given by
MDL(D, H|C)
k log(mc )
= − log P (t|Tc , C) + nc + g − mc log αc
c=1
2
t∈Dc
k−1
+ log(m) . (4.92)
2
If l_{c,i} is the number of trees in the data mapping a node to i, and m_c = |D_c| is the number of trees assigned to T_c, then the maximum likelihood estimate of the sampling probability under the Bernoulli model is θ_{c,i} = l_{c,i}/m_c. Then, we have

\sum_{t \in D_c} \log P(t|T_c, C) = \sum_{i \in N_c} m_c \left[ \frac{l_{c,i}}{m_c} \log \frac{l_{c,i}}{m_c} + \left( 1 - \frac{l_{c,i}}{m_c} \right) \log \left( 1 - \frac{l_{c,i}}{m_c} \right) \right] = \sum_{i \in N_c} m_c \left[ \theta_{c,i} \log \theta_{c,i} + (1 - \theta_{c,i}) \log(1 - \theta_{c,i}) \right] = -\sum_{i \in N_c} m_c H(\theta_{c,i}),   (4.93)
that is, the individual log-likelihood with respect to each model is given by the weighted entropies of the individual Bernoulli ML estimates corresponding to each node. Furthermore, we estimate the mixing proportions α_c as the observed frequency ratio, that is, α_c = m_c / m. Then, defining H(\alpha) = -\sum_{c=1}^{k} \alpha_c \log \alpha_c, we have the following resulting MDL criterion:

MDL(D, H|C) = \sum_{c=1}^{k} \left[ \sum_{i \in N_c} m_c H(\theta_{c,i}) + n_c \left( \frac{\log m_c}{2} + g \right) \right] + m H(\alpha) + \frac{k-1}{2} \log(m).   (4.94)
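Putting Eq. 4.94 (as written above) into code clarifies what has to be tracked while learning: for each model, the per-node counts l_{c,i}, the number m_c of trees assigned to it, and its number of nodes. A minimal sketch, with the per-node structural cost g left as a parameter and all names ours:

import numpy as np

def bernoulli_entropy(theta):
    theta = np.clip(theta, 1e-12, 1 - 1e-12)
    return -(theta * np.log(theta) + (1 - theta) * np.log(1 - theta))

def mdl_cost(models, m_total, g=1.0):
    """MDL of a mixture of tree models, following Eq. 4.94 above.

    models  : list of (l, m_c) pairs; l[i] = l_{c,i} is the number of trees
              assigned to model c that contain node i, and m_c = |D_c|
    m_total : total number of trees m = |D|
    g       : structural description cost per node
    """
    k = len(models)
    cost = 0.0
    alphas = []
    for l, m_c in models:
        l = np.asarray(l, dtype=float)
        theta = l / m_c                              # Bernoulli ML estimates
        cost += m_c * bernoulli_entropy(theta).sum()
        cost += len(l) * (0.5 * np.log(m_c) + g)     # n_c (log(m_c)/2 + g)
        alphas.append(m_c / m_total)
    alphas = np.asarray(alphas)
    cost += -m_total * np.sum(alphas * np.log(alphas))   # m H(alpha)
    cost += 0.5 * (k - 1) * np.log(m_total)
    return cost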
How can we learn k tree models from the N input trees in D so that the MDL cost is minimized? A quite efficient method is to pose the problem in terms of agglomerative clustering (see a good method for this in the next chapter). The process is illustrated in Fig. 4.23 (top). Start by having N clusters, that is, one cluster (class) per tree. This implies assigning a tree model per sample tree; in each such tree, each node has unit sampling probability. Then we create one mixture component per sample tree and compute the MDL cost for the complete model, assuming equiprobable mixing proportions. Then, proceed by taking all possible pairs of components and computing their tree unions. The mixing proportion of a union is equal to the sum of the proportions of the individual components. Regarding the sampling probabilities of merged nodes, let m_i = |D_i|
Given all the possibilities of merging two trees, choose the one which re-
duces MDL by the greatest amount. If all the proposed merges increase the
MDL, then the algorithm stops and does not accept any merging. If one of
the mergings succeeds, then we have to compute the costs of merging it with
the remaining components and proceed again to choose a new merging. This
process continues until no new merging can drop the MDL cost.
Estimating the edit distance, that is, the cost of the optimal sequence of edit operations over nodes and edges, is still an open problem in structural pattern matching. In [156], an interesting link between edit distance and MDL is established. Given the optimal correspondence C, the tree-edit distance between trees t and t' is given by

D_{edit}(t, t') = \sum_{u \notin \mathrm{dom}(C)} r_u + \sum_{v \notin \mathrm{ima}(C)} r_v + \sum_{(u,v) \in C} m_{uv},   (4.97)

where r_u and r_v are the costs of removing nodes u and v, respectively, whereas m_{uv} is the cost of matching nodes u and v. In terms of the sets of nodes of the trees, N^t and N^{t'}, the edit distance can be rewritten as

D_{edit}(t, t') = \sum_{u \in N^t} r_u + \sum_{v \in N^{t'}} r_v + \sum_{(u,v) \in C} (m_{uv} - r_u - r_v),   (4.98)
but what are the costs r_u, r_v, and m_{uv}? According to the MDL criterion, these costs are the following ones:

r_z = (m_t + m_{t'})\, H(\theta_z) + \frac{1}{2} \log(m_t + m_{t'}) + g, \quad z \in \{u, v\},
m_{uv} = (m_t + m_{t'})\, H(\theta_{uv}) + \frac{1}{2} \log(m_t + m_{t'}).   (4.99)
Therefore, edit costs are closely related to the information of the variables associated with the sampling probabilities. Shape clustering examples using these edit costs and spectral clustering are shown in Fig. 4.23 (bottom).
Fig. 4.23. Tree learning. Top: dynamic formation of the tree union; the darker the node, the higher its sampling probability. Bottom: results of shape clustering from (a) mixture of attributed trees; (b) weighted edit distance; and (c) union of attributed trees. Figure by A. Torsello and E.R. Hancock (2006 © IEEE).
Problems
4.1 Distributions in multidimensional spaces
Perform the following experiment using Octave, Matlab, or some other similar
tool. Firstly, generate 100 random points in a 1D space (i.e., 100 numbers).
Use integer values between 0 and 255. Calculate the histogram with 256 bins
and look at its values. Secondly, generate 100 random points in a 2D space
(i.e., 100 pairs of numbers). Calculate the 2D histogram with 256 bins and
look at the values. Repeat the experiment with 3D and then with some high
dimensions. High dimensions cannot be represented, but you can look at the
maximum and minimum values of the bins. What happens to the values of
the histogram? If entropy is calculated from the distributions you estimated,
what behavior would it present in a high-dimensional space? What would you
propose to deal with this problem?
4.2 Parzen window
Look at the formula of the Parzen window method (Eq. 4.8). Suppose we use in it a Gaussian kernel with some definite covariance matrix ψ and we estimate the distribution of the pixels of the image I (with 256 gray levels). Would the resulting distribution be the same if we previously smoothed the image with a Gaussian filter? Would it be possible to resize (reduce) the image in order to estimate a similar distribution? Discuss the differences. You
can perform experiments on natural images in order to notice the differences,
or you can try to develop analytical proofs.
4.3 Image alignment
Look at Fig. 4.2. It represents the values of two different measures for image alignment. The x and z axes represent the horizontal and vertical displacement of the image, and the vertical axis represents the value of the measure. It can be seen that mutual information yields smoother results than the normalized cross correlation. Suppose a simpler measure consists in the sum of pixel differences between the images I and I': \sum_x \sum_y |I_{x,y} - I'_{x,y}|. How would the plot of this measure look? Use Octave, Matlab, or some other tool to perform the experiments on a natural image. Try other simple measures and see which one would be more appropriate for image alignment.
4.4 Joint histograms
In Fig. 4.3, we show the classical way to build a joint histogram. For two
similar or equal images (or sets of samples), the joint histogram has high
values in its diagonal. Now suppose we have two sets of samples generated by
two independent random variables. (If two events are independent, their joint
probability is the product of the prior probabilities of each event occurring by
itself, P (A ∩ B) = P (A)P (B).) How would their joint histogram (see Fig. 4.3)
look?
4.5 The histogram-binning problem
Rajwade et al. proposed some methods for dealing with the histogram-binning problem. Their methods present a parameter Q, which quantizes the intensity levels. This parameter must not be confused with the number of bins in classical histograms. Explain the difference. Why is setting Q much more robust than setting the number of histogram bins? Think of cases where the classical histogram of two images would be similar to the area-based histogram. Think of other cases where the two kinds of histograms would present a significant difference.
4.6 Alternative metrics and pseudometrics
Figure 4.11 represents the concept of mutual information in a Venn diagram.
Draw another one for the ρ(W, X, Y, Z) measure of four variables and shade
the areas corresponding to that measure. Think of all the possible values it
could take. Make another representation for the conditional mutual informa-
tion.
4.7 Entropic graphs
Think of a data set formed by two distributions. If we gradually increase the
distance of these distributions until they get completely separated, the en-
tropy of the data set would also increase. If we continue separating more and
more the distributions, what happens to the entropy value for each one of the
following estimators? (a) Classical histograms plugin estimation; (b) Parzen
window; (c) minimal spanning tree graph-based estimation; and (d) k-nearest
neighbors graph-based estimation, for some fixed k.
4.8 Jensen–Rényi divergence
For the following categorical data sets: S1 = {1, 2, 1}, S2 = {1, 1, 3}, and S3 = {3, 3, 3, 2, 3, 3}, calculate their distributions and then their Jensen–Rényi divergence, giving S1 and S2 the same weight and S3 twice that weight.
4.9 Unbiased JS divergence for multiple registration
Prove that the JS divergence is unbiased for multiple registration of M shapes, that is, the JS divergence with respect to the complete set is equal to the JS divergence with respect to the m << N subsets into which the complete set is partitioned. A good clue for this task is to address the proof of JS_π(p_1, ..., p_N) − JS_β(S_1, ..., S_m) = 0, where the S_i are the subsets and β must be properly defined.
4.10 Fisher matrix of multinomial distributions
The Fisher information matrix of multinomial distributions is the key to spec-
ify the Fisher information matrix of a mixture of Gaussians [101]. Actually,
the parameters of the multinomial in this case are the mixing components
(which must form a convex combination – sum 1). Derive analytically the
Fisher matrix for that kind of distributions.
4.11 The α-order metric tensor of Gaussians and others
An alternative tensor to the Fisher one is the α-order metric one. When
α = 2, we obtain the expression in Eq. 4.80. Find an analytical expression of
this tensor for the Gaussian distribution. Do the same with the multinomial
distribution. Compare the obtained tensors with the Fisher–Rao ones.
4.12 Sampling from a tree model
Let Tc be a tree model consisting of one root node with two children. The
individual probability of the root is 1, the probability of the left child is 0.75
and that of the right child is 0.25 (probability is normalized at the same level).
Quantify the probability of observing all the possible subgraphs, including the
complete tree, by following the independent Bernoulli model (Fig. 4.22)
4.13 Computing the MDL of a tree merging process
Given the definition of MDL for trees, compute the sampling probabilities
and MDLs for all steps of the application of the algorithm to the example
illustrated in Fig. 4.22. Compute the MDL advantage in each iteration of the
algorithm.
4.14 Computing the MDL edit distance
Compute the MDL edit distance for all the matchings (mergings) illustrated
in Fig. 4.22.
4.15 Extending MDL for trees with weights
As we have explained throughout the chapter, it is possible to extend the purely structural framework to an attributed one, where attributes are assigned to nodes. Basically, this consists of modeling a weight distribution function, where the most relevant nodes have larger associated weights. From the simple assumption that weights are Gaussian distributed, the probability of a given weight can be defined as
P_w(w|\mu_{c,i}, \sigma_{c,i}) = \begin{cases} \frac{1}{\theta_{c,i}\, \sigma_{c,i} \sqrt{2\pi}}\, e^{ -\frac{1}{2} \frac{(w - \mu_{c,i})^2}{\sigma_{c,i}^2} } & \text{if } w \geq 0 \\ 0 & \text{otherwise,} \end{cases}

and the sampling probability θ_{c,i} is the integral of the distribution over positive weights:

\theta_{c,i} = \int_0^{\infty} \frac{1}{\sigma_{c,i} \sqrt{2\pi}}\, e^{ -\frac{1}{2} \frac{(w - \mu_{c,i})^2}{\sigma_{c,i}^2} }\, dw.
Given the latter definitions, modify the log-likelihood by multiplying the Bernoulli probability by P_w(·|·) only in the positive case (when we consider θ_{c,i}). Obtain an expression for the MDL criterion under these new conditions. Hints: consider both the change of the observation probability and the addition of two new variables per node (which specify the Gaussian for its weight).
5.1 Introduction
Clustering, or grouping samples which share similar features, is a recurrent
problem in computer vision and pattern recognition. The core element of a
clustering algorithm is the similarity measure. In this regard information the-
ory offers a wide range of measures (not always metrics) which inspire clus-
tering algorithms through their optimization. In addition, information theory
also provides both theoretical frameworks and principles to formulate the clus-
tering problem and provide effective algorithms. Clustering is closely related
to the segmentation problem, already presented in Chapter 3. In both prob-
lems, finding the optimal number of clusters or regions is a challenging task.
In the present chapter we cover this question in depth. To that end we explore
several criteria for model order selection.
All the latter concepts are developed through the description and discus-
sion of several information theoretic clustering algorithms: Gaussian mixtures,
Information Bottleneck, Robust Information Clustering (RIC) and IT-based
Mean Shift. At the end of the chapter we also discuss basic strategies to form
clustering ensembles.
In nonparametric approaches, kernels placed on the samples are used [81], while mixture models have a set of parameters which can be adjusted in a formal way. The estimation of these parameters yields a clustering, provided that each sample is assigned to some kernel of the set of kernels in the mixture. In Bayesian supervised learning, mixture models are used for the representation of class-conditional probability distributions [72] and for Bayesian parameter estimation [46].
A finite mixture model with K kernels has the form

p(y|\Theta) = \sum_{k=1}^{K} \pi_k\, p(y|\Theta_k),   (5.1)

where K is the number of kernels, π_1, ..., π_K are the a priori probabilities of each kernel, and Θ_k are the parameters describing each kernel. In Gaussian mixtures these parameters are the means and the covariance matrices, Θ_k = {μ_k, Σ_k}. In Fig. 5.1 we show an example of 1D data following a Gaussian mixture distribution formed by K = 3 kernels, each with different parameters.
Thus, the whole set of parameters of a mixture is Θ ≡ {Θ_1, ..., Θ_K, π_1, ..., π_K}. Obtaining the optimal set of parameters Θ* is usually posed in terms of maximizing the log-likelihood of the pdf to be estimated:

\ell(Y|\Theta) = \log p(Y|\Theta) = \sum_{n=1}^{N} \log p(y_n|\Theta) = \sum_{n=1}^{N} \log \sum_{k=1}^{K} \pi_k\, p(y_n|\Theta_k),   (5.2)

\Theta^*_{ML} = \arg\max_{\Theta} \ell(\Theta).   (5.3)
[Fig. 5.1. Example of 1D data following a Gaussian mixture distribution with K = 3 kernels.]
This maximization cannot be performed analytically. The same happens with the Bayesian maximum a posteriori (MAP) criterion

\Theta^*_{MAP} = \arg\max_{\Theta} \left( \ell(\Theta) + \log p(\Theta) \right).   (5.4)

In this case, other methods are needed, like expectation maximization (EM) or Markov-chain Monte Carlo algorithms.
The EM algorithm introduces hidden assignment variables z_{kn} and works with the complete-data log-likelihood

\log p(Y, Z|\Theta) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{kn} \log \left[ \pi_k\, p(y_n|\Theta_k) \right].   (5.5)
Fig. 5.2. The classical EM is sensitive to the random initialization: the algorithm converges to different local optima in different executions.
E Step:

p(k|y_n) = \frac{\pi_k\, p(y_n|k)}{\sum_{j=1}^{K} \pi_j\, p(y_n|j)}.   (5.7)

M Step:

\pi_k = \frac{1}{N} \sum_{n=1}^{N} p(k|y_n),   (5.8)

\mu_k = \frac{\sum_{n=1}^{N} p(k|y_n)\, y_n}{\sum_{n=1}^{N} p(k|y_n)},   (5.9)

\Sigma_k = \frac{\sum_{n=1}^{N} p(k|y_n)\, (y_n - \mu_k)(y_n - \mu_k)^T}{\sum_{n=1}^{N} p(k|y_n)},   (5.10)
which does not ensure that the pdfs of the data are properly estimated. In
addition, the algorithm requires that the number of elements (kernels) in the
mixture is known beforehand. A maximum-likelihood criterion with respect
to the number of kernels is not useful because it tends to fit one kernel for
each sample.
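The E and M steps of Eqs. 5.7-5.10 translate directly into a few lines of NumPy/SciPy. This is a bare-bones sketch with K fixed and only a small regularization against degenerate covariances; the function name and defaults are ours.

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(Y, K, iters=100, seed=0):
    """Fit a K-kernel Gaussian mixture to Y (N x d) with plain EM (Eqs. 5.7-5.10)."""
    rng = np.random.default_rng(seed)
    N, d = Y.shape
    mu = Y[rng.choice(N, K, replace=False)]                     # random init
    Sigma = np.array([np.cov(Y.T) + 1e-6 * np.eye(d) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E step: responsibilities p(k | y_n)   (Eq. 5.7)
        R = np.column_stack([pi[k] * multivariate_normal.pdf(Y, mu[k], Sigma[k])
                             for k in range(K)])
        R /= R.sum(axis=1, keepdims=True)
        # M step: update pi, mu and Sigma       (Eqs. 5.8-5.10)
        Nk = R.sum(axis=0)
        pi = Nk / N
        mu = (R.T @ Y) / Nk[:, None]
        for k in range(K):
            diff = Y - mu[k]
            Sigma[k] = (R[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        # (the log-likelihood of Eq. 5.2 could be monitored as a stopping criterion)
    return pi, mu, Sigma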
The model order selection problem consists of finding the most appropriate
number of clusters in a clustering problem or the number of kernels in the
mixture. The number of kernels K is unknown beforehand and cannot be
estimated through maximizing the log-likelihood because (Θ) grows with K.
In addition, a wrong K may drive the EM toward an erroneous estimation. In
Fig. 5.3 we show the EM result on the parameters of a mixture with K = 1
describing two Gaussian distributions.
The model order selection problem has been addressed in different ways. There are algorithms which start with a small number of kernels and add new kernels when necessary. Some measure has to be used in order to detect when a kernel does not fit the data well, so that a new kernel can be added. For example, in [172] a kernel is split or not depending on the kurtosis, used as a measure of non-Gaussianity. Other model order selection methods start with a high number of kernels and fuse some of them. In [55, 56, 59] the EM algorithm is initialized with many kernels randomly placed, and then the minimum description length (MDL) principle [140] is used to iteratively remove some of the kernels until the optimal number of them is achieved. Some other approaches are used both to split and to fuse kernels. In [176] a general statistical learning framework called the Bayesian Ying-Yang system is proposed, which is suitable for model selection and unsupervised learning of finite mixtures. Other approaches combine EM and genetic algorithms for learning mixtures, using an MDL criterion for finding the best K.
Fig. 5.3. Two Gaussians with means μ_1 = [0, 0] and μ_2 = [3, 2] (left) are erroneously described by a unique kernel with μ = [1.5, 1] (right). Figure by Peñalver et al. [116] (2009 © IEEE).
Recall that for a discrete variable Y = {y_1, ..., y_N} with N values, the entropy is defined as

H(Y) = -E_Y[\log P(Y)] = -\sum_{i=1}^{N} P(Y = y_i) \log P(Y = y_i).   (5.11)
The comparison of the Gaussian entropy with the entropy of the underly-
ing data is the criterion which EBEM uses for deciding whether a kernel is
Gaussian or it should be replaced by other kernels which fit better the data.
In order to evaluate the Gaussianity criterion, the Shannon entropy of the data has to be estimated. Several approaches to Shannon entropy estimation have been studied in the past. We can broadly classify them into methods which first estimate the density function and methods which bypass this step and directly estimate the entropy from the set of samples. Among the methods
which estimate the density function, also known as “plug-in” entropy estima-
tors [13], there is the well-known “Parzen windows” estimation. Most of the
current nonparametric entropy and divergence estimators belong to the “plug-
in” methods. They have several limitations. On the one hand the density es-
timator performance is poor without smoothness conditions. The estimations
have high variance and are very sensitive to outliers. On the other hand, the
estimation in high-dimensional spaces is difficult, due to the exponentially in-
creasing sparseness of the data. For this reason, entropy has traditionally been
evaluated in one (1D) or two dimensional (2D) data. For example, in image
analysis, traditionally gray scale images have been used. However, there are
datasets whose patterns are defined by thousands of dimensions.
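As a concrete illustration of the plug-in route, the following minimal sketch estimates the Shannon entropy by first building a Parzen (Gaussian kernel) density estimate and then averaging −log p̂ over the samples. It is only an illustration of the idea, not the EBEM code; the bandwidth value and the leave-one-out averaging are assumptions made here.

    import numpy as np

    def parzen_entropy(samples, sigma=0.5):
        """Plug-in Shannon entropy estimate: H ~ -(1/N) sum_i log p_hat(x_i),
        where p_hat is a Gaussian Parzen-window density (leave-one-out)."""
        X = np.asarray(samples, dtype=float)
        N, d = X.shape
        # Pairwise squared distances between samples.
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-sq / (2.0 * sigma ** 2)) / ((2.0 * np.pi * sigma ** 2) ** (d / 2))
        np.fill_diagonal(K, 0.0)              # leave-one-out to reduce bias
        p_hat = K.sum(axis=1) / (N - 1)
        return -np.mean(np.log(p_hat + 1e-300))

    # Example: 2D Gaussian samples; the estimate should be close to log(2*pi*e).
    rng = np.random.default_rng(0)
    print(parzen_entropy(rng.normal(size=(500, 2))))

The well-known weaknesses listed above (bandwidth sensitivity, outliers, high-dimensional sparseness) are all visible in this estimator, which motivates the bypass methods discussed next.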
The “nonplugin” estimation methods offer a state-of-the-art [103] alternative for entropy estimation: they estimate the entropy directly from the data set. This approach allows us to estimate entropy from data sets with an arbitrarily high number of dimensions. In image analysis and pattern recognition the work of Hero and Michel [74] is widely used for Rényi entropy estimation. Their methods are based on entropic spanning graphs, for example, minimal spanning trees (MSTs) or k-nearest neighbor graphs. A drawback of these methods for the EBEM algorithm is that entropic spanning graphs do not estimate the Shannon entropy directly. Peñalver et al. [116] develop a method for approximating the value of the Shannon entropy from the estimation of Rényi's α-entropy, as explained in the following subsection.
In Chapter 4 (in Section 4.3.6) we explained how Michel and Hero estimate Rényi's α-entropy from a minimum spanning tree (MST). Equation 4.32 showed how to obtain the Rényi entropy of order α, that is, Hα(Xn), directly from the samples Xn, by means of the length of the MST. There is a discontinuity at α = 1 which does not allow us to use this value of α in the expression of the Rényi entropy, H_\alpha(f) = \frac{1}{1-\alpha} \log \int_z f^\alpha(z)\, dz: for α = 1 there is a division by zero in 1/(1 − α). The same happens in the expression of the MST approximation (Eq. 4.32). However, when α → 1, the α-entropy converges to the Shannon entropy.
The limit can be calculated using L'Hôpital's rule. Let f(z) be a pdf of z. Its Rényi entropy in the limit α → 1 is

\lim_{\alpha \to 1} H_\alpha(f) = \lim_{\alpha \to 1} \frac{\log \int_z f^\alpha(z)\, dz}{1 - \alpha}.

At α = 1 we have \log \int_z f(z)\, dz = \log 1 = 0 (note that f(z) is a pdf, so its integral over z is 1). This, divided by 1 − α = 0, is an indetermination of the type 0/0. By L'Hôpital's rule,

\lim_{x \to c} \frac{g(x)}{h(x)} = \lim_{x \to c} \frac{g'(x)}{h'(x)},

when the latter limit exists. The derivative of the divisor is \frac{\partial}{\partial \alpha}(1 - \alpha) = -1, and the derivative of the dividend is

\frac{\partial}{\partial \alpha} \log \int_z f^\alpha(z)\, dz = \frac{1}{\int_z f^\alpha(z)\, dz}\, \frac{\partial}{\partial \alpha} \int_z f^\alpha(z)\, dz = \frac{1}{\int_z f^\alpha(z)\, dz} \int_z f^\alpha(z) \log f(z)\, dz.

The first factor goes to 1 in the limit because f(z) is a pdf, so

\lim_{\alpha \to 1} H_\alpha(f) = \lim_{\alpha \to 1} \left( - \frac{\int_z f^\alpha(z) \log f(z)\, dz}{\int_z f^\alpha(z)\, dz} \right) = - \int_z f(z) \log f(z)\, dz \equiv H(f).
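The quantity entering the Hero–Michel estimator is the length of the MST built over the samples with edge weights |e|^γ, γ = d(1 − α). The sketch below computes this length with SciPy; the additive bias constant and the exact normalization of Eq. 4.32 are not reproduced here, so the constant is left as a parameter of the illustration.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from scipy.sparse.csgraph import minimum_spanning_tree

    def mst_renyi_entropy(X, alpha=0.95, beta_const=0.0):
        """Hero-Michel style Renyi alpha-entropy estimate from the MST length.
        beta_const stands for the bias constant of Eq. 4.32 (depends only on
        gamma and d) and is treated here as a user-supplied parameter."""
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        gamma = d * (1.0 - alpha)            # edge-weight exponent
        W = squareform(pdist(X)) ** gamma    # weighted complete graph
        L = minimum_spanning_tree(W).sum()   # total MST length with |e|^gamma weights
        return (np.log(L / n ** alpha) - beta_const) / (1.0 - alpha)

    rng = np.random.default_rng(1)
    print(mst_renyi_entropy(rng.normal(size=(400, 2)), alpha=0.9))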
Fig. 5.4. Left: Hα for Gaussian distributions with different covariance matrices.
Right: α∗ for dimensions between 2 and 5 and different number of samples.
Fig. 5.5. The line tangent to Hα at the point α∗ gives the approximated Shannon entropy value at α = 1.
For any point of this function, a tangent straight line y = mx + b can be calculated. This tangent is a continuous function and gives a value at its intersection with α = 1, as shown in Fig. 5.5. Only one of all the possible tangent lines gives a correct estimation of the Shannon entropy at α = 1; let us say that this line is tangent to the function at some point α∗. Then, if we know the correct α∗, we can obtain the Shannon entropy estimation. As Hα is a monotonically decreasing function, the α∗ value can be estimated by means of a dichotomic search between two well-separated values of α, for a constant number of samples and dimensions.
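The tangent-line extrapolation itself is simple, as the following sketch shows: the Rényi entropy is estimated (for instance with the MST estimator sketched above) at α∗ and at a nearby value to obtain the slope, and the tangent is evaluated at α = 1. The finite-difference slope and the particular α∗ value are assumptions of this illustration, not the calibrated values of [116].

    def shannon_from_tangent(X, alpha_star, entropy_fn, eps=1e-3):
        """Approximate the Shannon entropy as the value at alpha = 1 of the line
        tangent to H_alpha at alpha_star (slope from a finite difference)."""
        h0 = entropy_fn(X, alpha_star)
        h1 = entropy_fn(X, alpha_star + eps)
        slope = (h1 - h0) / eps
        return h0 + slope * (1.0 - alpha_star)

    # Example, reusing the MST-based estimator sketched earlier:
    # print(shannon_from_tangent(samples, alpha_star=0.98, entropy_fn=mst_renyi_entropy))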
It has been experimentally verified that α∗ is almost constant for diago-
nal covariance matrices with variance greater than 0.5, as shown in Fig. 5.4
(right). Figure 5.6 shows the estimation of α∗ for pdfs with different covariance
matrices and 400 samples.
Then, the optimal α∗ depends on the number of dimensions D and the number of samples N. This dependence can be modeled as

\alpha^* = 1 - \frac{a + b\, e^{cD}}{N}, \qquad (5.14)

whose values a, b, and c have been calibrated on a set of 1,000 distributions with random dimension 2 ≤ D ≤ 5 and varying numbers of samples.
Fig. 5.6. α∗ in 2D for different covariance values and 400 samples. The value remains almost constant for variances greater than 0.5.
It can be said that the EBEM algorithm performs a model order selection,
even though it has to be tuned with the Gaussianity deficiency threshold.
This threshold is a parameter which does not fix the order of the model. It
determines the degree of fitness of the model to the data, and with a fixed
threshold different model orders can result, depending on the data. However,
it might be argued that there is still a parameter to be set, so the model order selection problem is not completely solved. If no parameter is to be set manually, the minimum description length principle can be used.
The minimum description length principle (see Chapter 3) chooses from
a set of models the representation which can be expressed with the shortest
possible message. The optimal code-length for each parameter is 1/2 log n,
asymptotically for large n, as shown in [140]. Then, the model order selection
criterion is defined as:
C_{MDL}(\Theta^{(k)}, k) = -L(\Theta^{(k)}, y) + \frac{N(k)}{2} \log n, \qquad (5.15)
where the first term is the log-likelihood and the second term penalizes an ex-
cessive number of components, N (k) being the number of parameters required
to define a mixture of k kernels.
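As a small illustration of Eq. 5.15, the sketch below evaluates the MDL score of mixtures fitted with different numbers of kernels and keeps the one with the lowest score. It relies on scikit-learn's GaussianMixture as a generic EM implementation, which is an assumption of this illustration only, not the EBEM algorithm described here.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def mdl_score(gmm, X):
        """C_MDL = -log-likelihood + (N(k)/2) log n  (Eq. 5.15)."""
        n, d = X.shape
        k = gmm.n_components
        n_params = (k - 1) + k * d + k * d * (d + 1) // 2   # weights, means, covariances
        log_lik = gmm.score(X) * n                          # score() is the mean log-likelihood
        return -log_lik + 0.5 * n_params * np.log(n)

    def select_order_mdl(X, k_max=6):
        fits = [GaussianMixture(k, random_state=0).fit(X) for k in range(1, k_max + 1)]
        return min(fits, key=lambda g: mdl_score(g, X))

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal([0, 0], 1, (200, 2)), rng.normal([4, 3], 1, (200, 2))])
    print(select_order_mdl(X).n_components)   # typically 2 for this synthetic data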
When the kernel K∗, the one with the lowest Gaussianity, has to be split into two kernels K1 and K2 (components of the mixture), their parameters Θk1 = (μk1, Σk1) and Θk2 = (μk2, Σk2) have to be set. The new covariance matrices are subject to two restrictions: they must be positive definite, and the overall dispersion must remain almost constant:
\pi_* = \pi_1 + \pi_2
\pi_* \mu_* = \pi_1 \mu_1 + \pi_2 \mu_2 \qquad (5.16)
\pi_* (\Sigma_* + \mu_* \mu_*^T) = \pi_1 (\Sigma_1 + \mu_1 \mu_1^T) + \pi_2 (\Sigma_2 + \mu_2 \mu_2^T)

These constraints involve more unknown variables than equations. In [48] a spectral decomposition of the current covariance matrix is performed, and the eigenvalues and eigenvectors of the new covariance matrices are estimated from it. Let Σ∗ = V∗ Λ∗ V∗^T be the spectral decomposition of the covariance matrix Σ∗, with Λ∗ = diag(λ∗^1, ..., λ∗^d) a diagonal matrix containing the eigenvalues of Σ∗ in increasing order, ∗ denoting the component with the lowest entropy ratio, π∗, π1, π2 the priors of the original and the new components, μ∗, μ1, μ2 the means, and Σ∗, Σ1, Σ2 the covariance matrices. Let D also be a d × d rotation matrix whose columns are orthonormal unit vectors; D is constructed by generating its lower triangular part independently from d(d − 1)/2 different uniform U(0, 1) densities. With u1, u2, u3 auxiliary random variables and ι the d-dimensional vector of ones, the proposed split operation is given by

\pi_1 = u_1 \pi_*
\pi_2 = (1 - u_1)\, \pi_*
\mu_1 = \mu_* - \Big( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_*^i}\; V_*^i \Big) \sqrt{\pi_2 / \pi_1}
\mu_2 = \mu_* + \Big( \sum_{i=1}^{d} u_2^i \sqrt{\lambda_*^i}\; V_*^i \Big) \sqrt{\pi_1 / \pi_2} \qquad (5.17)
\Lambda_1 = \mathrm{diag}(u_3)\, \mathrm{diag}(\iota - u_2)\, \mathrm{diag}(\iota + u_2)\, \Lambda_*\, \pi_*/\pi_1
\Lambda_2 = \mathrm{diag}(\iota - u_3)\, \mathrm{diag}(\iota - u_2)\, \mathrm{diag}(\iota + u_2)\, \Lambda_*\, \pi_*/\pi_2
V_1 = D V_*
V_2 = D^T V_*
Fig. 5.7. Two-dimensional example of splitting one kernel into two new kernels.
5.3.5 Experiments
Fig. 5.8. Evolution of the EBEM algorithm from one to five final kernels. After iterating, the algorithm finds the correct number of kernels. Bottom-right: evolution of the cost function with the MDL criterion.
Since the algorithm only splits kernels, there is no need to join kernels again; thus, convergence is achieved in very few iterations. In Fig. 5.8 we can see the evolution of the algorithm on synthetic data.
The Gaussianity threshold was set to 0.95 and the convergence threshold of
the EM algorithm was set to 0.001.
Another example is shown in Fig. 5.9. It is a difficult case because there
are some overlapping distributions. This experiment is a comparison with the
MML-based method described in [56]. In [56] the algorithm starts with 20
kernels randomly initialized and finds the correct number of components after
200 iterations. EBEM starts with only one kernel and also finds the correct
number of components, with fewer iterations. Figure 5.9 shows the evolution
of EBEM in this example. Another advantage of the EBEM algorithm is that
it does not need a random initialization.
Finally we also show some color image segmentation results. Color segmen-
tation can be formulated as a clustering problem. In the following experiment
the RGB color space information is taken. Each pixel is regarded to as a sam-
ple defined by three dimensions, and no position or vicinity information is
used. After convergence of the EBEM we obtain the estimated number K of
color classes, as well as the membership of the individual pixels to the classes.
The data come from natural images with sizes of 189×189 pixels. Some results
are represented in Fig. 5.10.
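To make the formulation concrete, the sketch below clusters the RGB values of an image with a Gaussian mixture and repaints each pixel with the mean color of its component. It uses scikit-learn's GaussianMixture with a fixed K as a stand-in for EBEM (which would also select K), so the library call and the fixed order are assumptions of this illustration.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def color_segmentation(image, k):
        """Cluster RGB pixels (no spatial information) and return the image
        repainted with the mean color of the component each pixel belongs to."""
        h, w, _ = image.shape
        pixels = image.reshape(-1, 3).astype(float)
        gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=0)
        labels = gmm.fit_predict(pixels)
        repainted = gmm.means_[labels]            # each pixel -> mean color of its class
        return repainted.reshape(h, w, 3).astype(np.uint8)

    # Usage (hypothetical image loaded as an H x W x 3 uint8 array):
    # segmented = color_segmentation(img, k=5)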
Fig. 5.9. Fitting a Gaussian mixture with overlapping kernels. The algorithm starts
with one component and selects the order of the model correctly.
Fig. 5.10. Color image segmentation results. Original images (first column) and
color image segmentation with different Gaussianity deficiency levels (second and
third columns). See Color Plates. (Courtesy of A. Peñalver.)
(Figure: the rate I(T;X) plotted against the expected distortion Ed(X,T); the distortion constraint D selects a point on the rate–distortion curve.)
• The complexity of the model, I(T; X), or rate; that is, the information that a representative gives about the data in a cluster.
• The distortion, E_d(X, T) = \sum_{i,j} p(x_i)\, p(t_j|x_i)\, d(x_i, t_j); that is, the distance from the data to their representation.
This trade-off can be solved by minimizing a functional of the form I(T; X) + β E_d(X, T), where β is a Lagrange multiplier that weights the distortion term.
The Blahut–Arimoto clustering algorithm minimizes this functional by alternately updating p(t|x) and p(t), given a distance function d(x, t) between a data sample and its representative. The algorithm runs until convergence is achieved. Several examples are given in Fig. 5.12. This algorithm guarantees that the global minimum is reached and obtains a compact clustering with respect to the minimal expected distortion, defined by the parameter β. The main problem is that it requires two input parameters: β and the number of clusters, which must be fixed. However, as can be seen in Fig. 5.12, depending on the distortion the algorithm may yield p(t) = 0 for one or more clusters; that is, this input parameter actually indicates the maximum number of clusters. Slow convergence for low β values is another drawback of this algorithm. Finally, the result is conditioned by the initial random initialization of p(t).
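A minimal sketch of this kind of Blahut–Arimoto iteration is given below, assuming a squared Euclidean distortion and fixed representatives; the update order and the convergence test are illustration choices, not the exact formulation used above. The sample coordinates reuse the data of the exercise at the end of the chapter.

    import numpy as np

    def blahut_arimoto_clustering(X, T, beta, n_iter=200, tol=1e-8):
        """Alternating updates for rate-distortion clustering with fixed
        representatives T: p(t|x) proportional to p(t) exp(-beta d(x,t)),
        then p(t) = sum_x p(x) p(t|x), with d the squared Euclidean distance."""
        n, k = len(X), len(T)
        px = np.full(n, 1.0 / n)                          # uniform sample prior
        pt = np.full(k, 1.0 / k)                          # initial cluster prior
        d = ((X[:, None, :] - T[None, :, :]) ** 2).sum(axis=2)
        for _ in range(n_iter):
            pt_x = pt[None, :] * np.exp(-beta * d)        # unnormalized p(t|x)
            pt_x /= pt_x.sum(axis=1, keepdims=True)
            new_pt = px @ pt_x
            converged = np.abs(new_pt - pt).max() < tol
            pt = new_pt
            if converged:
                break
        rate = np.sum(px[:, None] * pt_x *
                      np.log2((pt_x + 1e-300) / np.maximum(pt[None, :], 1e-300)))
        distortion = np.sum(px[:, None] * pt_x * d)
        return pt_x, pt, rate, distortion

    X = np.array([[2, 4], [5.1, 0], [9.9, 1.7], [12.3, 5.2]], float)
    T = np.array([[5, 2.5], [8.5, 3.5]], float)
    print(blahut_arimoto_clustering(X, T, beta=0.5)[1])   # cluster priors p(t)

Large β weights the distortion heavily and produces nearly hard assignments; small β favors compression, and some clusters may end up with p(t) close to zero, as noted above.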
The main drawback of the Blahut–Arimoto clustering algorithm, and of Rate Distortion Theory, is the need to first define a distortion function in order to compare data samples and representatives. In our previous example, this distortion was computed as the Euclidean distance between the sample and its corresponding representative. However, in complex problems, the important features of the input signal that define the distortion function may not be explicitly known. Instead, we may be provided with an additional random variable that helps to identify what the relevant information in the data samples is with respect to this new variable, so that we avoid losing too much of this information when clustering the data. One example is the text categorization problem. Given a set of words x ∈ X and a set of text categories y ∈ Y, the objective is to classify a text as belonging to a category y_i depending on the subset of words of X present in the document and their frequency. A common approach to this problem begins with the splitting of the set X into t ∈ T clusters, and continues with a learning step that builds document category models from these clusters; then, any new document is categorized from the
frequency histogram that represents the amount of each cluster that is present
in this document. In this case X and Y are not independent variables, meaning
that I(X; Y ) > 0.
Fig. 5.14. Plot of I(T; X) vs. I(T; Y) for three different values of the cardinality of T: |T| = 3, |T| = 13 and |T| = 23, for the input data in Fig. 5.15. A part of this plot is zoomed in order to show that when the plots corresponding to different T cardinalities arrive at a specific value, they diverge. By altering the β parameter we move along any of these convex curves, depending on the growth rate given in the text. This fact suggests that a deterministic annealing approach could be applied.
in Fig. 5.15. The plot shows that the growth rate of the curve is different
depending on the β parameter. Specifically, this ratio is given by
\frac{\delta I(T; Y)}{\delta I(X; T)} = \beta^{-1} > 0 \qquad (5.22)
There is a different convex curve for each different cardinality of the set
T . Varying the value of β, we may move through any of these convex curves
on the plane Ix Iy . This fact suggests that a deterministic annealing (DA)
approach could be applied in order to find an optimal clustering.
Fig. 5.15. Input data used to compute the plots in Figs. 5.14 and 5.16. Given two
classes Y = {y1 , y2 }, the shown distribution represents p(X, Y = y1 ) (from which
p(X, Y ) can be estimated).
Before dealing with the AIB algorithm, some concepts must be introduced:
• The merge prior distribution of a merged cluster t' = {t_1, ..., t_k} is Π_k ≡ (Π_1, Π_2, ..., Π_k), where Π_i is the a priori probability of t_i within t': Π_i = p(t_i)/p(t').
• The decrease of information in I_y = I(T; Y) due to a merge is

\delta I_y(t_1, \ldots, t_k) = I(T_{\mathrm{before}}; Y) - I(T_{\mathrm{after}}; Y), \qquad (5.27)

where T_before and T_after denote the partition before and after the merge.
Main loop:
for t = 1 ... (N − 1) do
  Find {α, β} = argmin_{i,j} {d_{i,j}}.
  Merge {z_α, z_β} ⇒ t':
    p(t') = p(z_α) + p(z_β)
    p(y|t') = (1/p(t')) (p(z_α, y) + p(z_β, y)) for every y ∈ Y
    p(t'|x) = 1 if x ∈ z_α ∪ z_β and 0 otherwise, for every x ∈ X
  Update T = {T − {z_α, z_β}} ∪ {t'}.
  Update the d_{i,j} costs w.r.t. t', only for pairs that contain z_α or z_β.
end
Output: T_m: partition of X into m clusters, for every 1 ≤ m ≤ N
At each iteration, a greedy decision selects the set of clusters to join that minimizes δI_y(t_1, ..., t_k), always taking k = 2 (pairs of clusters). Clusters are joined in pairs due to a property of the information decrease: δI_y(t_1, ..., t_k) ≤ δI_y(t_1, ..., t_{k+1}) and δI_x(t_1, ..., t_k) ≤ δI_x(t_1, ..., t_{k+1}) for all k ≥ 2. The meaning of this property is that any cluster union (t_1, ..., t_k) ⇒ t' can be built as (k − 1) consecutive unions of cluster pairs; for 1 ≤ m ≤ |X|, the optimal partition may be found from (|X| − m) consecutive unions of cluster pairs. Therefore, the loss of information δI_y must be computed for each cluster pair in T_m in order to select the cluster pairs to join. The algorithm finishes when there is only one cluster left. The result may be expressed as a tree from which the cluster sets T_m may be inferred for any m = |X|, |X| − 1, ..., 1 (see Fig. 5.16).
In each loop iteration the best pair to merge must be found; thus, the complexity is O(m|Y|) for each cluster pair union. However, this complexity may be decreased to O(|Y|) if the mutual information loss due to the merging of a pair of clusters is estimated directly by means of one of the properties enumerated above: δI_y(t_1, ..., t_k) = p(t') JS_{Π_k}[p(Y|t_1), ..., p(Y|t_k)], where JS denotes the Jensen–Shannon divergence. Another improvement is the precalculation of the cost of merging any pair of clusters. Then, when two clusters t_i and t_j are joined, this cost must be updated only for pairs that include one of these two clusters.
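The sketch below computes the pairwise merge cost used above, d_{i,j} = (p(t_i) + p(t_j)) · JS_Π[p(Y|t_i), p(Y|t_j)], and performs one greedy merge. It is a minimal sketch of the pairwise case only, with the clusters stored as plain arrays rather than the tree structure of the full algorithm.

    import numpy as np

    def js_divergence(p1, p2, pi1, pi2):
        """Jensen-Shannon divergence of two distributions with weights pi1, pi2."""
        def H(p):
            p = p[p > 0]
            return -np.sum(p * np.log2(p))
        return H(pi1 * p1 + pi2 * p2) - pi1 * H(p1) - pi2 * H(p2)

    def merge_cost(p_t, p_y_given_t, i, j):
        """d_ij = (p(t_i) + p(t_j)) * JS_Pi[p(Y|t_i), p(Y|t_j)]: loss in I(T;Y)."""
        p_merged = p_t[i] + p_t[j]
        pi1, pi2 = p_t[i] / p_merged, p_t[j] / p_merged
        return p_merged * js_divergence(p_y_given_t[i], p_y_given_t[j], pi1, pi2)

    def greedy_merge_step(p_t, p_y_given_t):
        """Find and apply the cheapest pairwise merge (one AIB iteration)."""
        k = len(p_t)
        pairs = [(merge_cost(p_t, p_y_given_t, i, j), i, j)
                 for i in range(k) for j in range(i + 1, k)]
        _, i, j = min(pairs)
        p_new = p_t[i] + p_t[j]
        py_new = (p_t[i] * p_y_given_t[i] + p_t[j] * p_y_given_t[j]) / p_new
        keep = [m for m in range(k) if m not in (i, j)]
        return np.append(p_t[keep], p_new), np.vstack([p_y_given_t[keep], py_new])

Here p_t holds the cluster priors and p_y_given_t is a (|T|, |Y|) array of conditionals; repeating greedy_merge_step until one cluster remains reproduces the agglomerative schedule of the main loop.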
A plot of I(T; Y)/I(X; Y) vs. I(T; X)/H(X) is shown in Fig. 5.16. As can be seen, the decrease of mutual information δ(m) when decreasing m, given by the next equation, can only increase:
Fig. 5.16. Results of the AIB algorithm applied to a subset of the data in Fig. 5.15. This subset is built from 90 randomly selected elements of X. Left: tree showing the pair of clusters joined in each iteration, starting from the bottom (one cluster assigned to each data sample) and finishing at the top (a single cluster containing all the data samples). Right: plot of I(T; Y)/I(X; Y) vs. I(T; X)/H(X) (information plane). As the number of clusters decreases, data compression is higher, and as a consequence I(T; X) tends to zero. However, fewer clusters also means that the information that these clusters T give about Y decreases; thus, I(T; Y) also tends to zero. The algorithm yields all the intermediate cluster sets, so a trade-off between data compression and classification efficiency may be searched for.
\delta(m) = \frac{I(T_m; Y) - I(T_{m-1}; Y)}{I(X; Y)} \qquad (5.31)
Fig. 5.17. From left to right and from top to bottom. (a) Image representation: each ellipsoid represents a Gaussian of the Gaussian Mixture Model of the image, with its support region, mean color and spatial layout in the image plane. (b) Loss of mutual information during the AIB clustering; the last steps are labeled with the number of clusters in each step. (c) Part of the cluster tree formed during AIB, starting from 19 clusters; each cluster is represented with a representative image, and the labeled nodes indicate the order of cluster merging, following the plot in (b). (d) I(T; X) vs. I(T; Y) plot for four different clustering methods. (e) Mutual information between images and image representations. (Figure by Goldberger et al., ©2006 IEEE.) See Color Plates.
D(f \| g) = \int f \log \frac{f}{g} \approx \frac{1}{n} \sum_{t=1}^{n} \log \frac{f(x_t)}{g(x_t)} \qquad (5.32)
f(y|x) = \sum_{j=1}^{k(x)} \alpha_{x,j}\, N(\mu_{x,j}, \Sigma_{x,j}) \qquad (5.33)
where k(x) is the number of Gaussian components for image x. Several equations of Alg. 9 must be adapted to the Gaussian Mixture Model representation. For instance, given a cluster t, f(y|t) is the mean of all the image models that are part of t:

f(y|t) = \frac{1}{|t|} \sum_{x \in t} f(y|x) = \frac{1}{|t|} \sum_{x \in t} \sum_{j=1}^{k(x)} \alpha_{x,j}\, N(\mu_{x,j}, \Sigma_{x,j}) \qquad (5.34)

f(y|t_1 \cup t_2) = \frac{1}{|t_1 \cup t_2|} \sum_{x \in t_1 \cup t_2} f(y|x) = \sum_{i=1,2} \frac{|t_i|}{|t_1 \cup t_2|}\, f(y|t_i) \qquad (5.35)
where |X| is the size of the image database. From this formulation, AIB can be applied as explained in the previous section. In Fig. 5.17, the loss of mutual information for a given image database is shown. In the same figure, part of the cluster tree is also represented.
Mutual information is used in this work as a measure of quality. For in-
stance, the clustering quality may be computed as I(X; Y ), X being the unsu-
pervised clustering from AIB and Y a manual labeling. Higher values denote
better clustering quality, as the cluster gives more information about the im-
age classes. The quality of the image representation can also be evaluated by
means of I(X; Y ), in this case X being the set of images and Y the features
extracted from their pixels. Due to the fact that a closed-form expression to
calculate I(X; Y ) for Gaussian Mixtures does not exist, this mutual informa-
tion is approximated from I(T ; X) and I(T ; Y ). In Fig. 5.17, the I(T ; X) vs.
I(T ; Y ) plot is shown for AIB using Gaussian Mixture Models based on color
and localization (AIB – Color+XY GMM), Gaussian Mixture Models based
only on color (AIB – Color GMM), and without Gaussian Mixtures for AIB
and Agglomerative Histogram Intersection (AIB – Color histogram and AHI –
Color histogram). In all cases, I(X; Y) is extracted from the first point of the curves, since the sum of all merge costs is exactly I(X; Y).
Related to this plot, we show in Fig. 5.17 a table that summarizes these val-
ues. The best quality is obtained for color and localization based Gaussian
Mixture Models.
Image retrieval from this clustering is straightforward. First the query image is compared to all cluster representatives, and then it is compared to all images contained in the selected cluster. Not only is computational efficiency increased with respect to an exhaustive search, but previous experiments by Goldberger et al. also show that clustering increases retrieval performance. However, this approach has several drawbacks, the main one being that it is still computationally expensive (training being the most time-consuming phase). Furthermore, no high-level information like shape or texture is used, and as a consequence all images in a category must be similar in shape and appearance.
When the information rate exceeds the channel capacity, the communication is affected by distortion. The properties of the channel capacity are:
1. C ≥ 0, due to the fact that I(X; Y) ≥ 0.
2. C ≤ log |X|, due to the fact that C = max I(X; Y) ≤ max H(X) = log |X|.
3. C ≤ log |Y|, for the same reason.
4. I(X; Y) is a continuous function of p(x).
5. I(X; Y) is a concave function of p(x) (thus, a maximum value exists).
During Robust Information Clustering, and given the sample priors p(X)
and the optimal clustering p̄(W |X) obtained after a Deterministic Annealing
process, p(X) is updated in order to maximize the information that samples
x ∈ X yield about their membership to each cluster w ∈ W (see the last
channel capacity property):
D(p(X)) = \sum_{i=1}^{l} \sum_{k=1}^{K} p(x_i)\, \bar{p}(w_k|x_i)\, d(w_k, x_i) \qquad (5.39)

where d(w_k, x_i) is the dissimilarity between sample x_i ∈ X and cluster centroid w_k ∈ W. In order to perform the maximization, a Lagrange multiplier λ ≥ 0 may be introduced, and p(x_i) is iteratively updated in proportion to p(x_i) c_i, where

c_i = \exp\left[ \sum_{k=1}^{K} \left( \bar{p}(w_k|x_i) \ln \frac{\bar{p}(w_k|x_i)}{\sum_{j=1}^{l} p(x_j)\, \bar{p}(w_k|x_j)} - \lambda\, \bar{p}(w_k|x_i)\, d(w_k, x_i) \right) \right] \qquad (5.42)
Fig. 5.18. Left: after choosing a class of functions to fit the data, they are nested
in a hierarchy ordered by increasing complexity. Then, the best parameter config-
uration of each subset is estimated in order to best generalize the data. Right: as
complexity of the functions increases, the VC dimension also increases and the em-
pirical error decreases. The vertical dotted line represents the complexity for which
the sum of VC dimension and empirical error is minimized; thus, that is the model
order. Functions over that threshold overfit the data, while functions below that
complexity underfit it.
S1 ⊂ S2 ⊂ · · · ⊂ SK ⊂ · · · (5.43)
Q_K(x_i, W) = \lim_{T \to 0} \sum_{k=1}^{K} p(w_k|x_i) = \lim_{T \to 0} \sum_{k=1}^{K} \frac{p(w_k)\, \exp(-d(x_i, w_k)/T)}{\sum_{k'=1}^{K} p(w_{k'})\, \exp(-d(w_{k'}, x_i)/T)} \qquad (5.44)
When T → 0, p(w_k|x_i) may be approximated as the complement of a step function, which is linear in the parameters and assigns a label to each sample depending on the dissimilarity between sample x_i and cluster w_k. Thus, as stated by Vapnik [165], the VC dimension may be estimated from the number of parameters, being h_k = (n + 1)k for each S_k. Then an increment in the number of clusters leads to an increment of complexity: h_1 ≤ h_2 ≤ · · · ≤ h_k ≤ · · ·. From this starting point, the model order is selected by minimizing the following VC bound, similarly to Vapnik's application to Support Vector Machines:
p_s \leq \eta + \frac{\varepsilon}{2}\left( 1 + \sqrt{1 + \frac{4\eta}{\varepsilon}} \right) \qquad (5.45)

where

\eta = \frac{m}{l} \qquad (5.46)

\varepsilon = 4\, \frac{h_k\left( \ln \frac{2l}{h_k} + 1 \right) - \ln \frac{\xi}{4}}{l} \qquad (5.47)
l and m being the number of samples and outliers, and ξ < 1 a constant.
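A direct transcription of Eqs. 5.45–5.47 is sketched below: it scores a candidate number of clusters k from the number of samples l, the number of detected outliers m, and the constant ξ, so the model order can be chosen as the k with the smallest bound. The way m is obtained for each candidate k (here a hypothetical m_k table) is not part of this sketch.

    import numpy as np

    def vc_bound(l, m, k, n, xi=0.2):
        """Upper bound p_s of Eqs. 5.45-5.47 for a clustering with k clusters
        of n-dimensional data, with m detected outliers among l samples."""
        h_k = (n + 1) * k                                   # VC dimension estimate
        eta = m / l                                         # outlier ratio (Eq. 5.46)
        eps = 4.0 * (h_k * (np.log(2.0 * l / h_k) + 1.0) - np.log(xi / 4.0)) / l
        return eta + 0.5 * eps * (1.0 + np.sqrt(1.0 + 4.0 * eta / eps))  # Eq. 5.45

    # Model order selection: pick the k in 1..Kmax with the smallest bound,
    # where m_k[k] is the number of outliers found at that order (hypothetical).
    # best_k = min(range(1, Kmax + 1), key=lambda k: vc_bound(l, m_k[k], k, n))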
The RIC algorithm can be seen in Alg. 10. From the dataset X = {x_1, ..., x_l} and a chosen maximum number of clusters K_max, the algorithm returns the data split into a set of clusters with centers W = {w_1, ..., w_k} and identifies the outliers. Song provides an expression to estimate the parameter K_max, but depending on the nature of the data this expression may not be valid; thus, we leave it as an open parameter in Alg. 10. Regarding the dissimilarity measure, the Euclidean distance is the one chosen for this algorithm. Due to this fact, RIC tends to create hyperspherical clusters around each cluster center. Alternative dissimilarity measures may help to adapt the algorithm to kernel-based clustering or to data that are not linearly separable.
An example of application is shown in Fig. 5.19. In this example, RIC is applied to a dataset obtained from four different Gaussian distributions; thus, four is the optimal number of clusters. In this example, we set T_min to 0.1 and α to 0.9. The parameter ξ, used during order selection, is set to 0.2. Finally, the parameters ε and λ, which affect the number of outliers found during the algorithm, were set to 1 and 0.05, respectively. As can be seen, although the clustering based on five clusters from deterministic annealing (K = 5) yields no outliers, the algorithm is able to find the optimal number of clusters in the case of K = 4, for which several outliers are detected. Note that the model order selection step needs information about noisy data. Therefore, if λ = 0 is used to solve the same example, the algorithm reaches K_max without discovering the optimal model order of the data.
p(w_i) = \sum_x \bar{p}(x)\, p(w_i|x)

w_i = \frac{\sum_x \bar{p}(x)\, p(w_i|x)\, x}{p(w_i)}

  end
end
3. Cooling: T = αT, with α < 1.
4. Cluster partition step: calculate λ_max(V_{xw}) for each cluster k, where

V_{xw} = \sum_{i=1}^{l} p(x_i|w_k)(x_i - w_k)(x_i - w_k)^T.

Let M = \min_k λ_max for k = 1 ... K, corresponding to cluster k̄, which may be split during this step.
if T ≤ 2M and K < K_max then
  Outlier detection and order selection:
  Initialization: p(x_i) = 1/l, λ > 0, ε > 0, and p̄(w|x) the optimal clustering partition given by step 2.
  while \left| \ln \sum_{i=1}^{l} p(x_i)\, c_i - \ln \max_{i=1...l} c_i \right| \geq \varepsilon do
    for i = 1 ... l do

      c_i = \exp\left[ \sum_{k=1}^{K} \left( \bar{p}(w_k|x_i) \ln \frac{\bar{p}(w_k|x_i)}{\sum_{j=1}^{l} p(x_j)\, \bar{p}(w_k|x_j)} - \lambda\, \bar{p}(w_k|x_i)\, \|w_k - x_i\|^2 \right) \right]

      p(x_i) = \frac{p(x_i)\, c_i}{\sum_{j=1}^{l} p(x_j)\, c_j}

    end
  end
(continued)
Information theory has not only been incorporated into clustering algorithms in recent years; it may also help to assess or even theoretically study this kind of method. An example is given in this section, where mean shift, a popular clustering and order selection algorithm, is observed from an information theoretical point of view [135].
P(x, \sigma) = \frac{1}{N} \sum_{i=1}^{N} G(\|x - x_i\|^2, \sigma) \qquad (5.48)
In order to estimate the pdf modes from a set of samples obtained from a multimodal Gaussian density, the mean shift algorithm looks for stationary points where ∇P(x, σ) = 0. This problem may be solved by means of an iterative procedure in which x^{t+1} at iteration t + 1 is obtained from its value x^t at the previous iteration:

x^{t+1} = m(x^t) = \frac{\sum_{i=1}^{N} G(\|x^t - x_i\|^2, \sigma)\, x_i}{\sum_{i=1}^{N} G(\|x^t - x_i\|^2, \sigma)} \qquad (5.49)
Fig. 5.19. Example of application of RIC to a set of 2D samples drawn from four different Gaussian distributions. From left to right and from top to bottom: data samples, optimal hard clustering for K = 1 (ps = 0.4393), optimal hard clustering for K = 2 (ps = 0.6185), optimal hard clustering for K = 3 (ps = 0.6345), optimal hard clustering for K = 4 (ps = 0.5140) and optimal hard clustering for K = 5 (ps = 0.5615). In all cases, each cluster is represented by a convex hull containing all its samples, and a circle represents the cluster representative. Outliers are represented by star symbols. When K = 4 the minimum ps is found; thus, although in other cases the number of outliers is lower, K = 4 is the optimal number of clusters returned by the algorithm.
The procedure moves with large steps through low-density regions and with small steps otherwise; step size estimation is not needed. However, the kernel width σ remains an important parameter.
The iterative application of Eq. 5.49 until convergence is known as Gaussian Blurring Mean Shift (GBMS). This process searches for the pdf modes while
blurring the initial dataset. First, the samples evolve toward the modes of the pdf while mutually approaching. Then, from a given iteration on, the data tend to collapse quickly, making the algorithm unstable. The blurring effect may be avoided if the pdf estimation is based on the original configuration of the samples. This improvement is known as Gaussian Mean Shift (GMS). In GMS, the pdf estimation is obtained by comparing the samples at the current iteration with the original ones x_i^o ∈ X^o:
x^{t+1} = m(x^t) = \frac{\sum_{i=1}^{N} G(\|x^t - x_i^o\|^2, \sigma)\, x_i^o}{\sum_{i=1}^{N} G(\|x^t - x_i^o\|^2, \sigma)} \qquad (5.50)
An advantage of GMS over GBMS is that the samples are not blurred: the kernels tend to converge to the pdf modes and remain stable. Therefore, a simple stop criterion for GMS based on a translation threshold may be applied. GMS ends when the mean magnitude of the mean shift vectors of the samples is lower than a given threshold, that is,

\frac{1}{N} \sum_{i=1}^{N} d^t(x_i) < \delta \qquad (5.51)
For GBMS, the transition from the first convergence phase to the second one can be detected when the difference between the Shannon entropies estimated from this histogram at consecutive iterations is approximately equal to zero.
Two examples of GMS and GBMS are shown in Figs. 5.20 and 5.21. The data in the first example are extracted from a ring of 16 Gaussian pdfs with the same variance and different a priori probabilities (which is why the Gaussians are represented with different heights in that figure). In the second example, the data were extracted from 10 Gaussian pdfs with random means and variances. As can be seen, in both cases GMS outperforms GBMS, correctly locating the exact number of pdf modes and giving a more accurate prediction of the actual mode locations. An effect of GBMS is the collapse of several modes if their separation is lower than the kernel size.
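A compact sketch of the GMS iteration of Eq. 5.50 is given below: every point is moved to a Gaussian-weighted average of the original samples until the mean displacement falls below the threshold of Eq. 5.51. The vectorized update and the iteration cap are implementation choices of this illustration.

    import numpy as np

    def gaussian_mean_shift(X, sigma, delta=1e-5, max_iter=500):
        """Gaussian Mean Shift (Eq. 5.50): points move toward the modes of the
        Parzen density built from the ORIGINAL samples X (no blurring)."""
        Xo = np.asarray(X, dtype=float)          # fixed original configuration
        Y = Xo.copy()                            # points being shifted
        for _ in range(max_iter):
            sq = ((Y[:, None, :] - Xo[None, :, :]) ** 2).sum(axis=2)
            W = np.exp(-sq / (2.0 * sigma ** 2))
            Y_new = (W @ Xo) / W.sum(axis=1, keepdims=True)
            if np.linalg.norm(Y_new - Y, axis=1).mean() < delta:   # Eq. 5.51
                return Y_new
            Y = Y_new
        return Y

Replacing Xo inside the loop with the current Y turns this sketch into GBMS, which is precisely the blurring behavior discussed above; the converged points gather around the pdf modes and can be grouped with a small tolerance to recover their number.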
Fig. 5.20. Example of GMS and GBMS application. The samples were obtained
from a ring of Gaussian distributions with equal variance and different a priori
probabilities. From left to right and from top to bottom: input data, pdfs from which
data were extracted, GBMS results, and GMS results.
Fig. 5.21. Example of GMS and GBMS application. The samples were obtained from 10 Gaussian distributions with random means and variances. From left to right and from top to bottom: input data, pdfs from which the data were extracted, GBMS results, and GMS results.
This section studies the relation between the Rényi entropy and pdf estimation based on Parzen windows, which will be exploited in the next section. Let us first recall the expression of the Rényi quadratic entropy:

H(X) = -\log \int P^2(x)\, dx \qquad (5.54)

Plugging the Parzen estimate of Eq. 5.48 into this expression leads to H(X) = −log V(X), where

V(X) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(\|x_i - x_j\|^2, \sigma) \qquad (5.56)
It must be noted that this last expression considers each pair of samples.
The contribution of each sample xi ∈ X is given by
V(x_i) = \frac{1}{N^2} \sum_{j=1}^{N} G(\|x_i - x_j\|^2, \sigma) \qquad (5.57)
In the original paper by Rao et al. [135], the samples are described as information particles that interact with each other by means of forces similar to those of the laws of physics. From this view emerges the association of V(x_i) with the concept of information potential. Thus, V(x_i) may be understood as the information potential of x_i over the rest of the samples in the dataset (see Fig. 5.22).
The derivative is given by

\frac{\partial}{\partial x_i} V(x_i) = \frac{1}{N^2} \sum_{j=1}^{N} G(\|x_i - x_j\|^2, \sigma)\, \frac{x_j - x_i}{\sigma^2} \qquad (5.58)
Following the particle analogy, this derivative F(x_i) represents the net information force that all the samples exert on x_i:

F(x_i) = \frac{\partial}{\partial x_i} V(x_i) = \sum_{j=1}^{N} F(x_i|x_j) \qquad (5.59)
where

V(X; Y) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} G(\|x_i - y_j\|^2, \sigma) \qquad (5.63)
Fig. 5.22. Representation of the information force within a dataset (top) and be-
tween two different datasets (bottom).
and σ² = σ_X² + σ_Y². Finally, the information force that Y exerts on a sample x_i ∈ X (see Fig. 5.22) is

F(x_i; Y) = \frac{\partial}{\partial x_i} V(x_i; Y) = \sum_{j=1}^{M} F(x_i|y_j) = \frac{1}{NM} \sum_{j=1}^{M} G(\|x_i - y_j\|^2, \sigma)\, \frac{y_j - x_i}{\sigma^2}
Minimizing the Rényi quadratic entropy H(X) = −log V(X) of the dataset is therefore equivalent to solving

J(X) = \max_X V(X) = \max_X \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(\|x_i - x_j\|^2, \sigma).
In the latter expression, the logarithm can be removed: since the logarithm is a monotonic function, its optimization may be translated into an optimization of its argument, V(X) in this case. As stated before, the samples x_k, k = 1, 2, ..., N, are modified in each iteration. In order to search for a stable configuration of X, J(X) must be differentiated and equated to zero:

2F(x_k) = \frac{2}{N^2} \sum_{j=1}^{N} G(\|x_k - x_j\|^2, \sigma)\, \frac{x_j - x_k}{\sigma^2} = 0 \qquad (5.64)
After rearranging Eq. 5.64, we obtain exactly the GBMS iterative sample updating equation of Eq. 5.49. The conclusion drawn from this derivation is that GBMS directly minimizes the dataset's overall Rényi quadratic entropy. The cause of the instability of GBMS is the infinite support of the Gaussian kernel, for which the only saddle point is given by H(X) = 0. In order to avoid the GBMS issues, we may choose to minimize Rényi's cross entropy rather than the Rényi quadratic entropy. The new cost function is given by
J(X) = \max_X V(X; X^o) = \max_X \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} G(\|x_i - x_j^o\|^2, \sigma) \qquad (5.65)
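The sketch below computes the information potential V(X) of Eq. 5.56, the corresponding Rényi quadratic entropy −log V(X), and the cross potential V(X; X^o) of Eq. 5.65, so the two cost functions discussed here can be monitored along the mean shift iterations. Whether G includes the Gaussian normalizing constant does not change the location of the optima, so it is omitted here.

    import numpy as np

    def gauss_kernel(sq_dist, sigma):
        """G(||.||^2, sigma): unnormalized Gaussian kernel on squared distances."""
        return np.exp(-sq_dist / (2.0 * sigma ** 2))

    def information_potential(X, Y, sigma):
        """V(X;Y) = 1/(N M) sum_i sum_j G(||x_i - y_j||^2, sigma); V(X) = V(X, X)."""
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        return gauss_kernel(sq, sigma).mean()

    def renyi_quadratic_entropy(X, sigma):
        """H_2(X) = -log V(X): the quantity GBMS implicitly minimizes."""
        return -np.log(information_potential(X, X, sigma))

    # Monitoring the two cost functions of this section during the iterations:
    # renyi_quadratic_entropy(Y, sigma)                      # GBMS-style cost
    # -np.log(information_potential(Y, X_original, sigma))   # GMS-style cross cost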
Fig. 5.23. Evolution of GMS and GBMS cost function during 59 mean shift itera-
tions applied to data shown in Fig. 5.20.
The cost function decreases during the whole process. The GMS stop criterion is more intuitive, since it is based on its own cost function. GBMS, however, should finish before H(X) = 0 is reached, or all the samples would have collapsed into a single one. The Rényi cross-entropy cost function could also have been applied to GBMS: although a minimum in this curve would be indicative of the transition from the first convergence phase to the second one, this approach is only valid when the pdf modes do not overlap.
X = {X1 , X2 , . . . , XN } (5.67)
Suppose we have N = 6 samples and the algorithms yield four different par-
titions C = {C1 , C2 , C3 , C4 } which contain the following labelings:
C1 = {1, 1, 2, 2, 3, 3}
C2 = {2, 2, 3, 3, 1, 1}
(5.68)
C3 = {2, 1, 1, 4, 3, 3}
C4 = {2, 1, 4, 1, 3, 3}
These partitions are the result of hard clustering, which is to say that the different clusters of a partition are disjoint sets. In contrast, in soft clustering each sample can be assigned to several clusters in different degrees.
How should these labels be combined for the clustering ensemble? In the first place, the labels are nominal values and have no relation across the different clusterings C_i. For example, C_1 and C_2 correspond to identical partitions despite the values of the labels. Secondly, some clusterings may agree on some samples but disagree on others. Also, the fact that some clusterings may yield a different number of clusters adds extra complexity. Therefore some kind of consensus is needed.
At first glance, the consensus clustering C∗ should share as much informa-
tion as possible with the original clusterings Ci ∈ C. This could be formulated
as a combinatorial optimization problem in terms of mutual information [150].
A different strategy is to summarize the clustering results in a co-association
matrix whose values represent the degree of association between objects. Then,
some kind of voting strategy can be applied to obtain a final clustering. An-
other formulation of the problem [131] states that the clusterings Ci ∈ C are,
again, the input of a new clustering problem in which the different labels as-
signed to each sample become its features. In Section 5.8.1 we will explain
this formulation.
In the previous example we showed four different clusterings for the data
X = {X1 , X2 , · · · , XN } with N = 6 samples. Suppose each Xj ∈ X has
D = 2 features:
X1 = {x11 , x12 }
X2 = {x21 , x22 }
X3 = {x31 , x32 }
(5.69)
X4 = {x41 , x42 }
X5 = {x51 , x52 }
X6 = {x61 , x62 }
The samples Xj can be represented together with the labels of their clusterings
in C. Also, let us use different labels for each one of the partitions, to make
explicit that they have no numerical relation. Now each sample has H = |C|
new features. There are two ways for performing a new clustering: (a) using
both original and new features or (b) using only new features. Using only
the new features makes the clustering independent of the original features. At
this point the problem is transformed into a categorical clustering problem
in a new space of features and it can be solved using various statistical and
IT-based techniques. The resulting clustering C∗ is known as a consensus
clustering or a median partition and it summarizes the partitions defined by
the set of new features.
Different consensus functions and heuristics have been designed in the litera-
ture. Co-association methods, re-labeling approaches, the mutual information
approach, and mixture model of consensus are some relevant approaches [131].
Among the graph-based formulations [150] there are the instance-based, the
cluster-based and the hypergraph methods. Some of the characteristics of
these approaches are:
In this approach for consensus the probabilities of the labels for each pattern
are modeled with finite mixtures. Mixture models have already been described
in this chapter. Suppose each component of the mixture is described by the
parameters θ m , 1 ≤ m ≤ M where M is the number of components in the
mixture. Each component corresponds to a cluster in the consensus clustering,
and each cluster also has a prior probability πm , 1 ≤ m ≤ M . Then the
parameters to be estimated for the consensus clustering are:
\Theta = \{\pi_1, \ldots, \pi_M, \theta_1, \ldots, \theta_M\} \qquad (5.70)

\mathbf{Y} = \{Y_1, \ldots, Y_N\}, \qquad Y_i = \{c_{1i}, c_{2i}, \ldots, c_{Hi}\} \qquad (5.71)

P(Y_i|\Theta) = \sum_{m=1}^{M} \pi_m\, P_m(Y_i|\theta_m) \qquad (5.72)

\log \mathcal{L}(\Theta|\mathbf{Y}) = \sum_{i=1}^{N} \log P(Y_i|\Theta) \qquad (5.73)
¹ However, the number of final clusters is not obvious, neither in the example nor in real experiments. This problem, called model order selection, has already been discussed in this chapter. A minimum description length criterion can be used to select the model order.
Now the problem is to estimate the parameters Θ which maximize the likeli-
hood function (Eq. 5.73). For this purpose some densities have to be modeled.
In the first place, a model has to be specified for the component-conditional
densities which appear in Eq. 5.72. Although the different clusterings of the
ensemble are not really independent, it could be assumed that the compo-
nents of the vector Yi (the new features of each sample) are independent,
so their probabilities will be calculated as a product. In the second place, a probability density function (PDF) has to be chosen for the components of Y_i. Since they consist of cluster labels in C_i, the PDF can be modeled as a multinomial distribution.
A distribution of a set of random variates X_1, X_2, ..., X_k is multinomial if

P(X_1 = x_1, \ldots, X_k = x_k) = \frac{n!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k} \vartheta_i^{x_i} \qquad (5.74)

where

\sum_{i=1}^{k} x_i = n \qquad (5.75)

\sum_{i=1}^{k} \vartheta_i = 1 \qquad (5.76)
Putting this together with the product for the conditional probability of Yi
(for which conditional independence was assumed), the probability density for
the components of the vectors Yi is expressed as
P_m(Y_i|\theta_m) = \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \left( \vartheta_{jm}(k) \right)^{\delta(y_{ij}, k)} \qquad (5.77)
where ϑjm (k) are the probabilities of each label and δ(yij , k) returns 1 when
k is the same as the position of the label yij in the labeling order; otherwise
it returns 0. K(j) is the number of different labels existing in the partition
j. For example, for the labeling of the partition C2 = {b, b, c, c, a, a}, we
have K(2) = 3, and the function δ(yi2 , k) would be evaluated to 1 only for the
parameters: δ(a, 1), δ(b, 2), δ(c, 3) and it would be 0 for any other parameters.
Also note that in Eq. 5.77, for a given mixture component, the probabilities for each clustering sum to 1:

\sum_{k=1}^{K(j)} \vartheta_{jm}(k) = 1, \qquad \forall j \in \{1, \ldots, H\},\ \forall m \in \{1, \ldots, M\}. \qquad (5.78)
Table 5.1. Original feature space and transformed feature space obtained from the clustering ensemble. Each partition of the ensemble is represented with different labels in order to emphasize the label correspondence problem.
In our toy example, taking the values of Y from Table 5.1 and using Eq. 5.77, the probability for the vector Y_1 = {1, b, N, β} to be described by the 2nd mixture component would be

P_2(Y_1|\theta_2) = \prod_{k=1}^{3} \vartheta_{12}(k)^{\delta(1,k)} \cdot \prod_{k=1}^{3} \vartheta_{22}(k)^{\delta(b,k)} \cdot \prod_{k=1}^{4} \vartheta_{32}(k)^{\delta(N,k)} \cdot \prod_{k=1}^{4} \vartheta_{42}(k)^{\delta(\beta,k)} \qquad (5.79)

= \vartheta_{12}(1)^1 \vartheta_{12}(2)^0 \vartheta_{12}(3)^0 \cdot \vartheta_{22}(1)^0 \vartheta_{22}(2)^1 \vartheta_{22}(3)^0 \cdot \vartheta_{32}(1)^0 \vartheta_{32}(2)^1 \vartheta_{32}(3)^0 \vartheta_{32}(4)^0 \cdot \vartheta_{42}(1)^0 \vartheta_{42}(2)^1 \vartheta_{42}(3)^0 \vartheta_{42}(4)^0
The EM algorithm iterates two main steps. In the E step the expected values of the hidden variables are estimated, given the data Y and the current estimation of the parameters Θ. The following equations result from the derivation of the EM algorithm:

E[z_{im}] = \frac{\pi_m \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \left( \vartheta_{jm}(k) \right)^{\delta(y_{ij},k)}}{\sum_{n=1}^{M} \pi_n \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \left( \vartheta_{jn}(k) \right)^{\delta(y_{ij},k)}} \qquad (5.81)
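The E step of Eq. 5.81 can be written compactly, as in the sketch below: for each label vector Y_i, the responsibility of component m is proportional to π_m times the product of the ϑ_{jm} probabilities of the observed labels. Encoding the labels as integer indices per partition is an assumption of this sketch.

    import numpy as np

    def e_step(Y, pi, theta):
        """E[z_im] of Eq. 5.81.
        Y:     (N, H) int array, Y[i, j] = index of the label of sample i in partition j
        pi:    (M,) component priors
        theta: list of H arrays; theta[j] has shape (M, K(j)) with rows summing to 1."""
        N, H = Y.shape
        resp = np.tile(pi, (N, 1))                     # start from the priors
        for j in range(H):
            resp *= theta[j][:, Y[:, j]].T             # multiply in theta_{jm}(y_ij)
        resp /= resp.sum(axis=1, keepdims=True)        # normalize over components m
        return resp

    # Toy usage with H = 2 partitions, M = 2 components and 3 labels per partition:
    # resp = e_step(Y, pi=np.array([0.5, 0.5]),
    #               theta=[np.full((2, 3), 1/3), np.full((2, 3), 1/3)])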
In the mutual information consensus approach, in order to share as much information as possible with all the partitions in the ensemble C, the sum of all the mutual informations has to be maximized:

C^* = \arg\max_{C} \sum_{i=1}^{H} I(C; C_i) \qquad (5.86)
Let us see an example with the data in Table 5.1. Suppose we want to calculate the mutual information between the fourth clustering C_4 and a consensus clustering C^* = {1, 1, 1, 2, 2, 2}. We would have the following cluster labels:

C_4 = \{\beta, \alpha, \delta, \alpha, \gamma, \gamma\}, \qquad C^* = \{1, 1, 1, 2, 2, 2\}, \qquad (5.87)

where L_j^4 denotes the set of samples having the jth label of C_4 and L_k^* the set of samples with label k in C^*.
Using Eq. 5.84 we calculate the mutual information between the partitions. The logarithm log(x) is calculated in base 2.

I(C^*; C_4) = \sum_{k=1}^{2} \sum_{j=1}^{4} \frac{|L_k^* \cap L_j^4|}{N} \log_2 \frac{|L_k^* \cap L_j^4| \cdot N}{|L_k^*| \cdot |L_j^4|}, \qquad N = 6.
I_2(\{1, 2, 1, 2, 2, 2\}; C) = \sum_{i=1}^{H} I_2(\{1, 2, 1, 2, 2, 2\}; C_i) = 0.2222 + 0.2222 + 0.5556 + 0.8889 = 1.8889
I_2(\{2, 2, 1, 2, 2, 2\}; C) = 0.2222 + 0.2222 + 0.2222 + 0.5556 = 1.2222
I_2(\{1, 1, 1, 2, 2, 2\}; C) = 0.6667 + 0.6667 + 1.0000 + 0.6667 = 3.0000
I_2(\{1, 3, 1, 2, 2, 2\}; C) = 0.5556 + 0.5556 + 0.8889 + 0.8889 = 2.8889
I_2(\{1, 2, 2, 2, 2, 2\}; C) = 0.2222 + 0.2222 + 0.5556 + 0.5556 = 1.5556 \qquad (5.92)
I_2(\{1, 2, 1, 1, 2, 2\}; C) = 0.6667 + 0.6667 + 0.6667 + 0.6667 = 2.6667
I_2(\{1, 2, 1, 3, 2, 2\}; C) = 0.5556 + 0.5556 + 0.8889 + 0.8889 = 2.8889
I_2(\{1, 2, 1, 2, 1, 2\}; C) = 0.0000 + 0.0000 + 0.3333 + 0.6667 = 1.0000
I_2(\{1, 2, 1, 2, 3, 2\}; C) = 0.2222 + 0.2222 + 0.5556 + 0.8889 = 1.8889
I_2(\{1, 2, 1, 2, 2, 1\}; C) = 0.0000 + 0.0000 + 0.3333 + 0.6667 = 1.0000
I_2(\{1, 2, 1, 2, 2, 3\}; C) = 0.2222 + 0.2222 + 0.5556 + 0.8889 = 1.8889
Different constraints regarding the vector space and the possible changes can be imposed. A hill-climbing algorithm has been used in this example; however, it would stop at the first local maximum. Simulated annealing or a different search algorithm has to be used to overcome this problem.
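The building block of this search is the mutual information between two labelings, computed from their contingency table; a greedy consensus search evaluates it against the H ensemble members after every candidate label change. The sketch below uses only numpy; the Greek labels of C_4 are renamed to ASCII, which does not change the mutual information since the labels are nominal.

    import numpy as np

    def mutual_information_bits(labels_a, labels_b):
        """I_2 between two hard clusterings of the same N samples (log base 2)."""
        a_vals, a = np.unique(labels_a, return_inverse=True)
        b_vals, b = np.unique(labels_b, return_inverse=True)
        n = len(a)
        # Contingency table: joint counts |L_k^a intersect L_j^b|.
        joint = np.zeros((len(a_vals), len(b_vals)))
        np.add.at(joint, (a, b), 1)
        pxy = joint / n
        px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))

    C4 = ['b', 'a', 'd', 'a', 'g', 'g']   # the fourth clustering of the example (relabeled)
    C_star = [1, 1, 1, 2, 2, 2]           # candidate consensus clustering
    print(round(mutual_information_bits(C_star, C4), 4))  # 0.6667, the C4 term above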
Problems

Given the samples x_1 = (2, 4), x_2 = (5.1, 0), x_3 = (9.9, 1.7), x_4 = (12.3, 5.2) and the representatives t_1 = (5, 2.5), t_2 = (8.5, 3.5), if p(x_i) = p(x_j) ∀i ≠ j and p(t_1|x_i) = p(t_2|x_i) ∀i, estimate the distortion E_d(X, T) and the model complexity. Think of how to decrease the distortion without varying the number of clusters. How does this change affect the model complexity?
Given the following probabilities:

X     p(x_i, Y = y_1)
x_1   0.9218
x_2   0.7382
x_3   0.1763
x_4   0.4057

apply the first iteration of Alg. 5.16, determining the first two samples that are joined and the new posteriors.
(Diagram: a binary channel between X and Y with transition probabilities 1 − p and p.)
6.1 Introduction
A fundamental problem in pattern classification is to work with a set of fea-
tures which are appropriate for the classification requirements. The first step
is the feature extraction. In image classification, for example, the feature set
commonly consists of gradients, salient points, SIFT features, etc. High-level
features can also be extracted. For example, the detection of the number of
faces and their positions, the detection of walls or surfaces in a structured envi-
ronment, or text detection are high-level features which also are classification
problems in and of themselves.
Once the set of features has been designed, it is convenient to select the most informative ones. The reason for this is that the feature extraction process does not always yield the best features for a concrete problem. The original feature set usually contains more features than necessary. Some of them could be redundant, and some could introduce noise or be irrelevant. In some problems the number of features is very high and their dimensionality has to be reduced in order to make the problem tractable. In other problems feature selection provides new knowledge about the data classes. For example, in gene selection [146] a set of genes (features) is sought in order to explain which genes cause some disease. On the other hand, a properly selected feature set significantly improves classification performance. However, feature selection is a challenging task.
There are two major approaches to dimensionality reduction: feature se-
lection and feature transform. Feature selection reduces the feature set by
discarding features. A good introduction to feature selection can be found
in [69]. Feature transform refers to building a new feature space from the
original variables, therefore it is also called feature extraction.
Some well-known feature transform methods are principal component anal-
ysis (PCA), linear discriminant analysis (LDA), and independent component
analysis (ICA), among others. PCA transform relies on the eigenvalues of the
covariance matrix of the data, disregarding the classes. It represents the data
¹ In supervised classification, a classifier is built given a set of samples, each one labeled with the class to which it belongs. In this section the term classification always refers to supervised classification.
Suppose the training set consists of the samples {S1, S3, S6, S8, S9} and the rest, {S2, S4, S5, S7}, form the test set. Then, the classifiers C1 and C2 have to be built with the following data:

C1 (features F1, F2):  (x11, x12) → C1,  (x31, x32) → C1,  (x61, x62) → C2,  (x81, x82) → C2,  (x91, x92) → C2
C2 (features F1, F3):  (x11, x13) → C1,  (x31, x33) → C1,  (x61, x63) → C2,  (x81, x83) → C2,  (x91, x93) → C2

and tested with the following data:

Test C1:  (x21, x22), (x41, x42), (x51, x52), (x71, x72)
Test C2:  (x21, x23), (x41, x43), (x51, x53), (x71, x73)
Output:   C1, C1, C2, C2
The row denoted Output contains the labels that the classifiers are expected to return for the selected samples. The accuracy of each classifier is evaluated based on the similarity between its actual output and the desired output. For example, if the classifier C1 returned C1, C2, C2, C2, while the classifier C2 returned C1, C2, C1, C1, then C1 would be more accurate. The conclusion would be that the feature set (F1, F2) works better than (F1, F3). This wrapper example is oversimplified: drawing such a conclusion from just one classification test would be statistically unreliable, and cross-validation techniques have to be applied in order to decide which feature set is better than another.
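The sketch below reproduces this wrapper comparison with scikit-learn: two candidate feature subsets are scored with k-fold cross-validation of a K-NN classifier, and the subset with the lower CV error wins. The library, the classifier, and the fold count are choices of this illustration, not requirements of the wrapper scheme.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    def cv_error(X, y, feature_idx, k_folds=10):
        """Cross-validation error of a 1-NN classifier on a feature subset."""
        clf = KNeighborsClassifier(n_neighbors=1)
        scores = cross_val_score(clf, X[:, feature_idx], y, cv=k_folds)
        return 1.0 - scores.mean()

    # Compare the subsets (F1, F2) and (F1, F3) as in the example above:
    # err_12 = cv_error(X, y, [0, 1])
    # err_13 = cv_error(X, y, [0, 2])
    # better = "(F1, F2)" if err_12 < err_13 else "(F1, F3)"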
Having explained the basic concept of wrapper feature selection and the cross-validation process, let us present an example in the field of computer vision.
The problem is the supervised classification of indoor and outdoor images
that come from two different sequences taken with a camera mounted on a
walking person. Both sequences are obtained along the same path. The first
one contains 721 images and is used as a training set. The second sequence,
containing 470 images, is used as a test set, so it does not take part in the
feature selection process. The images have a 320 × 240 resolution and they
are labeled with six different classes: an office, two corridors, stairs, entrance,
and a tree avenue, as shown in Fig. 6.1. The camera used is a stereo camera,
so range information (depth) is another feature in the data sets.
One of the most important decisions in a classification problem is how to
extract the features from the available data. However, this is not the topic of
this chapter and we will just explain a way to extract global low-level features
from an image. The technique consists of applying a set of basic filters to each
image and taking the histograms of their responses as the features that char-
acterize each image. Then, the feature selection process decides which features
are important for the addressed classification problem. The selected feature
set depends on the classifier, on the data, and on the labels of the training set.
The filters we use in this example are applied to the whole image and they
return a histogram of responses. This means that for each filter we obtain
information about the number of pixels in the image which do not respond to
it, the number of pixels which completely respond to it, and the intermediate
levels of response, depending on the number of bins in the histogram.
There are 18 filters, each one of which has a number of bins in the his-
togram. The features themselves are given by the bins of the histograms. An
example is shown in Fig. 6.2. The filters are the following:
• Nitzberg
• Canny
• Horizontal gradient
Fig. 6.1. A 3D reconstruction of the route followed during the acquisition of the
data set, and examples of each one of the six classes. Image obtained with 6-DOF
SLAM. Figure by F. Escolano, B. Bonev, P. Suau, W. Aguilar, Y. Frauel, J.M. Sáez
and M.A. Cazorla (©2007 IEEE). See Color Plates.
• Vertical gradient
• Gradient magnitude
• Twelve color filters Hi , 1 ≤ i ≤ 12
• Depth information
Some of them are redundant, for example the magnitude and the gradients.
Others are similar, like Canny and gradient’s magnitude. Finally, some fil-
ters may overlap, for example the color filters. The color filters return the
probability distribution of some definite color H (from the HSB color space).
The feature selection criterion of the wrapper method is based on the
performance of the classifier for a given subset of features, as already ex-
plained. For a correct evaluation of the classification error, the cross valida-
tion method is used. Tenfold CV is suitable for larger data sets, while LOOCV
is useful when the data set is very small. For the following experiments the
classification error reported is calculated using 10-fold cross validation. The
experiments are performed with a K-Nearest Neighbor (K-NN) classifier with
K = 1.
There are different strategies for generating feature combinations. The
only way to ensure that a feature set is optimum is the exhaustive search
6.2 Wrapper and the Cross Validation Criterion 217
Fig. 6.2. Responses of some filters applied to an image. From top-bottom and left-
right: input image, depth, vertical gradient, gradient magnitude, and four color fil-
ters. The rest of the filters are not represented as they yield null output for this
input image. Figure by B. Bonev, F. Escolano and M.A. Cazorla (©2008 Springer).
The fastest way to select from a large number of features is a greedy strategy. Its computational complexity (number of evaluated feature subsets) is

O(n) = \sum_{i=1}^{n} i \qquad (6.2)
The algorithm is described in Alg. 11. At the end of each iteration a new
feature is selected and its CV error is stored. The process is also outlined in
Fig. 6.3.
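A minimal sketch of this greedy (forward) wrapper selection is shown below: at each iteration the feature whose addition yields the lowest CV error is appended to the selected set, reusing the cv_error helper sketched above. Stopping after ranking all the features and keeping the best prefix is one possible choice; Alg. 11 may differ in details not shown here.

    import numpy as np

    def greedy_forward_selection(X, y, n_features=None):
        """Greedy wrapper selection: add, one at a time, the feature that
        minimizes the cross-validation error of the current subset."""
        n_total = X.shape[1]
        n_features = n_features or n_total
        selected, errors = [], []
        remaining = list(range(n_total))
        while remaining and len(selected) < n_features:
            best_f, best_err = None, np.inf
            for f in remaining:
                err = cv_error(X, y, selected + [f])
                if err < best_err:
                    best_f, best_err = f, err
            selected.append(best_f)
            remaining.remove(best_f)
            errors.append(best_err)            # CV error after each addition
        best_size = int(np.argmin(errors)) + 1
        return selected[:best_size], errors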
(Fig. 6.3 diagram: image feature vectors, with all features and with the selected ones, feed a 10-fold CV train/test loop; the resulting error guides the choice of the best feature set.)
Fig. 6.4. Comparison of feature selection using 2 and 12 bin histograms, on the
eight-class indoor experiment.
Fig. 6.5. Evolution of the CV error for different numbers of classes. Figure by B. Bonev, F. Escolano and M.A. Cazorla (©2008 Springer).
6.2.4 Experiments
Fig. 6.6. The nearest neighbors of different test images. The training set from which
the neighbors are extracted contains 721 images taken during an indoor–outdoor
walk. The amount of low-level filters selected for building the classifier is 13, out of
48 in total. Note that the test images of the first column belong to a different set of
images. Figure by B. Bonev, F. Escolano and M.A. Cazorla (©2008 Springer).
Fig. 6.7. The nearest neighbor (from among the train images, Y axis) of each one
of the test images, X axis. In the ideal case the line should be almost straight, as
the trajectories of both train and test sets are similar.
where S is the vector of selected features and ci ∈ C is a class from all the
possible classes C existing in the data.
The Bayesian error rate is the ultimate criterion for discrimination; however, it is not useful as a cost, due to the nonlinearity of the max(·) function. Therefore, some alternative cost function has to be used. In the literature there are many bounds on the Bayesian error. An upper bound obtained by Hellman and Raviv (1970) is

E(S) \leq \frac{H(C|S)}{2} \qquad (6.3)
This bound is related to mutual information, because mutual information can be expressed as

I(S; C) = H(C) - H(C|S)

and H(C) is the entropy of the class labels, which does not depend on the feature subspace S. Therefore, maximizing the mutual information is equivalent to minimizing the upper bound (Eq. 6.3) on the Bayesian error. There is a lower bound on the Bayesian error as well, obtained by Fano (1961), which is also related to mutual information.
The relation of mutual information with the Kullback–Leibler (KL) divergence also justifies the use of mutual information for feature selection. The KL divergence is defined as

KL(P \| Q) = \int_x p(x) \log \frac{p(x)}{q(x)}\, dx,

with the corresponding sum in the discrete case. From the definition of mutual information, and given that the conditional probability can be expressed as p(x|y) = p(x, y)/p(y), we have that
I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (6.4)

= \sum_{y \in Y} p(y) \sum_{x \in X} p(x|y) \log \frac{p(x|y)}{p(x)} \qquad (6.5)

= \sum_{y \in Y} p(y)\, KL\left( p(x|y) \,\|\, p(x) \right) \qquad (6.6)

= E_Y\left[ KL\left( p(x|y) \,\|\, p(x) \right) \right] \qquad (6.7)
Some works on feature selection avoid the multidimensional data entropy estimation by working with single features. This, of course, is not equivalent to the maximization of I(S; C). In the approach of Peng et al. [125] the feature selection criterion takes into account the mutual information between each separate feature and the classes, but it also subtracts the redundancy of each separate feature with respect to the already selected ones; it is explained in the next section. A simpler approach is to limit the cost function to evaluating only the mutual information between each selected feature x_i ∈ S and the classes C:

I(S^*; C) \approx \sum_{x_i \in S^*} I(x_i; C) \qquad (6.8)
2
Some authors refer to the maximization of the mutual information between the
features and the classes as infomax criterion.
where x∗i is the ith most important feature and S∗1,i−1 = {x∗1 , . . . , x∗i−1 } is the
set of the first i − 1 best features, which have been selected before selecting
x∗i . This expression is obtained by applying the chain rule of mutual informa-
tion. For the mutual information between N variables X1 , . . . , XN , and the
variable Y , the chain rule is
I(X_1, X_2, \ldots, X_N; Y) = \sum_{i=1}^{N} I(X_i; Y \,|\, X_{i-1}, X_{i-2}, \ldots, X_1)
The property from Eq. 6.9 is helpful for understanding the kind of trade-
off between discriminant power maximization and redundancy minimization
which is achieved by I(S∗ ; C). The first summation measures the individual
discriminant power of each feature belonging to the optimal set. The second
summation penalizes those features x∗i which, together with the already se-
lected ones S∗1,i−1 , are jointly informative about the class label C. This means
that if S∗1,i−1 is already informative about the class label, the informativeness
of the feature x∗i is the kind of redundancy which is penalized. However, those
features which are redundant, but do not inform about the class label, are
not penalized.
Given this property, Vasconcelos et al. [166] focus the feature selection
problem on visual processing with low level features. Several studies report
that there exist universal patterns of dependence between the features of bi-
ologically plausible image transformations. These universal statistical laws
of dependence patterns are independent of the image class. This conjecture
implies that the second summation in Eq. 6.9 would probably be close to
zero, because of the assumption that the redundancies which carry informa-
tion about the class are insignificant. In this case, only the first summation
would be significant for the feature selection process, and the approximation
in Eq. 6.8 would be valid. This is the most relaxed feature selection cost,
in which the discriminant power of each feature is individually measured.
An intermediate strategy was introduced by Vasconcelos et al. They se-
quentially relax the assumption that the dependencies are not informative
about the class. By introducing the concept of l-decomposable feature sets
they divide the feature set into disjoint subsets of size l. The constraint is
that any dependence which is informative about the class label has to be
between the features of the same subset, but not between susbsets. If S∗ is
the optimal feature subset of size N and it is l-decomposable into the subsets
T1 , . . . , TN/l , then
6.3 Filters Based on Mutual Information 225
N
I(S∗ ; C) = I(x∗i ; C)
i=1
&
N i−1/l
'
− I(x∗i ; T̃j,i ) − I(x∗i ; T̃j,i |C) (6.10)
i=2 j=1
where T̃j,i is the subset of Tj containing the features of index smaller than k.
This cost function makes possible an intermediate strategy which is not as
relaxed as Eq. 6.8, and is not as strict as Eq. 6.9. The gradual increase of
the size of the subsets Tj allows to find the l at which the assumption about
noninformative dependences between the subsets becomes plausible.
The assumption that the redundancies between features are independent of
the image class is not realistic in many feature selection problems, even in the
visual processing field. In the following section, we analyze some approaches
which do not make the assumption of Eq. 6.8. Instead they take into consid-
eration the interactions between all the features.
Peng et al. present in [125] a Filter Feature Selection criterion based on mutual
information estimation. Instead of estimating the mutual information I(S; C)
between a whole set of features and the class labels (also called prototypes),
they estimate it for each one of the selected features separately. On the one
hand they maximize the relevance I(xj ; C) of each individual feature xj ∈ F.
On the other hand they minimize the redundancy between xj and the rest of
selected features xi ∈ S, i = j. This criterion is known as the min-Redundancy
Max-Relevance (mRMR) criterion and its formulation for the selection of the
mth feature is
⎡ ⎤
1
max ⎣I(xj ; C) − I(xj ; xi )⎦ (6.11)
xj ∈F−Sm−1 m−1
xi ∈Sm−1
implies that by the time the mth feature xm has to be selected, there already
are m−1 selected features in the set of selected features Sm−1 . By defining the
following measure for the x1 , x2 , . . . , xn scalar variables (i.e., single features):
J(x1 , x2 , . . . , xn )
p(x1 , x2 , . . . , xn )
= · · · p(x1 , x2 , . . . , xn ) log dx1 · · · dxn ,
p(x1 )p(x2 ) · · · p(xn )
it can be seen that selecting the mth feature with mRMR first-order incre-
mental search is equivalent to maximizing the mutual information between
Sm and the class C. Equations 6.12 and 6.13 represent the simultaneous max-
imization of their first term and minimization of their second term. We show
the equivalence with mutual information in the following equation (Eq. 6.14):
This reasoning can also be denoted in terms of entropy. We can write J(·) as
therefore
J(Sm−1 , xm ) = J(Sm ) = H(xi ) − H(Sm )
xi ∈Sm
and
J(Sm−1 , xm , C) = J(Sm , C) = H(xi ) + H(C) − H(Sm , C)
xi ∈Sm
6.3 Filters Based on Mutual Information 227
J(Sm−1 , xm , C) − J(Sm−1 , xm )
! "
= H(xi ) + H(C) − H(Sm , C) − H(xi ) − H(Sm )
xi ∈Sm xi ∈Sm
= H(C) − H(Sm , C) + H(S) = I(S, C)
is performed with the aid of Entropic Spanning Graphs for entropy estimation
[74], as explained in Chapter 4. This entropy estimation is suitable for data
with a high number of features and a small number of samples, because its
complexity depends on the number ns of samples (O(ns log(ns ))) but not
on the number of dimensions. The MI can be calculated from the entropy
estimation in two different ways, with the conditional entropy and with the
joint entropy:
p(s, c)
I(S; C) = p(x, c) log (6.18)
p(x)p(c)
x∈S c∈C
= H(S) − H(S|C) (6.19)
= H(S) + H(C) − H(S, C) (6.20)
where x is a feature from the set of selected features S and c is a class label
belonging to the set of prototypes C.
Provided that entropy can be estimated for high-dimensional data sets,
different IT-based criteria can be designed, depending on the problem. For
example, the Max-min-Dependency (MmD) criterion (Eq. 6.21), in addition
to the Max-Dependency maximization, also minimizes the mutual information
between the set of discarded features and the classes:
max [I(S; C) − I(F − S; C)] (6.21)
S⊆F
Then, for selecting the mth feature, Eq. 6.22 has to be maximized:
max [I(Sm−1 ∪ {xj }; C) − I(F − Sm−1 − {xj }; C)] (6.22)
xj ∈F−Sm−1
The aim of the MmD criterion is to avoid leaving out features which have infor-
mation about the prototypes. In Fig. 6.8 we show the evolution of the criterion
as the number of selected features increases, as well as the relative values of
the terms I(S; C) and I(F − S; C), together with the 10-fold CV and test
errors of the feature sets.
The term “greedy” refers to the kind of searches in which the decisions cannot
be undone. In many problems, the criterion which guides the search does not
necessarily lead to the optimal solution and usually falls into a local maximum
(minimum). This is the case of forward feature selection. In the previous
sections we presented different feature selection criteria. With the following toy
problem we show an example of incorrect (or undesirable) feature selection.
Suppose we have a categorical data set. The values of categorical variables
are labels and these labels have no order: the comparison of two categorical
values can just tell whether they are the same or different. Note that if the
data are not categorical but they are ordinal, regardless if they are discrete
6.3 Filters Based on Mutual Information 229
Mutual Information (MmD criterion) Feature Selection
100
90
80
40
30
20
10
0
0 5 10 15 20 25 30 35 40 45
# features
Fig. 6.8. MD and MmD criteria on image data with 48 features. Figure by B. Bonev,
F. Escolano and M.A. Cazorla (2008
c Springer).
or continuous, then a histogram has to be built for the estimation of the
distribution. For continuous data, a number of histogram bins have to be cho-
sen necessarily, and for some discrete, but ordinal data, it is also convenient.
For example, the distribution of the variable x = {1, 2, 1,002, 1,003, 100} could
be estimated by a histogram with 1,003 bins (or more) where only five bins
would have a value of 1. This kind of histogram is too sparse. A histogram
with 10 bins offers a more compact representation, though less precise, and
the distribution of x would look like ( 25 , 15 , 0, 0, 0, 0, 0, 0, 0, 25 ). There also are
entropy estimation methods which bypass the estimation of the probability
distribution, as detailed in Chapter 4. These methods, however, are not suit-
able for categorical varibles. For simplicity we present an example with cat-
egorical data, where the distribution of a variable x = {A, B, Γ, A, Γ } is
P r(x = A) = 23 , P r(x = B) = 13 , P r(x = Γ ) = 23 .
The data set of the toy-example contains five samples defined by three
features, and classified into two classes.
x1 x2 x3 C
A Z Θ C1
B Δ Θ C1
Γ E I C1
A E I C2
Γ Z I C2
The mutual information between each single feature xi , 1 ≤ i ≤ 3 and the
class C is
I(x1 , C) = 0.1185
I(x2 , C) = 0.1185 (6.23)
I(x3 , C) = 0.2911
230 6 Feature Selection and Transformation
Therefore, both mRMR and MD criteria would decide to select x3 first. For the
next feature which could be either x1 or x2 , mRMR would have to calculate
the redundancy of x3 with each one of them:
In this case both values are the same and it does not matter which one to
select. The feature sets obtained by mRMR, in order, would be: {x3 }, {x1 , x3 },
{x1 , x2 , x3 }.
To decide the second feature (x1 or x2 ) with the MD criterion, the mutual
information between each one of them with x3 , and the class C, has to be es-
timated. According to the definition of MI, in this discrete case, the formula is
# p(x1 , x3 , C)
$
I(x1 , x3 ; C) = p(x1 , x3 , C) log
x x
p(x1 , x3 )p(C)
C 3 1
Then MD would also select any of them, as first-order forward feature se-
lection with MD and mRMR is equivalent. However, MD can show us that
selecting x3 in first place was not a good decision, given that the combination
of x1 and x2 has much higher mutual information with the class:
and if two feature sets provide the same information about the class, the
preferred is the one with less features: x3 is not informative about the class,
given x1 and x2 .
6.3 Filters Based on Mutual Information 231
The MmD criterion would have selected the features in the right order
in this case, because it not only calculates the mutual information about the
selected features, but it also calculates it for the nonselected features. Then, in
the case of selecting x3 and leaving unselected x1 and x2 , MD would prefer not
to leave together an unselected pair of features which jointly inform so much
about the class. However, MmD faces the same problem in other general cases.
Some feature selection criteria could be more suitable for one case or another.
However, there is not a criterion which can avoid the local maxima when used
in a greedy (forward or backward) feature selection. Greedy searches with a
higher-order selection, or algorithms which allow both addition and deletion
of features, can alleviate the local minima problem.
Even though greedy searches can fall into local maxima, it is possible to
achieve the highest mutual information possible for a feature set, by means of
a greedy backward search. However, the resulting feature set, which provides
this maximum mutual information about the class, is usually suboptimal.
There are two kinds of features which can be discarded: irrelevant features
and redundant features. If a feature is simply irrelevant to the class label, it
can be removed from the feature set and this would have no impact on the
mutual information between the rest of the features and the class. It is easy
to see that removing other features from the set is not conditioned by the
removal of the irrelevant one.
However, when a feature xi is removed due to its redundancy given other
features, it is not so intuitive if we can continue removing from the remaining
features, as some subset of them made xi redundant. By using the mutual
information chain rule we can easily see the following. We remove a feature xi
from the set Fn with n features, because that feature provides no additional
information about the class, given the rest of the features Fn−1 . Then we
remove another feature xi because, again, it provides no information about
the class given the subset Fn−2 . In this case, the previously removed one, xi ,
will not be necessary anymore, even after the removal of xi . This process can
continue until it is not possible to remove any feature because otherwise the
mutual information would decrease. Let us illustrate it with the chain rule of
mutual information:
n
I(S; C) = I(x1 , . . . , xn ; C) = I(xi ; C|xi−1 , xi−2 , . . . , x1 ) (6.28)
i=1
I(x1 , x2 , x3 , x4 ; C) = I(x1 ; C)
+ I(x2 ; C|x1 )
+ I(x3 ; C|x1 , x2 )
+ I(x4 ; C|x1 , x2 , x3 )
I(X;C)
C
x1
x4 x2 x3
Fig. 6.9. A Venn diagram representation of a simple feature selection problem where
C represents the class information, and X = {x1 , x2 , x3 , x4 } is the complete feature
set. The colored area represents all the mutual information between the features of
X and the class information. The feature x4 does not intersect this area; this means
that it is irrelevant.
6.3 Filters Based on Mutual Information 233
I(X;C)
I(X;C)
C
x1 C
x3
x2
Fig. 6.10. In Fig. 6.9, the features {x1 , x2 } (together) do not provide any further
class information than x3 provides by itself, and vice versa: I(x1 , x2 ; C|x3 ) = 0 and
I(x3 ; C|x1 , x2 ) = 0. Both feature sets, {x1 , x2 } (left) and x3 (right), provide the
same information about the class as the full feature set.
set there are six binary features, {x1 , x2 , x3 , x4 , x5 , x6 }. The class label is also
binary and it is the result of the operation:
C = (x1 ∧ x2 ) ∨ (x3 ∧ x4 )
where the Intersection property is only valid for positive probabilities. (Nega-
tive probabilities are used in several fields, like quantum mechanics and math-
ematical finance.)
Markov blankets are defined in terms of conditional independence. The set
of variables (or features) M is a Markov blanket for the variable xi , if xi is
conditionally independent of the rest of the variables F − M − {xi }, given M:
or
xi ⊥ F − M − {xi } | M (6.30)
where F is the set of features {x1 , . . . , xN }. Also, if M is a Markov blanket
of xi , then the class C is conditionally independent of the feature given the
Markov blanket: xi ⊥ C | M. Given these definitions, if a feature xi has a
Markov blanket among the set of features F used for classification, then xi
can safely be removed from F without losing any information for predicting
the class.
Once a Markov blanket for xi is found among F = {x1 , . . . , xN } and xi is
discarded, the set of selected (still not discarded) features is S = F − {xi }.
In [97] it is proven that, if some other feature xj has a Markov blanket among
S, and xj is removed, then xi still has a Markov blanket among S − {xj }.
This property of the Markov blankets makes them useful as a criterion for a
greedy feature elimination algorithm. The proof is as follows:
Let Mi ⊆ S be a Markov blanket for xi , not necessarily the same blanket
which was used to discard the feature. Similarly, let Mj ⊆ S be a Markov
6.3 Filters Based on Mutual Information 235
xi ⊥ X | Mi ∪ Mj (6.31)
In first place, from the assumption about the Markov blanket of xj we have
that
xj ⊥ S − Mj − {xj } | Mj
Using the Decomposition property (Eq. 6.29) we can decompose the set
S − Mj − {xj } and we obtain
xj ⊥ X ∪ Mi | Mj
Using the Weak union property (Eq. 6.29), we can derive from the last
statement:
xj ⊥ X | Mi ∪ Mj (6.32)
For xi we follow the same derivations and we have
and, therefore,
xi ⊥ X | Mj ∪ Mi ∪ {xj } (6.33)
From Eqs. 6.32 and 6.33, and using the Contraction property (Eq. 6.29) we
derive that
{xi } ∪ {xj } ⊥ X | Mj ∪ Mi
which, with the Decomposition property (Eq. 6.29), is equivalent to Eq. 6.31;
therefore, it is true that after the removal of xj , the subset Mi ∪ Mj is a
Markov blanket for xi .
In practice it would be very time-consuming to find a Markov blanket for
each feature before discarding it. In [97] they propose a heuristic in which
they fix a size K for the Markov blankets for which the algorithm searches.
The size K depends very much on the nature of the data. If it is too low, it is
not possible to find good Markov blankets. If it is too high, the performance
is also negatively affected. Among other experiments, the authors of [97] also
experiment with the “Corral” data set, already presented in the previous
section. With the appropriate K they successfully achieve the correct feature
selection on it, similar to the result shown in the previous section with the
MD greedy backward elimination.
236 6 Feature Selection and Transformation
40
LOOCV error (%)
I(S;C)
35
% error and MI
30
25
20
15
10
20 40 60 80 100 120 140 160
# features
Fig. 6.11. Maximum dependency feature selection performance on the NCI microar-
ray data set with 6,380 features. The mutual information of the selected features is
represented.
6.3 Filters Based on Mutual Information 237
MD Feature Selection mRMR Feature Selection
CNS CNS 1
CNS CNS
CNS CNS
RENAL RENAL
BREAST BREAST
CNS CNS
CNS CNS 0.9
BREAST BREAST
NSCLC NSCLC
NSCLC NSCLC
RENAL RENAL
RENAL RENAL
RENAL RENAL 0.8
RENAL RENAL
RENAL RENAL
RENAL RENAL
RENAL RENAL
BREAST BREAST
NSCLC NSCLC
RENAL RENAL 0.7
UNKNOWN UNKNOWN
OVARIAN OVARIAN
MELANOMA MELANOMA
PROSTATE PROSTATE
OVARIAN OVARIAN
OVARIAN OVARIAN 0.6
OVARIAN OVARIAN
OVARIAN OVARIAN
Class (disease)
OVARIAN OVARIAN
PROSTATE PROSTATE
NSCLC NSCLC
NSCLC NSCLC
NSCLC NSCLC
0.5
LEUKEMIA LEUKEMIA
K562B−repro K562B−repro
K562A−repro K562A−repro
LEUKEMIA LEUKEMIA
LEUKEMIA LEUKEMIA
LEUKEMIA LEUKEMIA 0.4
LEUKEMIA LEUKEMIA
LEUKEMIA LEUKEMIA
COLON COLON
COLON COLON
COLON COLON
COLON COLON 0.3
COLON COLON
COLON COLON
COLON COLON
MCF7A−repro MCF7A−repro
BREAST BREAST
MCF7D−repro MCF7D−repro
BREAST BREAST 0.2
NSCLC NSCLC
NSCLC NSCLC
NSCLC NSCLC
MELANOMA MELANOMA
BREAST BREAST
BREAST BREAST 0.1
MELANOMA MELANOMA
MELANOMA MELANOMA
MELANOMA MELANOMA
MELANOMA MELANOMA
MELANOMA MELANOMA
MELANOMA MELANOMA 0
→ 2080
→ 6145
1177
1470
1671
3227
3400
3964
4057
4063
4110
4289
4357
4441
4663
4813
5226
5481
5494
5495
5508
5790
5892
6013
6019
6032
6045
6087
6184
6643
→ 135
246
663
766
982
→ 2080
→ 6145
19
1378
1382
1409
1841
2081
2083
2086
3253
3371
3372
4383
4459
4527
5435
5504
5538
5696
5812
5887
5934
6072
6115
6305
6399
6429
6430
6566
→ 135
133
134
233
259
381
561
Fig. 6.12. Feature selection on the NCI DNA microarray data. The MD (left) and
mRMR (right) criteria were used. Features (genes) selected by both criteria are
marked with an arrow. See Color Plates.
validation error.3 The error keeps on descending until 39 features are selected,
then it increases, due to the addition of redundant and noisy features. Al-
though there are 6,380 features in total, only feature sets up to size 165 are
represented on the graph.
In Fig. 6.12 we have represented the gene expression matrices of the fea-
tures selected by MD and mRMR. There are only three genes which were
selected by both criteria. This is due to the differences in the mutual infor-
mation estimation and to the high number of different features in contrast to
the small number of samples.
Finally, in Fig. 6.13, we show a comparison of the different criteria in
terms of classification errors. The data set used is extracted from image fea-
tures with a total number of 48 features. Only the errors of the first 20 features
sets are represented, as for larger feature sets the error does not decrease. Both
the 10-fold CV error and the test error are represented. The latter is calculated
with a separate test set which is not used in the feature selection process, that
is why the test error is higher than the CV error. The CV errors of the feature
3
This error measure is used when the number of samples is so small that a test
set cannot be built. In filter feature selection the LOOCV error is not used as a
selection criterion. It is used after the feature selection process, for evaluating the
classification performance achieved.
238 6 Feature Selection and Transformation
Fig. 6.13. Feature selection performance on image histograms data with 48 fea-
tures, 700 train samples, and 500 test samples, labeled with six classes. Comparison
between the Max-Dependency (MD), Max-min-Dependency (MmD) and the min-
Redundancy Max-Relevance (mRMR) criteria. Figure by B. Bonev, F. Escolano and
M.A. Cazorla (2008
c Springer).
sets yielded by MmD are very similar to those of MD. Regarding mRMR and
MD, in the work of Peng et al. [125], the experimental results given by mRMR
outperformed MD, while in the work of Bonev et al. [25] MD outperformed
mRMR for high-dimensional feature sets. However, mRMR is theoretically
equivalent to first-order incremental MD. The difference in the results is due
to the use of different entropy estimators.
and the log-likelihood has two useful properties, related to its derivatives, for
finding the optimal Λ:
• First derivative (Gradient)
∂L(Λ) 1 ∂Z(Λ, I)
= = E(Gj (I)) − αj ∀j (6.37)
∂λj Z(Λ, I) ∂λj
• Second derivative (Hessian)
∂ 2 L(Λ)
= E((Gj (I) − αj )(Gj (I) − αj )T ) ∀j, k (6.38)
∂λj ∂λk
240 6 Feature Selection and Transformation
The first property provides an iterative method for obtaining the optimal Λ
through gradient ascent:
dλj
= E(Gj (I)) − αj , j = 1, . . . , m (6.39)
dt
The convergence of the latter iterative method is ensured by the property
associated to the Hessian. It turns out that the Hessian of the log-likelihood
is the covariance matrix of (G1 (I), . . . , Gm (I)), and such a covariance matrix
is definite positive under mild conditions. Definite positiveness of the Hessian
ensures that L(Λ) is concave, and, thus, a unique solution for the optimal
Λ exists. However, the main problem of Eq. 6.63 is that the expectations
E(Gj (I)) are unknown (only the sample expectations αj are known). An
elegant, though computational intensive way of estimating E(Gj (I)) is to use
a Markov chain, because Markov chain Monte Carlo methods, like a Gibbs
sampler (see Alg. 13), ensure that in the limit (M → ∞) we approximate the
expectation:
1
M
E(Gj (I)) ≈ Gj (Isyn syn
i ) = αj (Λ), j = 1, . . . , m (6.40)
M i=1
where Isyn
i are samples from p(I; Λ), Λ being the current multipliers so far.
Such samples can be obtained through a Gibbs sampler (see Alg. 13 where G
is the number of intensity values) starting from a pure random image. The jth
filter is applied to the ith generated image resulting the Gj (Isyn
i ) in the latter
equation. It is interesting to remark here that the Λ determine the current,
provisional, solution to the maximum entropy problem p(I; Λ), and, thus, the
synthesized images match partially the statistics of the observed ones. It is
very interesting to remark that the statistics of the observed images are used
to generate the synthesized ones. Therefore, if we have a fixed set of m filters,
the synthesizing algorithm proceeds by computing at iteration t = 1, 2, . . .
dλtj
= Δtj = αsyn
j (Λ ) − αj ,
t
j = 1, . . . , m (6.41)
dt
Then we obtain λtj + 1 ← λtj + Δtj and consequently Λt + 1 . Then, a new it-
eration begins. Therefore, as we approximate the expectations of each sub-
band and then we integrate all the multipliers in a new Λt + 1 , a new model
p(I; Λt + 1 ) from which we draw samples with the Markov chains is obtained.
As this model matches more and more the statistics of the observations it is not
surprising to observe that as the iterations progress, the results resemble more
and more the observations, that is, the images from the target class of textures.
For a fixed number of filters m, Alg. 12 learns a synthesized image from an ob-
served one. Such an algorithm exploits the Gibbs sampler (Alg. 13) attending
to the Markov property (intensity depends on the one of the neighbors). After
6.4 Minimax Feature Selection for Generative Models 241
applying the sampler and obtaining a new image, the conditional probabili-
ties of having each value must be normalized so that the sum of conditional
probabilities sum one. This is key to provide a proper histogram later on.
As the FRAME (Filters, Random Fields and Maximum Entropy) algorithm
depends on a Markov chain it is quite related to simulated annealing in the
sense that it starts with a uniform distribution (less structure-hot) and con-
verges to the closest unbiased distribution (target structure-cold) satisfying
the expectation constraints. The algorithm converges when the distance be-
tween the statistics of the observed image and the ones of the synthesized
image does not diverge too much (d(·) can be implemented as the sum of the
component-by-component absolute differences of these vectors).
242 6 Feature Selection and Transformation
The FRAME algorithm synthesizes the best possible image given the m fil-
ters used, but: (i) how many filters do you need for having a good result; (ii)
if you constrain the number of filters, which is reasonable for computational
reasons, what are the best filters? For instance, in Fig. 6.14, we show how
the quality estimation improves as we use more and more filters. A perfect
Fig. 6.14. Texture synthesis with FRAME: (a) observed image, (b) initial random
image, (c,d,e,f) synthesized images with one, two, three and six filters. Figure by
S.C. Zhu, Y.N. Wu and D. Mumford (1997
c MIT Press).
6.4 Minimax Feature Selection for Generative Models 243
end
Choose Fis+1 as the filter maximizing d(β) among those belonging to B/S
S ← S ∪ {Fis+1 }
s←s+1
Given p(I) and Isyn run the FRAME algorithm to obtain p∗ (I) and Isyn∗
p(I) ← p∗ (I) and Isyn ← Isyn∗
until (d(β) < );
Output: S
244 6 Feature Selection and Transformation
The search for a proper space where to project the data is more related to the
concept of feature transformation than to the one of feature selection. How-
ever, both concepts are complementary in the sense that both of them point
towards optimizing/simplifying the classification and/or clustering problems.
It is then no surprising that techniques like PCA (principal component anal-
ysis), originally designed to dimensionality reduction, have been widely used
for face recognition or classification (see the yet classic papers of Pentland
et al. [126, 160]). It is well known that in PCA the N vectorized training
images (concatenate rows or columns) xi ∈ Rk are mapped to a new space
(eigenspace) whose origin is the average vector x̂. The differences between
input vectors and the average are di = (xi − x̂) (centered patterns). Let X
be the k × N matrix whose colunms are the di . Then, the N eigenvectors
φi of XT X are the orthonormal axes of the eigenspace, and the meaning of
their respective eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λN is the importance of each
6.5 From PCA to gPCA 245
8 8 8
6 6 6
4 4 4
2 2 2
0 0 0
−2 −2 −2
−4 −4 −4
−6 −6 −6
−8 −8 −8
−8 −6 −4 −2 0 2 4 6 8 −8 −6 −4 −2 0 2 4 6 8 −8 −6 −4 −2 0 2 4 6 8
Fig. 6.15. From left to right: PCA axes, ICA axes, and gPCA axes (polynomials).
246 6 Feature Selection and Transformation
k
k
I(y1 , . . . , yk ) = H(yi ) − H(y) = H(yi ) − H(x) − log | det W| (6.48)
i=1 i=1
However, we may assume that the yi are uncorrelated (if two variables are
independent then they are uncorrelated but the converse is not true in general)
and of unit variance: data are thus white. Given original centered z, a withening
transformation, which decorrelates the data, is done by z̃ = ΦD−1/2 ΦT , Φ is
the matrix of eigenvectors of the covariance matrix of centered data E(zzT )
and D = diag{λ1 , . . . , λk }. White data y satisfy E(yyT ) = I (unit variances,
that is, the covariance is the identity matrix). Then
which in turns implies that det W must be constant. For withened yi , en-
tropy and neg-entropy differ by a constant and therefore I(y1 , . . . , yk ) =
k k
C − i=1 J(yi ) and ICA may proceed by maximizing i=1 J(yi ).
One of the earlier algorithms for ICA is based on the infomax principle [14].
Consider a nonlinear scalar function gi (·) like the transfer function in a neural
network, but satisfying gi = fi (yi ), fi being the pdfs (derivatives coincident
with the densities). In these networks there are i = 1, 2, . . . , k neurons, each
one with weight wi . The weights must be chosen to maximize the transferred
information. Thus, the network must maximize H(g1 (wT1 x), . . . , gk (wTk x)).
For a single input, say x, with pdf fx (x) (see Fig. 6.16, top), and a single
neuron with transfer function g(x), the amount of information transferred de-
pends on fy (y ≡ g(x)), and the shape of fy depends on the matching between
the threshold w0 and the mean x̄ and variance of fx and also on the slope of
g(x). The optimal weight is the one maximizing the output information. The
purpose is to find w maximizing I(x, y) = H(y) − H(y|x) then, as H(y|x) is
6.5 From PCA to gPCA 247
∂I(x, y) ∂H(y)
= (6.51)
∂w ∂w
This is coherent with maximizing the output entropy. We have that
8 8
fx (x) 8 ∂y 8
fy (y) = ⇒ H(y) = −E(ln fy (y)) = E ln 88 88 − E(ln fx (x))
|∂y/∂x| ∂x
(6.52)
and a maximization algorithm may focus on the first term (output dependent).
Then 8 8 −1
∂H ∂ 8 ∂y 8 ∂y ∂ ∂y
Δw ∝ = ln 88 88 = (6.53)
∂w ∂w ∂x ∂x ∂w ∂x
Then, using as g(·) the usual sigmoid transfer: y = g(x) = 1/(1 + e−u ) : u =
wx + w0 whose derivative ∂y/∂x is wy(1 − y) it is straightforward to obtain
1
Δw ∝ + x(1 − 2y), Δw0 ∝ 1 − 2y (6.54)
w
When extending to many units (neurons) we have a weight matrix W to esti-
mate, a bias vector w0 (one bias component per unit), and y = g(Wx + w0 ).
Here, the connection between the multivariate input–output pdfs depends on
the absolute value of the Jacobian J:
⎛ ∂y1 ∂y1 ⎞
∂x1 . . . ∂xk
fx(x) ⎜ .. ⎟
fy (y) = , J = det ⎝ ... . ⎠ (6.55)
|J| ∂yk ∂yk
∂x1 . . . ∂xk
cof wij
wij ∝ + xj (1 − 2yi ) (6.57)
det W
cof wij being the co-factor of component wij , that is, (−1)i+1 times the
determinant of the matrix resulting from removing the ith row and jth col-
umn of W.
The latter rules implement an algorithm which maximizes I(x; y) through
maximizing H(y). Considering a two-dimensional case, infomax maximizes
H(y1 , y2 ), as H(y1 , y2 ) = H(y1 ) + H(y2 ) − I(y1 , y2 ) this is equivalent to min-
imizing I(y1 , y2 ) which is the purpose of ICA. However, the latter algorithm
does not guarantee the finding of a global minimum unless certain conditions
248 6 Feature Selection and Transformation
Fig. 6.16. Infomax principle. Top: nonlinear transfer functions where the threshold
w0 matches the mean of fx (a); selection of the optimal weight wopt attending to the
amount of information transferred (b). Bottom: maximizing the joint entropy does
not always result in minimizing the joint mutual information properly – see details
in text. Figures by A.J. Bell and T.J. Sejnowski (1995
c MIT Press).
Fig. 6.17. Left: axes derived from PCA, ICA, and other variants. Right: axes derived
from ICA. Figures by A.J. Bell and T.J. Sejnowski (1995
c MIT Press).
subject to ||x|| = 1. Input data x are assumed to be withened, and this implies
that wx has unit variance. Let g1 (y) = tanh(ay) and g2 (y) = ue−u /2 , with
2
the usual setting a = 1, be the derivatives of the G(·) functions. Then, FastICA
starts with a random w and computes
k
J(Θ) = H(yi ), (6.61)
i=1
where Θ are the parameters defining the rotation matrix R. Such parameters
are the k(k − 1)/2 Givens angles θpq of a k × k rotation matrix. The rotation
matrix Rpq (θpq ) is built by replacing the entries (p, p), (p, q) and (q, p) of the
identity matrix by cos θpq , − sin θpq and cos θpq , respectively. Then, the R is
computed as the product of all the 2D rotations:
n−1
n
R(Θ) = Rpq (θpq ) (6.62)
p=1 q=p+1
Then, the method proceeds by estimating the optimal θpq minimizing Eq. 6.61.
A gradient descent method over the given angles would proceed by computing
∂J(Θ) ∂H(yi )
k
= (6.63)
∂θpq i=1
∂θpq
which implies computing the derivative of the entropy and thus implies en-
tropy estimation. At this point, the maximum entropy principle tells us to
choose the most unbiased distribution (maximum entropy) satisfying the ex-
pectation constraints:
+∞
p∗ (ξ) = arg max − p(ξ) log p(ξ) dξ
p(ξ) −∞
+∞
s.t p(ξ)Gj (ξ) dξ = E(Gj (ξ)) = αj , j = 1, . . . , m
−∞
+∞
p(ξ) dξ = 1 (6.64)
−∞
6.5 From PCA to gPCA 251
u = p(ξ) dv = Gj (ξ)
m +∞
du = λr Gr (ξ) p(ξ), v = Fj (ξ) = Gj (ξ) dξ (6.67)
r=1 −∞
Therefore
8 +∞
m
αj = p(ξ)Fj (ξ) 8+∞
−∞ − Fj (ξ) λr Gr (ξ) p(ξ) dξ (6.68)
−∞ r=1
When the constraint functions are chosen among the moments of the variables,
the integrals Fj (ξ) do not diverge faster than the exponential decay of the
pdf representing the solution to the maximum entropy problem. Under these
conditions, the first term of the latter equation tends to zero and we have
m +∞
αj = − λr Fj (ξ)Gr (ξ)p(ξ) dξ
r=1 −∞
m
m
=− λr E(Fj (ξ)Gr (ξ)) = − λr βjr (6.69)
r=1 r=1
where the βjr may be obtained from the sample means for approximating
E(Fj (ξ)Gr (ξ)). Once we also estimate the αj from the sample, the vector of
Lagrange multipliers Λ = (λ1 , . . . , λm )T is simply obtained as the solution of
a linear system:
All the derivations between Eqs. 6.64 and 6.71 are referred to one generic
variable ξ. Let yi = ξ be the random variable associated to the ith dimension
of the output (we are solving a maximum entropy problem for each individual
output variable). Updating the notation consequently, we obtain
∂H(yi ) i ∂αri
m
= λr (6.72)
∂θpq r=1
∂θpq
Therefore, there is a very interesting link between the gradient of the cost
function and the expectation constraints related to each output variable. But
how to compute ∂αri /∂θpq ? If we approximate αri by the sample mean, that
N
is, αri = (1/N ) j=1 Gr (yij ). Now, the latter derivative is expanded using the
chain rule:
∂H(yi ) j ∂yij
N
= Gr (yi )
∂θpq j=1
∂θpq
T T
N
∂y j
∂Rj:
j i
= Gr (yi )
j=1
∂Rj: ∂θpq
N T
∂R
= Gr (yij ) (6.73)
j=1
∂θpq j:
where j constrains the derivative to the jth row of the matrix. Anyway, the
derivative of R with respect to θpq is given by
6.5 From PCA to gPCA 253
p−1 k
∂R
q−1
uv pv
= R (θuv ) R (θpv )
∂θpq u=1 v=u+1 v=p+1
∂Rpq (θpq )
k
k−1
k
× pv
R (θpv ) uv
R (θuv ) (6.74)
∂θpq v=q+1 u=p+1 v=u+1
and the final gradient descent rule minimizes the sum of entropies as follows:
k
∂H(yi )
Θt+1 = Θt − η (6.75)
i=1
∂Θ
that is, the negative gradient direction depends both on the αi and the β i ,
which actually depend on αi and the multipliers. As the αi depend on the
typically, higher-order moments used to define the constraints, the update
direction depends both on the gradients of the moments and the gradients of
non-Gaussianity: the non-Gaussianity of sub-Gaussian signals is minimized,
whereas the one of super-Gaussian signals is maximized. As we show in
Fig. 6.18 (left), as the entropy (non-Gaussianity) of the input distribution
increases, the same happens with the super-Gaussianity (positive kurtosis)
which is a good property as stated when discussing infomax.
24
0.2
22 Minimax ICA
0.15 Jade
20
Comon MMI
Coefficient
0.05 16 Mermaid
14
0
12
−0.05
1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8 3 10
Generalized Gaussian Parameter (β) 100 150 200 250 300 350 400 450 500
where, in each row, the source corresponding to the maximum entry is con-
sidered the main signal at that output. Given this measure, it is interesting to
analyze how the SIR performance depends on the number of available sam-
ples independently of the number of constraints. In Fig. 6.18(right) we show
that when generating random mixing matrices, the average SIR is better and
better for minimax ICA as the number of samples increases (such samples are
needed to estimate high-order moments) being the best algorithm (even than
FastICA) and sharing this position with JADE [35] when the number of sam-
ples decreases. This is consistent with the fact that JADE uses fourth-order
cumulants (cumulants are defined by the logarithm of the moment generating
function).
The development of ICA algorithms is still in progress. ICA algorithms
are also very useful for solving the blind source separation problem summa-
rized in the cocktail party problem: given several mixed voices demix them. It
turns out that the inverse of W is the mixing matrix [79]. Thus, ICA solves
the source separation problem up to the sign. However, regarding pattern
classification, recent experimental evidence [167] shows that whitened PCA
compares sometimes to ICA: when feature selection is performed in advance,
FastICA, withened PCA, and infomax have similar behaviors. However, when
feature selection is not performed and there is a high number of components,
infomax outperforms the other two methods in recognition rates.
In general, ICA is considered a technique of projection pursuit [10], that
is, a way of finding the optimal projections of the data (here optimal means
the projection which yields the best clarification of the structure of multidi-
mensional data). Projection pursuit is closely related to feature selection and
Linear Discriminant Analysis (LDA) [111]. The main difference with respect
to ICA and PCA is that LDA looks for projecting the data to maximize the
discriminant power. In [66], a technique dubbed the Adaptive Discriminant
Analysis (ADA) proceeds by iteratively selecting the axis yielding the max-
imum mutual information between the projection, the class label, and the
projection on the yet selected axes. This is, in general, untractable as we have
seen along the chapter. In [66], theoretical results are showed for mixtures of
two Gaussians.
Fig. 6.19. Top: arrangement of three subspaces (one plane V1 and two lines V2 and
V3 ). Bottom: different instances of the problem with increasing difficulty (from left
to right). Figure by Y. Ma, A.Y. Yang, H. Derksen and R. Fossum (2008
c SIAM).
νn (x) being a vector of monomials of degree n (in the latter case n = 2).
[k] + ,
In general, we have Mn = n+k−1 = (n + k − 1)!/((n + k − 1)(n − 1)!)
n
[3] +,
monomials, each one with a coefficient ch . In the latter example, M2 = 42 =
4!/(4(1!)) = 6 monomials and coefficients. Therefore, we have a mechanism to
[k]
map or embed a point x ∈ Rk to a space of Mn dimensions where the basis
is given by monomials of degree n and the coordinates are the coefficients
ch for these monomials. Such embedding is known as the Veronese map of
degree n [71]:
⎛ ⎞ ⎛ ⎞
x1 xn1
⎜ x2 ⎟ ⎜ xn−1 ⎟
[k] ⎜ ⎟ ⎜ 1 x2 ⎟
νn : Rk → RMn , where νn ⎜ . ⎟ = ⎜ .. ⎟
⎝ .. ⎠ ⎝ . ⎠
xk xnk
be the m×k Jacobian matrix of the collection of Q. Then, it turns out that the
rows of J (Q(xi )) evaluated at xi span an orthogonal space to Vi . This means
that the right null space of the Jacobian yields a basis Bi = (b1 , . . . , bdi )
of Vi , where di = k − rank(J (Q(xi ))T ) and usually di << k. If we repeat
258 6 Feature Selection and Transformation
the latter rationale for each Vi we obtain the basis for all subspaces. Finally,
we assign a sample x to subspace Vi if BTi x = 0 or we choose the subspace
Vi minimizing ||BTi x|| (clustering, segmentation, classification).
Until now we have assumed that we know an example xi belonging to
Vi , and then it is straightforward to obtain the basis of the corresponding
subspace. However, in general, such assumption is unrealistic and we need an
unsupervised learning version of gPCA. In order to do that we exploit the
Sampson distance. Assuming that the polynomials in Q are linearly indepen-
dent, given a point x in the arrangement, that is Q(x) = 0, if we consider the
first-order Taylor expansion of Q(x) at x, the value at x̃ (the closest point to
x) is given by
which implies
where (·)† denotes the pseudo-inverse. Then, ||x̃−x||2 (square of the Euclidean
distance) approximates the Sampson distance. Then, we will choose to point
lying in, say, the nth subspace, xn as the one minimizing the latter Sampson
distance but having a non-null Jacobian (points having null derivatives lie at
the intersection of subspaces and, thus, yield noisy estimations of the normals).
This allows us to find the basis Bn . Having this basis for finding the point xn−1
it is useful to exploit the fact that if we have a point xi ∈ Vi , points belonging
to ∪nl=i Vl satisfy ||BTi x|| · · · ||BTn x|| = 0. Therefore, given a point xi ∈ Vi ,
(for instance i = 1) a point xi−1 ∈ Vi−1 can be obtained as
Q(x)T (J (Q(x))J (Q(x))T )† Q(x̃) + δ
xi−1 = arg min (6.87)
x∈S:J (Q(x))=0 ||BTi x|| · · · ||BTn x|| + δ
Thus the ED is the average of two terms. The first one di (k − di ) is the total
number of reals needed to specify a di dimensional space in Rk (the product
of the dimension and its margin with respect to the dimension of ambient
space). This product is known as the dimension (Grassman dimension) of
the Grassmanian manifold of di dimensional subspaces of Rk . On the other
hand, the second term Ni ki is the total number of reals needed to code the
Ni samples in Vi . A simple quantitative example may help to understand the
concept of ED. Having three subspaces (two lines and a plane in R3 as shown
in Fig. 6.19) with 10 points lying on each line and 30 ones on the plane,
we have 50 samples. Then, if we consider having three subspaces, the ED is
50 (1 × (3 − 1) + 1 × (3 − 1) + 2 × (3 − 2) + 10 × 1 + 10 × 1 + 30 × 2) that
1
1
is 50 (6 + 80) = 1.72. However, if we decide to consider that the two lines
define a plane and consider only two subspaces (planes) the ED is given by
50 (2 × (3 − 2) + 2 × (3 − 2) + 20 × 2 + 30 × 2) = 2.08 which is higher than
1
the previous choice, and, thus, more complex. Therefore, the better choice is
having two lines and one plane (less complexity for the given data). We are
minimizing the ED. Actually the Minimum Effective Dimension (MED) may
be defined in the following terms:
that is, as the minimum ED among the arrangements fitting the samples.
Conceptually, the MED is a kind of information theoretic criterion adapted
to the context of finding the optimal number of elements (subspaces) of an
arrangement, considering that such elements may have different dimensions.
It is particularly related to the AIC. Given the data S : |S| = N , the model
parameters M, and the free parameters kM for the considered class of model,
the AIC is defined as
2 kM
AIC = − log P (S|M) + 2 ≈ E(− log P (S|M)) (6.90)
N N
when N → ∞. For instance, for the Gaussian noise model with equal variance
in all dimensions (isotropy) σ 2 , we have that
260 6 Feature Selection and Transformation
1
N
log P (S|M) = − ||xi − x̃i ||2 (6.91)
2σ 2 i=1
x̃i being the best estimate of xi given the model P (S|M). If we want to adapt
the latter criterion to a context of models with different dimensions, and we
consider the isotropic Gaussian noise model, we obtain that AIC minimizes
1
N
dM 2
AIC(dM ) = ||xi − x̃i ||2 + 2 σ (6.92)
N i=1 N
selecting the model class M∗ , dM being the dimension of a given model class.
AIC is related to BIC (used in the XMeans algorithm, see Prob. 5.14) in
the sense that in the latter one, one must simply replace the factor 2 in the
second summand by log(N ) which results in a higher amount of penalization
for the same data and model. In the context of gPCA, we are considering
Grassmanian subspaces of Rk with dimension d(k − d) which implies that
AIC minimizes
1
N
d(k − d) 2
AIC(d) = ||xi − x̃i ||2 + 2 σ (6.93)
N i=1 N
1
N
d(k − d) + N d 2
GAIC(d) = ||xi − x̃i ||2 + 2 σ (6.94)
N i=1 N
When applied to encoding the samples S with a given arrangement of, say n,
subspaces Z the GAIC is formulated as
1
N n
dj (k − dj ) + N dj 2
GAIC(S, Z) = ||xi − x̃i ||2 + 2 σ
N i=1 j=1
N
1
N
= ||xi − x̃i ||2 + 2N σ 2 ED(S, Z) (6.95)
N i=1
which reveals a formal link between GAIC and the ED (and the MED when
GAIC is minimized). However, there are several limitations that preclude
the direct use of GAIC in the context of gPCA. First, having subspaces of
different dimensions makes almost impossible to choose a prior model. Second,
the latter rationale can be easily extended to assuming a distribution for the
sample data. And third, even when the variance is known, the GAIC minimizes
the average residual ||xi − x̃i ||2 which is not an effective measure against
outliers. The presence of outliers is characterized by a very large residual.
6.5 From PCA to gPCA 261
This a very important practical problem where one can find two extremes in
context of high noise rates: samples are considered either as belonging to a
unique subspace or to N one-dimensional spaces, each one defined by a sample.
Thus, it seems very convenient in this context to find the optimal trade-off
between dealing noise and obtaining a good model fitness. For instance, in the
SVD version of PCA, the k×N data matrix X = (x1 , x2 , . . . , xN ) is factorized
through SVD X = UΣVT , the columns of U give a basis and the rank
provides the dimension of the subspace. As in the eigenvectors/eigenvalues
formulation of PCA, the singular values left represent the squared error of the
representation (residual error). Representing the ordered singular values vs.
the dimension of the subspace (see Fig. 6.20, left) it is interesting to note the
existence of a knee point after the optimal dimension. The remaining singular
values represent the sum of square errors, and weights w1 , w2 > 0, derived from
the knee points, represent the solution to an optimization problem consisting
on minimizing
N
JPCA (Z) = w1 ||xi − x̃i ||2 + w2 dim(Z) (6.96)
i=1
Z being the subspace, x̃i the closest point to xi in that subspace and dim(Z)
the dimension of the subspace. The latter objective function is closely related
to the MLD-AIC like principles: the first term is data likelihood (residual error
in the terminology of least squares) and the second one is complexity. In order
to translate this idea into the gPCA language it is essential to firstly redefine
the MED properly. In this regard, noisy data imply that the ED is highly de-
pendent on the maximum allowable residual error, known as error tolerance τ .
The higher τ the lower optimal dimension of the subspace (even zero dimen-
sional). However, when noisy data arise, and we set a τ lower to the horizontal
coordinate of the knee point, samples must be either considered as indepen-
dent subspaces or as unique one-dimensional subspace. The consequence is
K K
w 1x + w 2y w 1x + w 2y
0 τ σK 0 τ* τmax τ
singular values error tolerance
Fig. 6.20. Optimal dimensions. In PCA (left) and in gPCA (right). In both cases
K is the dimension of the ambient space and k is the optimal dimension. Figures by
K. Huang, Y. Ma and R. Vidal (2004
c IEEE).
262 6 Feature Selection and Transformation
N
JgPCA (Z) = w1 ||xi − x̃i ||∞ + w2 MED(X , τ ) (6.98)
i=1
Adopting a criterion like the one in Eq. 6.93 in gPCA by preserving the ge-
ometric structure of the arrangement can be done by embedding this criterion
properly in the algebraic original formulation of gPCA. As we have explained
above, an arrangement of n subspaces can be described by a set of polynomi-
als of degree n. But when subspaces with different dimensions arise (e.g. lines
and planes) also polynomials with less dimensions than n can fit the data. In
the classical example of two lines and one plane (Fig. 6.19) a polynomial of
second degree may also fit the data. The intriguing and decisive question is
whether this plane may be then partitioned into two lines. If so we will re-
duce the ED, and this will happen until no subdivision is possible. In an ideal
[k]
noise-free setting, as Mn decides the rank of matrix Ln and then bounds the
number of subspaces, it is interesting to start by n = 1, and then increase n
until we find a polynomial of degree n which fits all the data (Ln has lower
rank). Then, it is possible to separate all the n subspaces following the gPCA
steps described above. Then, we can try to apply the process recursively to
each of the n until there are no lower dimensional subspaces in each group or
there are too much groups. In the real case of noisy data, we must fix n and
find the subspaces (one at a time) with a given error tolerance. For a fixed n:
(i) find the fist subspace and assign to it the points with an error less than τ ;
(ii) repeat the process to fit the remaining subspaces to the data points (find
points associated to subspaces and then the orthogonal directions of the sub-
space). However, the value of τ is key to know whether the rank of Ln can be
identified correctly. If we under-estimate this rank, the number of subspaces
is not enough to characterize all the data. If we over-estimate the rank we
cannot identify all the subspaces because all points have been assigned to a
subspace in advance. We can define a range of ranks (between rmin and rmax )
and determine whether the rank is under or over-estimated in this range. If
none of the ranks provides a segmentation in n subspaces within a tolerance
of τ , it is quite clear that we must increase the number of subspaces.
In Alg. 15 α =< ., . > denotes the subspace angle which is an estimation
of the amount of dependence between two subspaces (low angle indicates high
dependence and vice versa). Orthogonal subspaces have a π/2 angle. If the
6.5 From PCA to gPCA 263
highest angle we can find is too small, this indicates that the classic examples
of two lines and one plane are illustrated in Fig. 6.21. The algorithm starts
with n = 1, enters the while loop and there is no way to find a suitable rank
for that group (always over-estimated). Then, we increase n to n = 2, and
it is possible to find two independent subspaces because we have two groups
(the plane and the two lines). Then, for each group a new recursion level
264 6 Feature Selection and Transformation
Fig. 6.21. Right: result of the iterative robust algorithm for performing gPCA recur-
sively while minimizing the effective dimension. Such minimization is derived from
the recursive partition. Left: decreasing of ED with τ . Figure by Y. Ma, A.Y. Yang,
H. Derksen and R. Fossum (2008c SIAM).
Problems
1
N
E(Fj (x)Gr (x)) = Fj (xi )Gr (xi )
N i=1
where Θ are the parameters of the model (means, variances), D are the data
(then log(D|Θ) is the log-likelihood), and I(Θ) is the Fisher information ma-
trix.
6.5 From PCA to gPCA 267
and then the union is represented by q1 (x) = (x1 x3 ) and q2 (x) = (x2 x3 ).
Then, the generic Jacobian matrix is given by
⎛ ∂q ∂q ∂q ⎞
1 1 1
∂x1 ∂x2 ∂x3 x3 0 x1
J (Q(x)) = ⎝ ⎠=
∂q2 ∂q2 ∂q2
∂x1 ∂x2 ∂x3
0 x3 x2
where the first rows Dp1 (x) and the second are Dp2 (x). Then choosing a
z1 = (0 0 1)T point with x3 = 1 for the line, and choosing a point z2 = (1 1 0)T
with x1 = x2 = 1 for the plane. We have
100 001
J (Q(z1 )) = , J (Q(z2 )) =
010 001
where it is clear that the mull of the transpose of the Jacobian yields the basis
of the line (a unique vector B1 = {(0 0 1)T } because the Jacobian has rank 2)
and of the second (with rank 1) B1 = {(0 1 0)T , (−1 0 0)T }. Check that the
obtained basis vectors are all orthogonal to their corresponding Dp1 (zi ) and
Dp2 (zi ). Given the latter explanations, compute these polynomials from the
[k]
Veronese map. Hint: for n = 2 in R3 (k = 3) we have Mn = 6 monomials and
coefficients. To that end suggest 10 samples for each subspace. Then compute
the Q collection of vectors and then the Jacobian assuming that we have a
known sample per subspace. Show that the results are coherent with the ones
described above. Find also the subspaces associated to each sample (segmen-
tation). Repeat the problem unsupervisedly (using the Sampson distance).
Introduce noise samples and reproduce the estimation and the segmentation
to test the possible change of rank and the deficiencies in the segmentation.
7.1 Introduction
The classic information-theoretic classifier is the decision tree. It is well known
that one of its drawbacks is that it tends to overfitting, that is, it yields large
classification errors with test data. This is why the typical optimization of
this kind of classifiers is some type of pruning. In this chapter, however, we
introduce other alternatives. After reminding the basics of the incremental
(local) approach to grow trees, we introduce a global method which is ap-
plicable when a probability model is available. The process is carried out by
a dynamic programming algorithm. Next, we present an algorithmic frame-
work which adapts decision trees for classifying images. Here it is interesting
to note how the tests are built, and the fundamental lesson is that the large
amount of possible tests (even when considering only binary relationships
between parts of the image) recommends avoiding building unique but deep
trees, in favor of a bunch of shallow trees. This is the keypoint of the chap-
ter, the emergence of ensemble classifying methods, complex classifiers built
in the aggregation/combination of simpler ones, and the role of IT in their de-
sign. In this regard, the method adapted to images is particularly interesting
because it yields experimental results in the domain of OCRs showing that
tree-averaging is useful. This work inspired the yet classical random forests
approach where we analyze their generalization error and show applications in
different domains like bioinformatics, and present links with Boosting. Follow-
ing the ensemble focus of this chapter, next section introduces two approaches
to improve Boosting. After introducing the Adaboost algorithm we show how
boosting can be driven both by mutual information maximization (infomax)
and by maximizing Jensen–Shannon divergence (JBoost). The main difference
between the two latter IT-boosting approaches relies on feature selection (lin-
ear or nonlinear features). Then, we introduce to the reader the world of
maximum entropy classifiers, where we present the basic iterative-scaling al-
gorithm for finding the Lagrange multipliers. Such algorithm is quite different
from the one described in Chapter 3, where we estimate the multipliers in
F. Escolano et al., Information Theory in Computer Vision and Pattern Recognition, 271
c Springer-Verlag London Limited 2009
272 7 Classifier Design
Let X = {x} be the training set (example patterns) and Y = {y} the class
labels which are assigned to each element of X in a supervised manner. As it
is well known, a classification tree, like CART [31] or C4.5 [132], is a model
extracted from the latter associations which tries to correctly predict the
most likely class for unseen patterns. In order to build such model, one must
consider that each pattern x ∈ X is a feature vector x = (x1 , x2 , . . . , xN ), and
also that each feature xt has associated a test I(xi > ci ) ∈ {0, 1}, whose value
depends on whether the value xi is above (1) or below (0) a given threshold ci .
Thus, a tree T (binary when assuming the latter type of test) consists of a set
of internal nodes Ṫ , each one associated to a test, and a set of terminal nodes
(leaves) ∂T , each one associated to a class label. The outcome hT (x) ∈ Y
relies on the sequence of outcomes of the tests/questions followed for reaching
a terminal.
Building T implies establishing a correspondence π(t) = i, with i ∈
{1, 2, . . . , N }, between each t ∈ Ṫ and each test Xt = I(xi > ci ). Such
correspondence induces a partial order between the tests (nodes) in the tree
(what tests should be performed first and what should be performed later),
and the first challenging task here is how to choose the proper order. Let Y a
random variable encoding the true class of the examples and defined over Y.
The uncertainty about such true class is encoded, as usual, by the entropy
H(Y ) = P (Y = y) log2 P (Y = y) (7.1)
y∈Y
where P (Xt = 1) = |X t|
|X | is the fraction of examples in X satisfying the test
Xt and P (Xt = 0) = 1 − P (Xt = 1). Moreover, Ht0 and Ht1 are the entropies
associated to the two descents of t:
Ht0 (Y ) ≡ H(Y |Xt = 0) = − P (Y = y|Xt = 0) log2 P (Y = y|Xt = 0)
y∈Y
Ht1 (Y ) ≡ H(Y |Xt = 1) = − P (Y = y|Xt = 1) log2 P (Y = y|Xt = 1)
y∈Y
Alternatively to the greedy method, consider that1 the path 1 for 1 reaching
a leaf l ∈ ∂T has length P and let Ql = Xπ(1) Xπ(2) . . . Xπ(P −1) .
Then P (Qt ) = P (Xπ(1) = kπ(1) , . . . , Xπ(P −1) = kπ(P −1) ) is the probability
274 7 Classifier Design
of reaching that leaf. A global error is partially given by minimizing the av-
erage terminal entropy
H(Y |T ) = P (Ql )Hl (Y ) = P (Ql )H(Y |Ql ) (7.6)
l∈∂T l∈∂T
which means that it is desirable to either have small entropies at the leaves,
when the P (Ql ) are significant, or to have small P (Ql ) when H(Y |Ql ) becomes
significant.
The latter criterion is complemented by a complexity-based one, which is
compatible with the MDL principle: the minimization of the expected depth
Ed(T ) = P (Ql )d(l) (7.7)
l∈∂T
d(l) being the depth of the leaf (the depth of the root node is zero).
When combining the two latter criteria into a single one it is key to high-
light that the optimal tree depends on the joint distribution of the set of
possible tests X = {X1 , . . . , XN } and the true class Y . Such joint distribu-
tion (X, Y ) is the model M. It has been pointed out [8, 62] that a model
is available in certain computer vision applications like face detection where
there is a “rare” class a, corresponding to the object (face) to be detected,
and a “common” one b corresponding to the background. In this latter case
we may estimate the prior density p0 (y), y ∈ Y = {a, b} and exploit this
knowledge to speed up the building of the classification tree.
Including the latter considerations, the global optimization problem can be
posed in terms of finding
where Ω is the set of possible trees (partial orders) depending on the available
features (tests) and λ > 0 is a control parameter. Moreover, the maximal depth
of T ∗ is bounded by D = maxl∈∂T d(l).
What is interesting in the latter formulation is that the cost C(T , M) can
be computed recursively. If we assume that the test Xt is associated to the
root of the tree we have
and therefore
Tk being the k-th subtree and {M|Xt = k} the resulting model after fixing
Xt = k. This means that, once Xt is selected for the root, we have only N − 1
tests to consider. Anyway, finding C ∗ (M, D), the minimum value of C(T , M)
over all trees with maximal depth D requires a near O(N !) effort, and thus is
only practical for a small number of tests and small depths.
In addition, considering that C ∗ (M, D) is the minimum value of C(T , M)
over all trees with maximal depth D, we have that for D > 0:
⎧
⎨ H(p0 )
C ∗ (M, D) = min λ + min P (Xt = k)C ∗ ({M|Xt = k}, D − 1)
⎩ t∈X
k∈{0,1}
(7.9)
∗
and obviously C (M, 0) = H(p0 ). As stated above, finding such a minimum
is not practical in the general case, unless some assumptions were introduced.
For instance, in a two-class context Y = {a, b}, consider that we have a “rare”
class a and a “common” one b, and we assume the prior p0 (a) ≈ 10−1 and
p0 (b) = 1−p0 (a). Consider for instance the following test: always X1 = 1 when
the true class is the rare one, and X1 = 1 randomly when the true class is the
common one; that is P (X1 = 1|Y = a) = 1 and P (X1 = 1|Y = b) = 0.5. If the
rare class corresponds to unfrequent elements to detect, such test never yields
false negatives (always fires when such elements appear) but the rate of false
positives (firing when such elements do not appear) is 0.5. Suppose also that
we have a second test X2 complementary to the first one: X2 = 1 randomly
when the true class is the rare one, and never X2 = 1 when the true class is
the common one, that is P (X2 = 1|Y = a) = 0.5 and P (X2 = 1|Y = b) = 0.
Suppose that we have three versions of X2 , namely X2 , X2 , and X2 which is
not a rare case in visual detection where we have many similar tests (features)
with similar lack of importance.
Given the latter specifications (see Table 7.1), a local approach has to
decide between X1 and any version of X2 for the root r, based on Hr (Y |X1 )
and Hr (Y |X2 ). The initial entropy is
whereas
H(Y |X2 ) = P (X2 = 0)H(Y |X2 = 0) + P (X2 = 1) H(Y |X2 = 1)
0
(9, 995 + 2) 2 2 9, 995 9, 995
= − log 2 − log 2
104 9, 997 9, 997 9, 997 9, 997
H(Y )
= 0.0003 ≈
2
7.2 Model-Based Decision Trees 277
and
H(Y |X2 ) = P (X2 = 0)H(Y |X2 = 0) + P (X2 = 1) H(Y |X2 = 1)
0
(9, 995 + 3) 3 3 9, 995 9, 995
= − log 2 − log 2
104 9, 998 9, 998 9, 998 9, 998
= 0.0004.
and H(Y |X2 )
H(Y |X2 ) = P (X2 = 0)H(Y |X2 = 0) + P (X2 = 1) H(Y |X2 = 1)
0
(9, 995 + 3) 3 3 9, 995 9, 995
= − log2 − log2
104 9, 998 9, 998 9, 998 9, 998
= 0.0004
The latter results evidence that H(Y |X2 ), for all versions of the X2 test,
will be always lower than H(Y |X1 ) because H(Y |X1 ) is dominated by the
case X1 = 1 because in this case the fraction of examples of class b is reduced
to 1/2 and also all patterns with class a are considered. This increases the
entropy approaching H(Y ). However, when considering X2 , the dominating
option is X2 = 0 and in case almost all patterns considered are of class b
except few patterns of class a which reduces the entropy.
Therefore, in the classical local approach, X2 will be chosen as the test for
the root node. After such a selection we have in X ∼ X2 almost all of them of
class b but two of class a, whereas in X2 there are three examples of class a.
This means that the child of the root node for X2 = 0 should be analyzed
more in depth whereas child for X2 = 1 results in a zero entropy and does not
require further analysis (labeled with class a). In these conditions, what is the
best following test to refine the root? Let us reevaluate the chance of X1 :
0
H(Y |X2 = 0, X1 ) = P (X2 = 0, X1 = 0) H(Y |X2 = 0, X1 = 0)
+ P (X2 = 0, X1 = 1)H(Y |X2 = 0, X1 = 1)
(5, 000 + 2)
= H(Y |X2 = 0, X1 = 1)
104
(5, 000 + 2) 2 2
= − log2
104 5, 002 5, 002
5, 000 5, 000
− log2 = 0.0003
5, 002 5, 002
278 7 Classifier Design
and consider also X2 :
H(Y |X2 = 0, X2 ) = P (X2 = 0, X2 = 0)H(Y |X2 = 0, X2 = 0)
+P (X2 = 0, X2 = 1) H(Y |X2 = 0, X2 = 1)
0
(9, 995 + 1)
= H(Y |X2 = 0, X2 = 0)
104
(9, 995 + 1) 1 1
= − log2
104 9, 996 9, 996
9, 995 9, 995
− log2 = 0.0002
9, 996 9, 996
and X2 :
H(Y |X2 = 0, X2 ) = P (X2 = 0, X2 = 0)H(Y |X2 = 0, X2 = 0)
+ P (X2 = 0, X2 = 1) H(Y |X2 = 0, X2 = 1)
0
(9, 995 + 2)
= H(Y |X2 = 0, X2 = 0)
104
(9, 995 + 2) 2 2
= − log2
104 9, 997 9, 997
9, 995 9, 995
− log2 = 0.0003
9, 997 9, 997
Again, X1 is discarded. What happens now is that the child for X2 = 1
has an example of class a and it is declared a leaf. On the other hand, the
branch for X2 = 0 has associated many examples of class b and only one of
class a, and further analysis is needed. Should this be the time for X1 ? Again
H(Y |X2 = 0, X2 = 0, X1 ) = P (X2 = 0, X2 = 0, X1 = 0)
0
H(Y |X2 = 0, X2 = 0, X1 = 0)
+P (X2 = 0, X2 = 0, X1 = 1)
H(Y |X2 = 0, X2 = 0, X1 = 1)
(9, 995 + 1)
= H(Y |X2 = 0, X2 = 0, X1 = 1)
104
(9, 995 + 1) 1 1
= − log2
104 9, 996 9, 996
9, 995 9, 995
− log2 = 0.0002
9, 996 9, 996
7.2 Model-Based Decision Trees 279
However, if we consider X2 what we obtain is
H(Y |X2 = 0, X2 = 0, X2 ) = P (X2 = 0, X2 = 0, X2 = 0)
0
H(Y |X2 = 0, X2 = 0, X2 = 0)
(9, 995 + 1)
+ H(Y |X2 = 0, X2 = 0, X2 = 1)
104
(9, 995 + 1) 1 1
= − log 2
104 9, 996 9, 996
9, 995 9, 995
− log2 = 0.0002
9, 996 9, 996
Then we may select X2 and X1 . This means that with a lower value of p0 (a),
X1 would not be selected with high probability. After selecting, for instance
X2 , we may have that the child for X2 = 0 is always of class b and the
child for X2 = 1 is a unique example of class a. Consequently, all leaves
are reached without selecting test X1 . Given the latter greedy tree, let us
call it Tlocal = (X2 , X2 , X2 ) attending to its levels, we have that the rate
of misclassification is 0 when the true class Y = b although to discover it we
should reach the deepest leaf. When we test the tree with a examples not
contemplated in the training we have that only one of them is misclassified
(the misclassification error when Y = a is 18 (12.5%)). In order to compute the
mean depth of the tree we must calculate the probabilities of reaching each
of the leaves. Following an inorder traversal we have the following indexes for
the four leaves l = 3, 5, 6, 7:
P (Q3 ) = P (X2 = 1) = 3 × 10−4
P (Q5 ) = P (X2 = 0, X2 = 1) = 10−4
P (Q6 ) = P (X2 = 0, X2 = 0, X2 = 0) = 9, 995 × 10−4
P (Q7 ) = P (X2 = 0, X2 = 0, X2 = 1) = 10−4
and then
Ed(Tlocal ) = P (Ql )d(l)
l∈∂Tlocal
= P (Q3 )d(3) + P (Q5 )d(5) + P (Q6 )d(6) + P (Q7 )d(7)
= 3 × 10−4 × 1 + 10−4 × 2 + 9, 995 × 10−4 × 3 + 10−4 × 3
= 2.9993
which is highly conditioned by P (Q6 )d(6), that is, it is very probable to reach
that leaf, the unique labeled with b. In order to reduce such average depth it
should be desirable to put b leaves at the lowest depth as possible, but this
280 7 Classifier Design
implies changing the relative order between the tests. The fundamental ques-
tion is that in Tlocal uneffective tests (features) are chosen so systematically
by the greedy method, and tests which work perfectly, but for a rare class are
relegated or even not selected.
On the other hand, the evaluation H(Y |Tlocal ) = 0 results from a typical
tree where all leaves have null entropy. Therefore the cost C(Tlocal , M) =
λEd(Tlocal ), and if we set λ = 10−4 we have a cost of 0.0003.
The latter probabilities are needed for computing C ∗ (M, D) in Eq. 7.9,
and, of course, its associated tree. Setting for instance D = 3, we should
compute C ∗ (M, D), being M = p0 because the model is mainly conditioned
by our knowledge of such prior. However, for doing that we need to compute
C ∗ (M1 , D − 1) where M1 = {M|Xt = k} for t = 1 . . . N (in this N = 2) and
k = 0, 1 is the set of distributions:
The latter equations are consistent with the key idea that the evolution of
(X, Y ) depends only on the evolution of the posterior distribution as the tests
are performed. Assuming conditional independence between the tests (X1 and
7.2 Model-Based Decision Trees 281
X2 in this case) it is possible to select the best test at each level and this will
be done by a sequence of functions:
Ψd : Md → {1 . . . N }
In addition, from the latter equations, it is obvious that we have only two
tests in our example: X1 and X2 , and consequently we should allow to use
282 7 Classifier Design
them repeatedly, that is, allow p(·|X1 = 0, X1 = 1) and so on. This can be
achieved in practice by having versions or the tests with similar statistical
behavior as we argued above. Thus, the real thing for taking examples and
computing the posteriors (not really them by their entropy) will be the un-
derlying assumption of p(·|X1 = 0, X1 = 1), that is, the first time X1 is
named we use its first version, the second time we use the second one, and
so
Don. However, although we use this trick,
D the algorithm needs to compute
0=1 |M d | which means, in our case, 0=1 (2 × N )d
= 1 + 4 + 16 + 64 pos-
teriors. However, if we do not take into account the order in which the tests
are performed along a branch the complexity is
d + 2M − 1
|Md | =
2M − 1
which reduces the latter numbers to 1 + 4 + 10 + 20 (see Fig. 7.1, top) and
makes the problem more tractable.
Taking into account the latter considerations it is possible to elicit a dy-
namic programming like solution following a bottom-up path, that is, starting
from computing the entropies of all posteriors after D = 3. The top-down
collection of values states that the tests selected at the first and second levels
are the same X2 (Fig. 7.1, bottom). However, due to ambiguity, it is possible
to take X2 as optimal test for the third level which yields Tlocal ≡ Tglobal1
X1=0 X1=0 X1=1 X1=1 X1=0 X1=0 X1=1 X2=0 X2=0 X2=1
X2=0 X2=1 X2=0 X2=1 X1=0 X1=1 X1=1 X2=0 X2=1 X2=1
X1=0 X1=0 X1=0 X1=0 X1=0 X11=0 X1=0 X1=1 X1=1 X1=1 X1=1 X1=1 X1=0 X1=0 X1=0 X1=1 X2=0 X2=0 X2=0 X2=1
X2=0 X2=0 X2=0 X2=0 X2=1 X2=1 X2=1 X2=0 X2=0 X2=0 X2=1 X2=1 X1=0 X1=0 X1=1 X1=1 X2=0 X2=0 X2=1 X2=1
X1=0 X1=1 X2=0 X3=
=1 X1=0 X1=1 X2=1 X1=1 X2=0 X2=1 X1=1 X2=1 X1=0 X1=1 X1=1 X1=1 X2=0 X2=1 X2=1 X2=1
X2
X1 X11 X2 X1 X2
X1 X1 X2
X1 X2 X2 X2 X1 X1 X2 X2 X1
.0001 .0007 .0004 .0001 .0001 .0001 .0013 .0001 .0001 .0007
0.000 0.000 0.000 .0012 .0003 0.000 0.000 0.000 0.000 0.000 0.000 .0057 0.000 0.000
Fig. 7.1. Bottom-up method for computing the tree with minimal cost. Top: cells
needed to be filled during the dynamic programming process. Bottom: bottom-up
process indicating the cost cells and the provisional winner test at each level. Dashed
lines correspond to information coming from X1 and solid lines to information com-
ing from X2 .
7.2 Model-Based Decision Trees 283
having exactly the optimal cost. However, it seems that it is also possible to
take X1 for the third level and obtaining for instance Tglobal2 = (X2 , X2 , X1 ).
The probabilities of the four leaves are:
P (Q3 ) = P (X2 = 1) = 3 × 10−4 ,
P (Q5 ) = P (X2 = 0, X2 = 1) = 10−4 ,
P (Q6 ) = P (X2 = 0, X2 = 0, X1 = 0) = 4, 995 × 10−4 ,
P (Q7 ) = P (X2 = 0, X2 = 0, X1 = 1) = 5, 001 × 10−4 ,
and
H(Y |Tglobal2 ) = P (Ql )d(l)
l∈∂Tglobal2
which indicates that this is a suboptimal choice, having, at least, the same
misclassification error than Tlocal . However, in deeper trees what is typically
observed is that the global method improves the misclassification rate of the
local one and produces more balanced trees including X1 -type nodes (Fig. 7.2).
284 7 Classifier Design
H(Y)=0.0007 H(Y|X’2)=0.0003
X´2 X´2
H(Y|X’2=0,X’’2)=0.0002
X˝2 A X˝2 A
H(Y|...,X’’’2)=0.0002
A A
X˝´2 X´1
B A B AB
Fig. 7.2. Classification trees. Left: using the classical greedy approach. Ambiguity:
nonoptimal tree with greater cost than the optimal one.
Classical decision trees, reviewed and optimized in the latter section, are de-
signed to classify vectorial data x = (x1 , x2 , . . . , xN ). Thus, when one wants to
classify images (for instance bitmaps of written characters) with these meth-
ods, it is important to extract significant features, and even reduce N as
possible (see Chapter 5). Alternatively it is possible to build trees where the
tests Xt operate over small windows and yield 0 or 1 depending on whether
the window corresponds to a special configuration, called tag. Consider for in-
stance the case of binary bitmaps. Let us, for instance, consider 16 examples
of the 4 simple arithmetic symbols: +, −, ÷, and ×. All of them are 7 × 7
binary bitmaps (see Fig. 7.3). Considering the low resolution of the examples
we may retain all the 16 tags of dimension 2 × 2 although the tag correspond-
ing to number 0 should be deleted because it is the least informative (we will
require that at least one of the pixels is 1 (black in the figure)).
In the process of building a decision tree exploiting the tags, we may assign
a test to each tag, that is X1 , . . . , X15 which answer 1 when the associated
tag is present in the bitmap and 0 otherwise. However, for the sake of invari-
ance and higher discriminative power, it is better to associate the tests to
the satisfaction of “binary” spatial relationships between tags, although the
complexity of the learning process increases significantly with the number of
tags. Consider for instance four types of binary relations: “north,” “south,”
“west” and “east.” Then, X5↑13 will test whether tag 5 is north of tag 13,
whereas X3→8 means that tag 5 is at west of tag 8. When analyzing the lat-
ter relationships, “north,” for instance, means that the second row of the tag
must be greater than the first row of the second tag.
Analyzing only the four first initial (ideal) bitmaps we have extracted
38 canonic binary relationships (see Prob. 7.2). Canonic means that many
relations are equivalent (e.g. X1↑3 is the same as X1↑12 ). In our example the
7.3 Shape Quantization and Multiple Randomized Trees 285
Fig. 7.3. Top: Sixteen example bitmaps. They are numbered 1 to 16 from top-
left to bottom-right. Each column has examples of a different class. Bitmaps 1 to
4 represent the ideal prototypes for each of the four classes. Bottom: The 16 tags
coding the binary numbers from 0000 to 1111. In all these images 0 is colored white
and 1 is black.
+1, +1,
initial entropy is H(Y ) = −4 4 log2 4 = 2 as there are four classes and
four examples per class.
Given B, the set of canonic binary relations, called binary arrangements, the
basic process for finding a decision tree consistent with the latter training set,
could be initiated by finding the relation minimizing the conditional entropy.
The tree is shown in Fig. 7.4 and X3↑8 is the best local choice because it
allows to discriminate between two superclasses: (−, ×) and (+, ÷) yielding
the minimal relative entropy H(Y |X3↑8 ) = 1.0. This choice is almost obvious
because when X3↑8 = 0 class − without “north” relations is grouped with
class × without the tag 3. In this latter case, it is then easy to discriminate
between classes − and × using X3→3 which answers “no” in ×. Thus, the
leftmost path of the tree (in general: “no”,“no”,. . .) is built obeying to the
286 7 Classifier Design
3,7,11,15 1,5,9
Fig. 7.4. Top: tree inferred exploiting binary arrangements B and the minimal
extensions At found by the classical greedy algorithm. Bottom: minimal extensions
selected by the algorithm for example 5. Gray masks indicate the tag and the number
of tag is in the anchoring pixel (upper-left).
rule to find the binary arrangement which is satisfied by almost one example
and best reduces the conditional entropy. So, the code of this branch will have
the prefix 00 . . .. But, what happens when the prefix 00 . . . 1 appears?
The rightmost branch of the tree illustrates the opposite case, the prefix is
11 . . . . Once a binary arrangement has been satisfied, it proceeds to complete
it by adding a minimal extension, that is, a new relation between existing tags
or by adding a new tag and a relation between this tag and one of the existing
ones. We denote by At the set of minimal extensions for node t in the tree.
In this case, tags are anchored to the coordinates of the upper-leftmost pixel
(the anchoring pixel) and overlapping between new tags (of course of different
types) is allowed. Once a 1 appears in the prefix, we must find the pending
arrangement, that is, the one minimizing the conditional entropy. For instance,
our first pending arrangement is X5+5→3 , that is, we add tag 5 and the relation
5 → 3 yielding a conditional entropy of H(Y |X3↑8 = 1, X5→3 ) = 0.2028. After
that, it is easy to discriminate between + and ÷ with the pending arrangement
X4+4→5 yielding a 0.0 conditional entropy. This result is close to the ideal
Twenty Questions (TQ): The mean number of queries EQ to determine the
true class is the expected length of the codes associated to the terminal leaves:
C1 = 00, C2 = 01, C3 = 10, C4 = 110, and C5 = 111:
1 1 1 1 3
EQ = P (Cl )L(Cl ) = ×2+ ×2+ ×1+ ×3+ × 3 = 2.3750
4 4 16 4 16
l∈T
7.3 Shape Quantization and Multiple Randomized Trees 287
that is H(Y ) ≤ EQ < H(Y ) + 1, being H(Y ) = 2. This is derived from the
Huffman code which determines the optimal sequence of questions to classify
an object. This becomes clear from observing that in the tree, each test divides
(not always) the masses of examples in subsets with almost equal size.
0.4
0.3
P(Y=c|XA=1)
0.2
0.1
0
0 1 2 3 4 5 6 7 8 9
Digit
Fig. 7.5. Top: first three tag levels and the most common configuration below each
leaf. Center: instances of geometric arrangements of 0 and 6. Bottom: conditional
distribution for the 10 digits and the arrangement satisfied by the shapes in the
middle. Top and Center figures by Y. Amit and D. Geman (1997c MIT Press).
7.3 Shape Quantization and Multiple Randomized Trees 289
Let us then suppose that we have a set X of M possible tests and also that
these tests are not too complex to be viable in practice. The purpose of a
classification tree T is to minimize H(Y |X), which is equivalent to maximizing
P (Y |X). However, during the construction of T not all tests are selected, we
maximize instead P (Y |T ) which is only a good approximation of maximizing
P (Y |X) when M , and hence the depth of the tree, is large.
Instead of learning a single deep tree, suppose that we replace it by a
collection of K shallow trees T1 , T2 , . . . , TK . Then we will have to estimate
P (Y = c|Tk ) k = 1, . . . , K.
we have that μTk (x) denotes the posterior for the leaf reached by a given
input x. Aggregation by averaging consists of computing:
1
K
μ̄(x) = μTk (x) (7.10)
K
k=1
1
In the example in Fig. 7.4 we have first selected a binary arrangement.
290 7 Classifier Design
Fig. 7.6. Randomization and multiple trees. Top: inferred graphs in the leaves of
five differently grown trees. Center: examples of handwritten digit images, before
(up) and after (down) preprocessing. Bottom: conditional distribution for the ten
digits and the arrangement satisfied by the shapes in the middle. Figure by Y. Amit
and D. Geman (1997
c MIT Press).
7.4 Random Forests 291
Finally, the class assigned to a given input x is the mode of the averaged
distribution:
Ŷ = arg max μ̄c (7.11)
c
This approach has been successfully applied to the OCR domain . Considering
the NIST database with 223, 000 binary digits written by more than two
thousand writers, and using 100, 000 for training and 50, 000 for testing (see
Fig. 7.6, center) the classification rate significantly increases with the number
of trees (see Fig. 7.6, bottom). In these experiments the average depth of the
trees was 8.8 and the average number of terminal nodes was 600.
As we have seen in the latter section, the combination of multiple trees in-
creases significantly the classification rate. The formal model, developed years
later by Breiman [30], is termed “Random Forests” (RFs). Given a labeled
training set (X , Y), an RF is a collection of tree classifiers F = {hk (x), k =
1, . . . , K}. Probably the best way of understanding RFs is to sketch the way
they are usually built.
First of all, we must consider the dimension of the training set |X | and
the maximum number of variables (dimensions) of their elements (examples),
which is N . For each tree. Then, the kth tree will be built as follows: (i) Select
randomly and with replacement |X | samples from X , a procedure usually
denoted bootstrapping2 ; this is its training set Xk . (ii) Each tree is grown
by selecting randomly at each node n << N variables (tests) and finding the
best split with them. (iii) Let each tree grow as usual, that is, without pruning
(reduce the conditional entropy as much as possible). After building the K
forest, an x input is classified by determining the most voted class among all
the trees (the class representing the majority of individual choices), namely
The analysis of multiple randomized trees (Section 7.3.4) reveals that from
a potentially large set of features (and their relationships) building multi-
ple trees from different sets of features results in a significant increment of
the recognition rate. However, beyond the intuition that many trees improve
2
Strictly speaking, bootstrapping is the complete procedure of extracting several
training sets by sampling with replacement.
292 7 Classifier Design
the performance, one must measure or bound, if possible, the error rate of
the forest, and check in which conditions it is low enough. Such error may be
quantified by the margin: the difference between the probability of predict-
ing the correct class and the maximum probability of predicting the wrong
class. If (x, y) (an example of the training and its correct class) is a sample of
the random vector X , Y which represents the corresponding training set, the
margin is formally defined as
The margin will be in the range [−1, 1] and it is obvious that the largest the
margin the more confidence in the classification. Therefore, the generalization
error of the forest is
that is, the probability of having a negative margin over the distribution X , Y
of training sets. Therefore, for the sake of the classification performance, it is
interesting that the GE, the latter probability, has a higher bound as low as
possible. In the case of RFs such a bound is related to the correlation between
the trees in the forest. Let us for instance define the strength of the forest as
s = EX ,Y mar(X , Y) (7.14)
that is, the expectation of the margin over the distribution X , Y, and let
= EΘ [rmar(X , Y, Θ)]
I(·) being an indicator (counting) function, which returns the number of times
the argument is satisfied, rmar(., ., .) the so-called raw margin, and Θ = {Θk }
a bag of parameters sets (one set per tree). For instance, in the case of bagging
we have Θk = Xk . Consequently, the margin is the expectation of the raw
margin with respect to Θ, that is, with respect to all the possible ways of
bootstraping the training set.
Let us now consider two different parameter sets Θ and Θ . Assuming i.i.d.,
the property (EΘ f (Θ))2 = EΘ f (Θ)×EΘ f (Θ ) is satisfied. Consequently, and
applying Eq. 7.16, it is verified that
7.4 Random Forests 293
As Θ and Θ are independent with the same distribution the following relation
holds:
v
ρ̄(Θ, Θ ) = (7.20)
(EΘ [Std(Θ)])2
which is equivalent to
And we have
v ≤ ρ̄(EΘ [Std(Θ)])2 ≡ ρ̄EΘ [V ar(Θ)] (7.22)
2 2
Furthermore, as V arΘ (Std(Θ)) = EΘ [Std(Θ) ]−(EΘ [Std(Θ)]) , we have that
Thus, we have two interesting connections: (i) an upper bound for the
variance of the margin depending on the average correlation; and (ii) another
upper bound between the expectation of the variance with respect to Θ, which
depends on the strength of the forest. The coupling of the latter connections
is given by the Chebishev inequality. Such inequality gives an upper bound
for rare events in the following sense: P (|x − μ| ≥ α) ≤ σ 2 /α2 . In the case of
random forests we have the following setting:
(1 − s2 )
GE = PX ,Y (mar(X , Y) < 0) ≤ ρ̄ (7.25)
s2
which means that the generalization error is bounded by a function depending
on both the correlation and the strength of the set classifiers. High correla-
tion between trees in the forest results in poor generalization and vice versa.
Simultaneously, low strength of the set of classifiers results in poor general-
ization.
be the oob proportions of votes for a wrong class z. Such votes come from the
test set, and this means that yQ(x, z) is formally connected with GE. First
of all, it is an estimate for p(hF (x) = z). From the definition of the strength
in Eq. 7.14, we have
1
p̂z∗ (Θk ) = 1 I(hk (x) = z ∗ ) (7.30)
3 |X | x∈X ∼Xk
suitable descriptors like the popular SIFT one [107], a 128-feature vector with
good orientation invariance properties. The descriptors (once clustered) define
a visual vocabulary. The more extended classification mechanism is to obtain
a histogram with the frequencies of each visual word for each image. Such
histogram is compared with stored histograms in the database and the closest
one is chosen as output. The two main weaknesses of the BoW approach are:
(i) to deal with clutter; and (ii) to represent the geometric structure properly
and exploit it in classification. Spatial pyramids [102] seek a solution for both
problems. The underlying idea of this representation is to partition the image
into increasingly fine subregions and compute histograms of the features inside
each subregion. A refinement of this idea is given in [27]. Besides histograms
of descriptors, also orientation histograms are included (Fig. 7.7, top). With
respect to robustness against clutter, a method for the automatic selection of
the Region of Interest (ROI) for the object (discarding clutter) is also pro-
posed in the latter paper. A rectangular ROI is learnt by maximizing the
area of similarity between objects from the same class (see Fig. 7.7, bottom).
The idea is to compute histograms only in these ROIs. Thus, for each level in
the pyramid, constrained to the ROI, two types of histograms are computed:
appearance ones and shape ones. This is quite flexible, allowing to give more
weight to shape or to appearance depending on the class. When building a
tree in the forest, the type of descriptor is randomly selected, as well as the
pyramid level from which it is chosen. If x is the descriptor (histogram) of a
given learning example, each bin is a feature, considering that the number of
bins, say N , depends on the selected pyramid level. Anyway, the correspond-
ing test for such example is nT x + b ≤ 0 for the right child, and >0 for the
left one. Thus all trees are binary. The components of n are chosen randomly
from [−1, 1]. The variable b is chosen randomly between 0 and the distance of
x from the origin. The number of zeros nz imposed for that test is also chosen
randomly and yields a number n = N − nz of effective features used. For each
node, r tests, with r = 100D (increasing linearly with the node depth D),
are performed and the best one in terms of conditional entropy reduction is
selected. The purpose of this selection is to reduce the correlation between
different trees, accordingly with the trade-off described above. The number of
effective features is not critical for the performance, and the method is com-
parable with state-of-the-art approaches to BoW [178] (around 80% with 101
categories and only 45% with 256 ones).
Another interesting aspect to cover within random forest is not the num-
ber of features used but the possibility of selecting variables on behalf of their
importance (Chapter 6 is devoted to feature selection, and some techniques
presented there could be useful in this context). In the context of random
forests, important variables are the ones which, when noised, maximize the
increasing of misclassification error with respect to the oob error with all vari-
ables intact. This definition makes equal importance to sensitiveness. How-
ever, this criterion has been proved to be inadequate in some settings: for
instance in scenarios with heterogeneous variables [151], where subsampling
7.4 Random Forests 297
Fig. 7.7. Top: appearance and shape histograms computed at different levels of
the pyramid (the number of bins of each histogram depends on the level of the
pyramid where it is computed). Bottom: several ROIs learn for different categories
in the Caltech-256 Database (256 categories). Figure by A. Bosch, A. Zisserman and
X. Muñoz (2007
c IEEE). See Color Plates.
298 7 Classifier Design
The main idea of this algorithm is that information theory may be used dur-
ing Boosting in order to select the most informative weak learner in each
300 7 Classifier Design
1.5 1.5
ε= 0.120000
1 1 α= 0.136364
0.5 0.5
0 0
−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5 1.5
ε= 0.130682
1 1 α= 0.150327
0.5 0.5
0 0
−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5 1.5
ε= 0.109974
1 1 α= 0.123563
0.5 0.5
0 0
−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5
0.5
−0.5
−0.5 0 0.5 1 1.5
Fig. 7.8. Adaboost applied to samples extracted from two different Gaussian distri-
butions, using three iterations (three weak classifiers are trained). Different weights
are represented as different sample sizes. First row: initial samples and first weak
classifier. Second row: reweighted samples after first iteration and second classifier.
Third row: reweighted samples after second iteration and third classifier. Fourth row:
final classifier.
7.5 Infomax and Jensen–Shannon Boosting 301
φ(x) = φT x (7.31)
with φ ∈ Rd and φT φ = 1.
The most informative feature at each iteration is called Infomax feature, and
the objective of the algorithm is to find this Infomax feature at each iteration
of the boosting process. In order to measure how informative a feature is, mu-
tual dependence between input samples and class label may be computed. A
high dependence will mean that input samples will provide more information
about which class can it be labeled as. A natural measure of mutual depen-
dence between mapped feature φT x and class label c is mutual information:
C
p(φT x, c)
I(φT x; c) = p(φT x, c) log dφT x (7.32)
c=1 x p(φT x)p(c)
1.5 1.5
1 1
0.5 0.5
0 0
−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5
0.5
−0.5
−0.5 0 0.5 1 1.5
Fig. 7.9. Three different classifiers applied to the same data, obtained from two
different Gaussian distributions (represented at bottom right of the figure). Mutual
information between inputs and class labels, obtained by means of entropic graphs
(as explained in Chapters 3 and 4), are 6.2437, 3.5066, and 0, respectively. The
highest value is achieved in the first case; it is the Infomax classifier.
where Nc is the number of classes, xci represents input sample i from class c,
and wic is a non-negative weight applied to each training sample of class c,
Nc c
satistying that i=1 wi = 1.
Although solving this first problem, the derivation of the numerical inte-
gration of Eq. 7.32 is not simple. However, knowing that Mutual Information
7.5 Infomax and Jensen–Shannon Boosting 303
Q(p||q) = (p(x) − q(x))2 dx (7.37)
x
C
IQ (φT x; c) = (p(φT x, c) − p(φT x)p(c))2 dφT x (7.38)
c=1 φT x
and after joining this expression with the probability density functions ap-
proximated by means of Eq. 1.5:
C
Nc
Nc
IQ (φT x; c) = Pc2 wic wjc G√2σ (yic − yjc )
c=1 i=1 j=1
C
C
Nc
Nc
+ uci vjc G√2σ (yic − yjc ) (7.39)
c1 =1 c2 =1 i=1 j=1
where wic is the associated weight of xci in the approximated probability den-
sity function and
Pc = p(c) (7.40)
yic = φT xci (7.41)
C
uci = Pc Pc2 − 2Pc wic (7.42)
c =1
vic = Pc wic (7.43)
∂IQ Cc N
∂IQ ∂yic c
∂IQ c
C N
= = x (7.44)
∂φ c=1 i=1
c
∂yi ∂φ c=1 i=1
∂yic i
304 7 Classifier Design
where:
Nc c
∂IQ 2 c c
yi − yjc
= P c w w
i j 1 H
∂yic j=1
2σ
C Nc c
y c
i − y j
+ uci vjc H1 (7.45)
j=1
2σ
c =1
the inner loop, where all concepts explained during this section are incorpo-
rated. A common modification of Adaboost is also included at initialization,
where weight values are calculated separately for samples belonging to dif-
ferent classes, in order to improve efficiency. It must be noted that strong
classifier output is not a discrete class label, but a real number, due to the
fact that Infomax Boosting is based on real Adaboost, and that class labels
are in {−1, 1}.
A question that may arise is if Infomax performance is really better than
in the case of Adaboost algorithm. Infomax boosting was originally applied
to face detection, like Viola’s face detection based on Adaboost. As can be
seen in Fig. 7.10, in this case features were obtained from oriented and scaled
Gaussians and Gaussian derivatives of first and second order, by means of con-
volutions. Figure 7.10 shows results of comparing both algorithms. Regarding
false positive rate, its fall as more features are added is similar; however,
at first iterations, error for Infomax is quite lower, needing a lower number
of features to decrease this error below 1%. In the case of ROC curves, it
can be clearly seen that Infomax detection rate is higher (96.3–85.1%) than
Adaboost value. The conclusion is that not only Infomax performance is bet-
ter, but that low error rates can also be achieved faster than in the case of
Adaboost, thanks to the Infomax principle, which allows algorithm to focus
on the most informative features.
Fig. 7.10. Comparison between Adaboost and Infomax. Top row left: example of
Haar features, relative to detection window, used in Viola’s face detection based
on Adaboost. Pixel values from white rectangles are subtracted from pixel values
in black rectangles. Top row right: example of application of two Haar features to
the same face image. Center row: feature bank used in Lyu’s face detection based
on Infomax. (Figure by S. Lyu (2005
c IEEE)). Bottom row: Lyu’s experiments
comparing Infomax and Adaboost false positive rate and ROC curves. (Figure by
S. Lyu (2005
c IEEE)).
−
positive samples (h+i (φi (x)) > hi (φi (x))) and ϕi (φi (x)) < 0 for negative ones
− +
(hi (φi (x)) > hi (φi (x))):
1 h+
i (φi (x))
ϕi (x) = log − (7.46)
2 hi (φi (x))
0.35 15
0.3 10
0.25
5
0.2
0
0.15
−5
0.1
0.05 −10
0 −15
−1.5 −1 −0.5 0 0.5 1 1.5 2 −1 −0.5 0 0.5 1 1.5
x 104 x 104
0.035 10
0.03
5
0.025
0.02 0
0.015 −5
0.01
−10
0.005
0 −15
−15000 −10000 −5000 0 5000 −10000 −8000−6000 −4000 −2000 0 2000
0.4 10
0.35
5
0.3
0.25 0
0.2
0.15 −5
0.1
−10
0.05
0 −15
−200 −100 0 100 200 −150 −100 −50 0 50 100 150
Fig. 7.11. Example of JS Feature pursuit. Three Haar based features and an exam-
ple of application to a positive sample are shown, from best to worst. At the center
of each row, the corresponding h+ (x) (dash line) and h− (x) (solid line) distributions
are represented, and at the right, φ(x) is shown. SJS values for each feature, from
top to bottom, are 6360.6, 2812.9, and 1837.7, respectively.
JSBoost (98.4% detection rate) improves KLBoost results (98.1%) and out-
performs Real Adaboost ones (97.9%). Furthermore, JSBoost achieves higher
detection rates with a lower number of iterations than other methods.
Initialize wi = 2N1+ for positive samples and wi = 2N1− for negative samples
for k=1 to K do
Select JS feature φk by Jensen–Shannon divergence using weights wi
1 h+ (φk (x))
fk (x) = log k
2 h−
k
(φk (x))
Update weights wi = wi · exp(−β
k ) · yi · fk (xi ), i = 1, . . . , N , and
normalize weights so that i wi = 1
end
Output:⎧Strong classifier:
⎪
⎨
K
1 h+k (φk (x))
1, log − ≥0
hf (x) = 2 h k (φk (x))
⎪
⎩ 1=1
0, otherwise
most suitable model for classification is the most uniform one, given some con-
straints which we call features. The maximum entropy classifier has to learn
a conditional distribution from labeled training data. Let x ∈ X be a sample
and c ∈ C be a label, then the distribution to be learnt is P (c|x), which is to
say, we want to know the class given a sample. The sample is characterized
by a set of D features (dimensions). For the maximum entropy classifier we
will formulate the features as fi (x, c), and 1 ≤ i ≤ NF , and NF = D|C|. Each
sample has D features for each one of the existing classes, as shown in the
following example.
Table 7.3 contains the original feature space of a classification problem
where two classes of vehicles are described by a set of features. In Table 7.4
the features are represented according to the formulation of the maximum
entropy classifier.
310 7 Classifier Design
Table 7.4. The values of the feature function fi (x, c) for the training data.
fi (x, c)
Class Class(i) = 1 = motorbike Class(i) = 2 = car
feature “has gears” “# wheels” “# seats” “has gears” “# wheels” “# seats”
x, c \ i i=1 i=2 i=3 i=4 i=5 i=6
x = 1, c = 1 1 3 3 0 0 0
x = 2, c = 1 0 2 1 0 0 0
x = 3, c = 1 1 2 2 0 0 0
x = 4, c = 2 0 0 0 1 4 5
x = 5, c = 2 0 0 0 1 4 2
Here P (x, c) and P (x) are the expected values in the training sample. The
equation is a constraint of the model, it forces the expected value of f to
be the same as the expected value of f in the training data. In other words,
the model has to agree with the training set on how often to output a feature.
The empirically consistent classifier that maximizes entropy is known as
the conditional exponential model. It can be expressed as
N
1 F
where λi are the weights to be estimated. In the model there are NF weights,
one for each feature function fi . If a weight is zero, the corresponding fea-
ture has no effect on classification decisions. If a weight is positive then the
corresponding feature will increase the probability estimates for labels where
this feature is present, and decrease them if the weight is negative. Z(x) is a
normalizing factor for ensuring a correct probability and it does not depend
on c ∈ C: N
F
NF
P (x) P (c|x)fi (x, c) exp(Δλi fj (x, c))
c∈C
x∈X j=1 (7.56)
= P (x, c)fi (x, c)
x∈X c∈C
Update λi ← λi + Δλi
end
until convergence of all λi ;
Output: The λi parameters of the conditional exponential model (Eq. 7.53).
Table 7.5. The values of the feature function fi (x, c) for the new sample supposing
that its class is motorbike (x = 6, c = 1) and another entry for the supposition that
the class is car (x = 6, c = 2).
fi (x, c)
Class Class(i) = 1 = motorbike Class(i) = 2 = car
feature “has gears” “# wheels” “# seats” “has gears” “# wheels” “# seats”
x, c \ i i=1 i=2 i=3 i=4 i=5 i=6
x = 6, c = 1 1 4 4 0 0 0
x = 6, c = 2 0 0 0 1 4 4
We want to classify a new unlabeled sample which has the following features:
has gears, 4 wheels, 4 seats
We have to calculate the probabilities for each one of the existing classes, in
this case C = motorbike, car. Let us put the features using the notation of
Table 7.4. We have two new samples, one of them is supposed to belong to
the first class, and the other one to the second class (see Table 7.5).
According to the estimated model the conditional probability for the sam-
ple (x = 6, c = 1) is
N
1 F
problem. Actually, iterative scaling and its improvements are good methods
for working in the dual space, but, what is the origin of this dual framework?
Let us start by the classical discrete formulation of the ME problem:
1
∗
p (x) = arg max p(x)
p(x)
x
log p(x)
s.t p(x)Fj (x) = aj , j = 1, . . . , m
x
p(x) = 1
x
p(x) ≥ 0 ∀x , (7.64)
where aj are empirical estimations of E(Gj (x)). It is particularly interesting
here to remind the fact that the objective function is concave and, also, that
all constraints are linear. These properties are also conserved in the following
generalization of the ME problem:
p(x)
∗
p (x) = arg min D(p||π) = p(x) log
p(x)
x
π(x)
s.t p(x)Fj (x) = aj , j = 1, . . . , m
x
p(x) = 1
x
p(x) ≥ 0 ∀x (7.65)
where π(x) is a prior distribution, and thus, when the prior is uniform we
have the ME problem. This generalization is known as the Kullback’s mini-
mum cross-entropy principle [100]. The primal solution is obtained through
Lagrange multipliers, and has the form of the following exponential function
(see Section 3.2.2):
1 m
p∗ (x) = e j=1 λj Fj (x) π(x) (7.66)
Z(Λ)
where Λ = (λ1 , . . . , λm ), and the main difference with respect to the solution
to the ME problem is the factor corresponding to the prior π(x). Thinking
of the definition of p∗ (x) in Eq. 7.66 as a family of exponential functions
(discrete or continuous) characterized by: an input space S ⊆ Rk , where the x
are defined, Λ ∈ Rm (parameters), F (x) = (F1 (x), . . . , Fm ) (feature set), and
π(x): Rk → R (base measure, not necessarily a probability measure). Thus,
the compact form of Eq. 7.66 is
1 Λ·F (x)
pΛ (x) = e π(x) (7.67)
Z(Λ)
Λ·F (x)
and considering that Z(Λ) = xe π(x), the log-partition function is
denoted G(Λ) = ln x eΛ·F (x) π(x). Thus it is almost obvious that
7.6 Maximum Entropy Principle for Classification 315
1 Λ·F (x)
pΛ (x) = e π(x) = eΛ·F (x)−G(Λ) π(x) (7.68)
eG(Λ)
where the right-hand side of the latter equation is the usual definition of an
exponential family:
pΛ (x) = eΛ·F (x)−G(Λ) π(x) (7.69)
The simplest member of this family is the Bernoulli distribution defined over
S = {0, 1} (discrete distribution) by the well-known parameter θ ∈ [0, 1] (the
success probability) so that p(x) = θx × (1 − θ)1−x , x ∈ S. It turns out that
E(x) = θ. Beyond the input space S, this distribution is posed in exponential
form by setting: T (x) = x, π(x) = 1 and choosing
pλ (x) ∝ eλx
pλ (x) = eλx−G(λ)
+ ,
G(λ) = ln eλx = ln eλ·0 + eλ·1 = ln(1 + eλ )
x
λx
e
pλ (x) =
1 + eλ
eλ
pλ (1) = =θ.
1 + eλ
eλ 1
pλ (0) = 1 − pλ (1) = 1 − λ
= =1−θ (7.70)
1+e 1 + eλ
The latter equations reveal a correspondence between the natural space λ and
the usual parameter θ = eλ /(1 + eλ ). It is also interesting to note here that
the usual parameter is related to the expectation of the distribution. Strictly
speaking, in the general case the natural space has as many dimensions as the
number of features, and it is defined as N = {Λ ∈ Rm : −1 < G(Λ) < 1}.
Furthermore, many properties of the exponential distributions, including the
correspondence between parameter spaces, emerge from the analysis of the
log-partition. First of all, G(Λ) = ln x eΛ·F (x) π(x) is strictly convex with
respect to Λ. The partial derivatives of this function are:
∂G 1 ∂ Λ·F (x)
= e π(x)
∂λi G(Λ) ∂λi x
1 ∂ m
= e j=1 λj Fj (x) π(x)
G(Λ) x ∂λi
1 m
= e j=1 λj Fj (x) π(x)Fi (x)
G(Λ) x
1 Λ·F (x)
= e π(x)Fi (x)
G(Λ) x
= pΛ (x)Fi (x)
x
= EΛ (Fi (x)) = ai (7.71)
316 7 Classifier Design
N
N
(X|Λ) = log pΛ (xi ) = log(pΛ (xi ))
i=1 i=1
N
= (Λ · F (xi ) − G(Λ) + log π(xi ))
i=1
N
N
=Λ ·F (xi ) − N G(Λ) + log π(xi ) (7.72)
i=1 i=1
Then, for finding the distribution maximizing the log-likelihood we must set
to zero its derivative with respect to Λ:
N
(X|Λ) = 0 ⇒ ·F (xi ) − N G (Λ) = 0
i=1
1
N
⇒ G (Λ) ≡ a = ·F (xi ) (7.73)
N i=1
which implies that Λ corresponds to the unique possible distribution fitting the
average vector a. Thus, if we consider the one-to-one correspondence, between
the natural space and the expectation space established through G (Λ), the
maximum-likelihood distribution is determined up to the prior π(x). If we
know the prior beforehand, there is a unique member of the exponential dis-
tribution family, p∗ (x) satisfying EΛ (F (x)) = a which is the closest to π(x),
that is, which minimizes D(p||π). That is, any other distribution p(x) satisfy-
ing the constraints is farther from the prior:
where the change of variable from p∗ (x) to p(x) in the right term of the fourth
line is justified
by the fact that
both pdfs satisfy the expectation
constraints.
Thus: x p∗ (x)(Λ
∗
· F (x)) = x p(x)(Λ ∗
· F (x)) because ∗
x p(x) F (x) =
EΛ∗ (F (x)) = x p(x)F (x) = EΛ (F (x)) = a. This is necessary so that the
Kullback–Leibler divergence satisfies a triangular equality (see Fig. 7.12, top):
Otherwise (in the general case that p∗ (x) to p(x) satisfy, for instance, the same
set of inequality constraints) the constrained minimization of cross-entropy
(Kullback–Leibler divergence with respect to a prior) satisfies [45, 144]:
In order to prove the latter inequality, we firstly remind that D(p∗ ||π) =
minp∈Ω D(p||π), Ω being a convex space of probability distributions (for in-
stance, the space of members of the exponential family). Then, let p∗α (x) =
(1 − α)p∗ (x) + αp(x), with α ∈ [0, 1], be pdfs derived from a convex combina-
tion of p∗ (x) and p(x). It is obvious that
D(p||π) π
D(p*||π)
p Ξ={p:E(F(x))=a}
Ω space
Ξ3={p:E(F3(x))=a3} Ξ2={p:E(F2(x))=a2}
p0=π
p'2
p1
p5 p3
p7
p* p
Ξ1={p:E(F1(x))=a1}
and, thus, D (p∗0 ||p) ≥ 0. Now, taking the derivative of D(p∗α ||p) with respect
to α:
0 ≤ D (p∗α ||p)
8
d ∗ (1 − α)p∗ (x) + αp(x) 88
= ((1 − α)p (x) + αp(x)) log 8 α=0
dα x
π(x)
p∗ (x)
= (p(x) − p∗ (x)) 1 + log
x
π(x)
∗
p (x) ∗ ∗ p∗ (x)
= p(x) + p(x) log − p (x) + p (x) log
x
π(x) x
π(x)
7.6 Maximum Entropy Principle for Classification 319
p∗ (x) p∗ (x)
= p(x) log − p∗ (x) log
x
π(x) x
π(x)
p(x) p∗ (x) p∗ (x)
∗
= p(x) log − p (x) log
x
π(x) p(x) x
π(x)
p(x) p∗ (x) ∗ p∗ (x)
= p(x) log + p(x) log − p (x) log
x
π(x) p(x) x
π(x)
= D(p||π) − D(p∗ ||p) − D(p∗ ||π) (7.78)
Therefore,
D(p||π) − D(p∗ ||p) − D(p∗ ||π) ≥ 0 ≡ D(p||π) ≥ D(p∗ ||p) + D(p∗ ||π) (7.79)
Thus, for the equality case (the case where the equality constraints are satis-
fied) it is timely to highlight the geometric interpretation of such triangular
equality. In geometric terms, all the pdfs of the exponential family which sat-
isfy the same set of equality constraints lie in an affine subspace Ξ (here is the
connection with information geometry, introduced in Chapter 4). Therefore,
p∗ can be seen as the projection of π onto such subspace. That is the defini-
tion of I-projection or information projection. Consequently, as p is also the
I-projection of π in the same subspace, we have a sort of Pythagorean theorem
from the latter equality. Thus, the Kullback–Leibler divergences play the role
of the Euclidean distance in the Pythagorean theorem. In addition, pi and p∗
belong to Ω, the convex space. Actually, the facts that Ω is a convex space,
and also the convexity of D(p||π), guarantee that p∗ is unique as we have seen
above. Anyway, the p∗ distribution has the three following equivalent prop-
erties: (i) it is the I-projection of the prior; (ii) it is the maximum likelihood
distribution in Ω; and (iii) p∗ ∈ Ξ ∩ Ω.
In algorithmic terms, the latter geometric interpretation suggests an iter-
ative approach to find p∗ from an initial p, for instance π. A very intuitive
approach was suggested by Burr 20 years ago [32]. For the sake of simplic-
ity, consider that we have only two expectation constraints to satisfy, that
is F (x) = (F1 (x), F2 (x)), and E(F1 (x)) = a1 , E(F2 (x)) = a2 , thus being
a = (a1 , a2 ), and Λ = (λ1 , λ2 ) (here, for simplicity, we do not include λ0 , the
multiplier ensuring up-to-one sum). Each of the constraints has associated a
different convex space of probability distributions satisfying them: Ξ1 and Ξ2 ,
respectively. Obviously, p∗ lies in Ξ1 ∩ Ξ2 . However, starting by pt=0 = π it is
possible to alternatively generate pt ∈ Ξ1 (t even) and pt ∈ Ξ2 (t odd) until
convergence to p∗ . More precisely, p1 has the form
p1 (x) = eλ1 F1 (x)−G(λ1 ) π(x) (7.80)
as G (λ1 ) = Eλ1 (F1 (x)) = a1 , then λ1 is uniquely determined. Let us then
project p1 onto Ξ2 using p1 as prior. Then we have
p2 (x) = eλ2 F2 (x)−G(λ2 ) p1 (x)
= eλ2 F2 (x)−G(λ2 ) eλ1 F1 (x)−G(λ1 ) π(x)
= eλ1 F1 (x)+λ2 F2 (x)−(G(λ1 )+G(λ2 )) π(x) (7.81)
320 7 Classifier Design
where we must find λ2 , given λ1 , which, in general, does not satisfy the con-
straint generating Ξ2 . However, a key element in this approach is that the
triangular equality (Eq. 7.75) ensures that: (i) p1 is closer to p∗ than p0 ; and
(ii) p2 is closer to p∗ than p1 . In order to prove that, the cited triangular
equality is rewritten as
D(p||π) = D(p∗ ||π) − D(p∗ ||p) ≡ D(p∗ ||p) = D(p∗ ||π) − D(p||π) (7.82)
because D(p||p∗ ) = −D(p∗ ||p). This new form of the triangular equality is
more suitable for giving the intuition of the convergence of alternate projec-
tions:
The latter definitions ensure that p̃∗ (x, j) = p∗ (x)π̃(x, j). Furthermore, as Ψ̃1
and Ψ̃2 are linear families whose marginals with respect to j yield a, then
Ψ = Ψ̃1 ∩ Ψ̃2 . This implies that we may iterate alternating projections onto
the latter spaces until convergence to the optimal p∗ lying in the intersection.
We define p̃2t (x, j) = pt (x)Fj , t = 0, 1, . . ., with p0 = π. Then, let p̃2t+1 be
the I-projection of p̃2t on Ψ̃1 , and, p̃2t+2 the I-projection of p̃2t+1 on Ψ̃2 . The
projection of p̃2t on Ψ̃1 can be obtained by exploiting the fact that we are
projecting on a family of distribution defined by a marginal. Consequently
the projection can be obtained by scaling the projecting pdf with respect to the
value of the corresponding marginal:
j p̃2t (x, j)
p̃2t+1 (x, j) = p̃2t (x, j)
x p̃2t (x, j)
aj
= pt (x)Fj (x) , aj,t = pt (x)Fj (x) (7.86)
aj,t x
m
p̃(x, j)
D(p̃(x)||p̃2t+1 (x)) = p̃(x, j) log
x j=1
p̃2t+1 (x, j)
m
p(x)Fj (x)
= p(x)Fj (x) log aj
x j=1
pt (x)Fj (x) aj,t
m
p(x)
= p(x) Fj (x) log aj
x j=1
pt (x) aj,t
⎧ ⎫
⎨
p(x) m
aj,n ⎬
= p(x) log + Fj (x) log
⎩ pt (x) j=1 aj ⎭
x
⎧ ⎫
⎨ m F (x)
p(x) aj,n j ⎬
= p(x) log + log (7.87)
⎩ pt (x) aj ⎭
x j=1
322 7 Classifier Design
In order to find a minimizer for the latter equation, the definition of the
following quantity:
m Fj (x)
aj Rt+1 (x)
Rt+1 (x) = pt (x)
a
that is pt (x) =
& 'Fj (x) (7.88)
j,n m aj
j=1
j=1 aj,n
Thus, the minimizing distribution is pt+1 (x) = Rt+1 (x)/Zt+1 , the minimum
being 1/Zt+1 . Consequently we obtain
m Fj (x)
1 aj
p̃2t+1 (x, j) = pt+1 (x)Fj (x), pt+1 (x) = pt (x) (7.90)
Zt+1 j=1
aj,t
and the following recurrence relation, defining the generalized iterative scal-
ing [47], is satisfied:
m Fj (x)
aj
Rt+1 (x) = Rt (x) , bj,t = Rt (x)Fj (x) (7.91)
j=1
bj,t x
N
N
(X|Λ) = log(pΛ (xi )) = (Λ · F (xi ) − G(Λ) + log π(xi )) (7.92)
i=1 i=1
N
(X|Λ + Δ) − (X|Λ) = (Δ · F (xi )) − G(Λ + Δ) + G(Λ) (7.93)
i=1
N
(Δ · F (xi )) − G(Λ + Δ) + G(Λ)
i=1
N
= (Δ · F (xi )) − log (Λ+Δ)·F (xi )
e π(xi ) + log Λ·F (xi )
e π(xi )
i=1 xi xi
N
e(Λ+Δ)·F (xi ) π(xi )
= (Δ · F (xi )) − log Λ·F (x ) xi
xi e
i π(x )
i=1 i
N
xi p(xi )Z(Λ)e
Δ·F (xi )
= (Δ · F (xi )) − log
i=1 xi p(xi )Z(Λ)
N
xi p(xi )e
Δ·F (xi )
= (Δ · F (xi )) − log
i=1 xi p(xi )
N
= (Δ · F (xi )) − log p(xi )eΔ·F (xi )
= (X|Λ + Δ)− (X|Λ).
i=1 xi
(7.94)
N & '
(X|Λ+Δ)−(X|Λ) ≥ (Δ · F (xi )) + 1 − p(xi )eΔ·F (xi ) ≥ 0 (7.95)
i=1 xi
A(Δ|Λ)
Then, the right side of the latter equation is a lower bound of the log-likelihood
increment. Next formal task consists of posing that lower bound A(Δ|Λ) in
terms of the δj : Δ = (δ1 , . . . , δj , . . . , δm ) so that setting to zero the partial
derivatives with respect to each δj allows to obtain it. To that end, it is key
to realize that the derivatives of A(Δ|Λ) will leave each λj as a function
of all the other ones due to the exponential (coupled variables). In order to
324 7 Classifier Design
decouple each λj so that each partial derivative depends only on it, a useful
trick (see [17]) is to perform the following transformation on the exponential:
m m m δj Fj (xi )
δj Fj (xi ) ( Fj (xi )) m
eΔ·F (xi ) = e
j=1 j=1 F (x )
j=1 =e (7.96)
j=1 j i
Then, let us define
pj (x) = Fj (x)/( j Fj (x)) the pdf resulting from the
introduction of j Fj (x) in the exponentiation. The real trick comes when
Jensen’s inequality for pj (x) is exploited. As we have seen in other chapters in
the book (see for instance Chapter 3 on segmentation), given a convex function
ϕ(E(x)) ≤
ϕ(x) we have that E(ϕ(x)). Therefore, as the exponential
is convex
we have that e x pj (x)q(x) ≤ x p(x)eq(x) . Then setting q(x) = j Fj (x) we
have that
⎛ ⎞ ⎛ ⎞
N
m
m m
A(Δ|Λ) ≥ ⎝ δj Fj (xi )⎠ + 1 − p(xi ) ⎝ pj (xi )eδj j=1 Fj (xi ) ⎠
i=1 j=1 xi j=1
B(Δ|Λ)
⎛ ⎞
N
m & m '
= ⎝ δj Fj (xi )⎠ + 1 − p(xi ) eδj j=1 Fj (xi ) . (7.97)
i=1 j=1 xi
Then we have
∂B(Δ|Λ)
N & m '
= Fj (xi ) − p(xi ) Fj (xi )eδj k=1 Fk (xi ) (7.98)
∂δj i=1 x i
and setting each partial derivative to zero we obtain the updating equations.
The difference between the expression above and the ones used in Alg. 19 is
the inclusion of the conditional model used for formulating the classifier. The
verification of these equations is left as an exercise (see Prob. 7.12). A faster
version of the Improved Iterative Scaling algorithm is proposed in [86]. The
underlying idea of this latter algorithm is to exploit tighter bounds by decou-
pling only part of the variables.
where the equality holds when z = min Dφ (w, y) with w ∈ Ω the so-called
Bregman projection of y onto the convex set Ω (this is the generalization of
information projection illustrated in Fig. 7.12). Thus, given the connection
between Bregman divergences and information projection, it is not surprising
that when considering distributions belonging to the exponential family, each
distribution has associated a natural divergence. Actually, there is an inter-
esting bijection between exponential pdfs and Bregman divergences [11]. Such
bijection is established through the concept of Legendre duality. As we have
seen in the previous section, the expressions
||c−x||2 ≤ r2 . Actually the MEB problem under the latter distortion is highly
connected with a variant of Support Vector Machines (SVMs) classifiers [165],
known as Core Vector Machines [158]. More precisely, the approach is known
as Ball Vector Machines [157] which is faster, but only applicable in contexts
where it can be assumed that the radius r is known beforehand.
Core Vector Machines (CVMs) inherit from SVMs the so-called kernel
trick. The main idea behind the kernel trick in SVMs is the fact that when the
original training data are not linearly separable, it may be separable when we
project such data in a space of higher dimensions. In this regard, kernels k(., .),
seen as dissimilarity functions, play a central role. Let kij = k(xi , xj ) = ϕ(xi )·
ϕ(xj ) be the dot product of the projections through function ϕ(·) of vectors
xi , xj ∈ S. Typical examples of kernels are the polynomial and the Gaussian
ones. Given vectors xi√= (u1√ , u2 ) and x √j = (v1 , v2 ) it is straightforward to see
that using ϕ(x) = (1, 2u1 , 2u2 , u21 , 2u1 u2 , u22 ) yields the quadratic kernel
k(xi , xj ) = ϕ(xi ) · ϕ(xj ) = (xi · xj + 1)d with d = 2. Given the definition
of a kernel and a set of vectors S, if the kernel is symmetric, the resulting
|S| × |S| = N × N matrix Kij = k(x i , x
j ) is the Gramm matrix if it is positive
semi-definite, that is, it satisfies i j Kij ci cj ≥ 0 for all choices of real
numbers for all finite set of vectors and choices of real numbers ci (Mercer’s
theorem). Gram matrices are composed of inner products of elements of a set
of vectors which are linearly independent if and only if the determinant of K
is nonzero.
In CVMs, the MEB problem is formulated as finding
min r2
c,r
which is called the primal problem. The primal problem has as many con-
straints as elements in S because all of them must be inside the MEB. Thus,
for any constraint in the primal problem there is a Lagrange multiplier in the
Lagrangian, which is formulated as follows:
N
L(S, Λ) = r2 − λi (r2 − ||c − ϕ(xi )||2 ) (7.108)
i=1
max ΛT diag(K) − ΛT KΛ
Λ
s.t ΛT 1 = 1, Λ ≥ 0 (7.109)
being 1 = (1, . . . , 1)T and 0 = (0, . . . , 0)T . The solution to the dual problem
Λ∗ = (λ∗1 , . . . , λ∗1 )T comes from solving a constrained quadratic programming
(QP) problem. Then, the primal problem is solved by setting
328 7 Classifier Design
N
c∗ = λ∗i ϕ(xi ), r = Λ∗ T diag(K) − Λ∗ T KΛ∗ (7.110)
i=1
In addition, when using the Lagrangian and the dual problem, the Karush–
Kuhn–Tucker (KKT) condition referred to as complementary slackness en-
sures that λi (r2 −||c−ϕ(xi )||2 ) = 0 ∀i. This means that when the ith equality
constraint is not satisfied (vectors inside the hypersphere) then λi > 0 and
otherwise (vectors defining the border, the so-called support vectors) we have
λi = 0. In addition, using the kernel trick and c we have that the distances
between projected points and the center can be determined without using
explicitly the projection function
N
N
N
||c∗ − ϕ(xl )||2 = λ∗i λ∗j Kij − 2 λ∗i Kil + Kll (7.111)
i=1 j=1 i=1
projection function yield better results than the polynomial ones, as we show
in Fig. 7.13. In the one-class classification problem, support vectors may be
considered outliers. In the polynomial case, the higher the degree the tighter
the representation (nonoutliers are converted into outliers). Although in this
case many input examples are accepted the hypershere is too sparse. On the
contrary, when using a Gaussian kernel with an intermediate variance there
is a trade-off between overfitting the data (for small variance) and too much
generalization (higher variance).
Core Vector Machines of the SVDD type may also be built on a different distance/divergence (not necessarily Euclidean) instead of a different kernel. This allows us to define Bregman balls and to face the MEB from a different perspective: the Smallest Enclosing Bregman Ball (SEBB) [119]. Then, given a Bregman divergence D_φ, a Bregman ball is B_φ^{(c,r)} = {x ∈ S : D_φ(c, x) ≤ r}. The definition of Bregman divergence applied to this problem is

D_φ(x, y) = φ(x) − φ(y) − (x − y)^T ∇φ(y),   (7.112)

where φ(·) is a strictly convex and differentiable generator.
Fig. 7.13. Top: SVDD using different polynomial kernels (degrees). Bottom: using different Gaussian kernels (variances σ = 1, 5, 15), for C = 25.0 and C = 0.1. In both cases, the circles denote the support vectors. Figure by D. M. J. Tax and R. P. W. Duin [153] (© Elsevier 2004). See Color Plates.
Analogously to the MEB case, the Lagrangian of the SEBB problem is

L(S, Λ) = r − Σ_{i=1}^N λ_i ( r − D_φ(c, x_i) ),   (7.113)
with Λ ≥ 0 according to the dual feasibility KKT condition. Considering the definition of Bregman divergence in Eq. 7.112, the partial derivatives of L with respect to c and r are

∂L(S, Λ)/∂c = ∇φ(c) Σ_{i=1}^N λ_i − Σ_{i=1}^N λ_i ∇φ(x_i),
∂L(S, Λ)/∂r = 1 − Σ_{i=1}^N λ_i.   (7.114)
Setting these derivatives to zero yields

Σ_{i=1}^N λ_i = 1,
∇φ(c) = Σ_{i=1}^N λ_i ∇φ(x_i)   ⇒   c* = ∇^{-1}φ( Σ_{i=1}^N λ_i ∇φ(x_i) ),   (7.115)
which reduces to the weighted arithmetic mean of the data when the Bregman generator corresponds to the Euclidean distance. The key point here is that there is a bijection between generators and the values of the inverse of the generator derivative which yield c* (the so-called functional averages). As we have just seen, for the Euclidean distance the functional average is c*_j = Σ_{i=1}^N λ_i x_{j,i} (arithmetic mean). For the Kullback–Leibler divergence we have c*_j = Π_{i=1}^N x_{j,i}^{λ_i} (geometric mean), and for the Itakura–Saito distance we obtain c*_j = 1 / ( Σ_{i=1}^N λ_i / x_{j,i} ) (harmonic mean). Then, the solution to the SEBB problem depends on the choice of the Bregman divergence. In any case, the resulting dual problem is complex, and it is more practical to compute g*, an estimation of c*, by solving the following problem:
min_{g,r} r
f being the minimal nonzero value of the Hessian norm ||Hφ|| inside the convex closure of S, and ε the error assumed for finding the center of the MEB within an approximation algorithm using the Euclidean distance: it is assumed that r ≤ (1 + ε)r*, see for instance [33], and points inside this ball are called core sets. Such an algorithm, named BC, can be summarized as follows: (i) choose at random c ∈ S; (ii) for a given number of iterations, for t = 1, . . . , T − 1: set x ← arg max_{x'∈S} ||c − x'||², and then set c ← (t/(t+1)) c + (1/(t+1)) x.
The underlying idea of the latter algorithm is to move along the line between the current estimate of the center and the farthest point from it. This approach is quite efficient for a large number of dimensions. The adaptation to find the center when using Bregman divergences (the BBC algorithm) is straightforward: simply change the content of the main loop: set x ← arg max_{x'∈S} D_φ(c, x'), and then set c ← ∇^{-1}φ( (t/(t+1)) ∇φ(c) + (1/(t+1)) ∇φ(x) ). This method (see results in Fig. 7.14) converges considerably faster than BC in terms of the error (D_φ(c, c*) + D_φ(c*, c))/2. Another method, less accurate than BBC but better in terms of rate of convergence, and better than BC, is MBC. The MBC algorithm exploits the estimate g*: (i) define a new set of transformed vectors S ← {∇φ(x) : x ∈ S}; (ii) call BC, g* ← BC(S, T); (iii) obtain c ← ∇^{-1}φ(g*). In any case, whatever the method used, once c* (or an approximation of it) is obtained, the computation of r* is straightforward. Similar methods have been used to simplify the QP problem in CVMs or SVDDs.
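A compact Python sketch of BC and BBC is shown below. It is an illustration added here, under our own choices: toy positive data and the Itakura–Saito generator φ(x) = −Σ_j log x_j, for which ∇φ(x) = −1/x and ∇^{-1}φ(z) = −1/z; the gradient-space average in the BBC update is precisely the functional average of Eq. (7.115).

```python
import numpy as np

def bc(S, T):
    """Badoiu-Clarkson approximation of the Euclidean MEB center."""
    c = S[0].copy()                                        # (i) pick a starting point
    for t in range(1, T):
        x = S[np.argmax(((S - c) ** 2).sum(axis=1))]       # farthest point from c
        c = (t * c + x) / (t + 1)                          # move along the segment
    return c

def bbc(S, T, grad, grad_inv, div):
    """BBC: the same scheme, but the average is taken in the gradient space of phi."""
    c = S[0].copy()
    for t in range(1, T):
        x = S[np.argmax(div(c, S))]                        # farthest point in D_phi(c, .)
        c = grad_inv((t * grad(c) + grad(x)) / (t + 1))    # functional average, Eq. (7.115)
    return c

# Itakura-Saito generator phi(x) = -sum_j log x_j (data must be positive).
grad = lambda x: -1.0 / x
grad_inv = lambda z: -1.0 / z
div = lambda c, S: (c / S - np.log(c / S) - 1.0).sum(axis=1)   # D_phi(c, x') row by row

rng = np.random.default_rng(2)
S = rng.uniform(0.5, 4.0, size=(500, 3))                   # toy positive vectors

print("BC  center:", bc(S, T=100))
print("BBC center:", bbc(S, T=100, grad=grad, grad_inv=grad_inv, div=div))
```

Changing the generator (and the matching divergence) is all that is needed to obtain the geometric- or arithmetic-mean variants discussed above.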
Bregman balls (and ellipsoids) have recently been tested in one-class classification contexts [118]. Suppose, for example, that we have a Gaussian distribution. As the Gaussian pdf belongs to the exponential family, the optimal Bregman divergence to choose is the Kullback–Leibler divergence, in order to find the corresponding ball and to detect supports for classifying test vectors (see Prob. 7.14).
Fig. 7.14. BBC results with k = 2 for different Bregman divergences: Itakura–Saito
vs. Kullback–Leibler. (Figure by courtesy of Richard Nock).
The empirical risk R̂ depends on the properties of the loss function chosen. The typical one is the 0/1-loss function, ℓ^{0/1}(y, h(x)) = I(h(x) ≠ y) = I(sign(yh(x)) ≠ +1), I(·) being an indicator function. In that case, R̂ is renamed ℓ^{0/1} and scaled by N to remove the average. However, the empirical risk using this loss function is hard to minimize. For the sake of clarity, in this section it is more convenient to: (i) assume that either h : X → R or h : X → [0, 1]; and (ii) take Y = {0, 1} and Y* = {−1, 1}. Mapping h to the reals allows us to consider |h|, the confidence of the classifier, and considering the new Y and Y*, which are equivalent in terms of labeling an example, gives more flexibility. In this regard, the 0/1 loss may be defined either as ℓ^{0/1}_R(y*, h(x)) = I(σ(h(x)) ≠ y*) (σ(z) = +1 if z ≥ 0 and −1 otherwise) if the image of h is R, or as ℓ^{0/1}_{[0,1]}(y, h(x)) = I(τ(h(x)) ≠ y) (τ(z) = 1 if z ≥ 1/2 and 0 otherwise) if the image of h is [0, 1]. The empirical risks R^{0/1}_R and R^{0/1}_{[0,1]} are defined accordingly. In any case, surrogates relax the problem by estimating a tight upper bound of ℓ^{0/1} (whatever the notation used) by means of different, typically convex, loss functions. It turns out that some of these functions are a subset of the Bregman divergences. The most typical ones are (see Fig. 7.15)
R^{exp}(y*, h(x)) = e^{−y* h(x)}            (exponential loss),
R^{log}(y*, h(x)) = log(1 + e^{−y* h(x)})   (logistic loss),
R^{sqr}(y*, h(x)) = (1 − y* h(x))²          (squared loss).   (7.121)
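The following short sketch (ours; the 1/ln 2 rescaling of the logistic loss is an assumption playing the role of the normalization constants a_φ, b_φ introduced below) checks numerically, on a grid of margins y*h, that these convex losses upper-bound the 0/1 loss.

```python
import numpy as np

margins = np.linspace(-3.0, 3.0, 601)                 # values of y* h(x)

zero_one = (margins <= 0).astype(float)               # 0/1 loss as a function of the margin
exp_loss = np.exp(-margins)                           # exponential loss
log_loss = np.log(1.0 + np.exp(-margins)) / np.log(2.0)   # logistic loss, rescaled by 1/ln 2
sqr_loss = (1.0 - margins) ** 2                       # squared loss

for name, loss in [("exp", exp_loss), ("log/ln2", log_loss), ("sqr", sqr_loss)]:
    print(name, "upper-bounds 0/1:", bool(np.all(loss >= zero_one - 1e-12)))
```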
Surrogates must satisfy ℓ^{0/1}(h, S) ≤ ℓ(h, S). First of all, in order to define permissible surrogates based on a given loss function ℓ(·,·), it must satisfy three properties: (i) ℓ(·,·) is lower bounded by zero; (ii) arg min_x ℓ_{[0,1]}(x, S) = q, x being the output of the classifier mapped to the interval [0, 1], and ℓ_{[0,1]} an empirical risk based on a loss defined on [0, 1] when the output
Fig. 7.15. The 0/1 loss functions and other losses used for building surrogates.
Our main concern here is the link between permissible functions and surrogates: a loss ℓ(y, h) is properly defined and satisfies the three properties referred to above if and only if ℓ(y, h) = D_φ(y, h), D_φ(·,·) being a Bregman divergence with a permissible generator φ(·) [120]. Using again the Legendre transformation, we consider the Legendre conjugate φ* of φ. As D_φ(y, ∇^{-1}φ(h)) = D_φ(1 − y, 1 − ∇^{-1}φ(h)), because the Bregman divergence is a loss function and loss functions satisfy this symmetry, we have

D_φ(y, ∇^{-1}φ(h)) = φ*(−y*h) + a_φ = D_φ(0, ∇^{-1}φ(−y*h)) = D_φ(1, ∇^{-1}φ(y*h)),   (7.127)
where proving the latter two equivalences is straightforward. Thus, φ*(·) seems to behave like a Bregman divergence (it does, because a_φ = 0 for the well-known permissible φ). Then, we define F_φ(x) = (φ*(−x) − a_φ)/b_φ, which, for a_φ = 0 and b_φ = 1, yields F_φ(x) = φ*(−x). This new function is strictly convex and satisfies R^{0/1}(y*, h) ≤ F_φ(y*h), which ensures that F_φ defines a surrogate because

ℓ^{0/1}(h, S) ≤ ℓ_φ(h, S) = Σ_{i=1}^N F_φ(y*_i h(x_i)) = Σ_{i=1}^N D_φ(0, ∇^{-1}φ(−y*_i h(x_i))).   (7.128)

When using the latter definition of surrogates, it is worth recalling that

F_φ(y*h) = −y*h + √(1 + (y*h)²)   for φ_M,
F_φ(y*h) = log(1 + e^{−y*h})       for φ_Q,
F_φ(y*h) = (1 − y*h)²              for φ_B,   (7.129)
which correspond, in the two latter cases, to the losses defined above (logistic and squared). Given the basic theory, how do we apply it to a linear-separator (LS) classifier like the one coming from boosting? Such a classifier, H(x_i) = Σ_{t=1}^T α_t h_t(x_i), is composed of weak classifiers h_t and leveraging coefficients α_t. First of all, we are going to assimilate ∇^{-1}φ(−y*_i h(x_i)) to the weights w_i used in the sampling strategy of boosting. This assimilation comes naturally from the latter definitions of F_φ(y*h): the worse an example is classified, the higher its loss and the higher its weight, and vice versa. It is also convenient to group the disparities between all classifiers and all examples in a unique matrix M of dimensions N × T:
        ⎛ y*_1 h_1(x_1)  . . .  y*_1 h_t(x_1)  . . .  y*_1 h_T(x_1) ⎞
        ⎜       ⋮                     ⋮                     ⋮       ⎟
M = − ⎜ y*_i h_1(x_i)  . . .  y*_i h_t(x_i)  . . .  y*_i h_T(x_i) ⎟   (7.130)
        ⎜       ⋮                     ⋮                     ⋮       ⎟
        ⎝ y*_N h_1(x_N)  . . .  y*_N h_t(x_N)  . . .  y*_N h_T(x_N) ⎠
A key element for finding the optimal classifier through this approach is the Bregman–Pythagoras theorem applied to this context, D_φ(0, w) = D_φ(0, w_∞) + D_φ(w_∞, w), where w_∞ are the optimal weights, and the vectorial notation of D_φ consists of D_φ(u, v) = Σ_i D_φ(u_i, v_i) (component-wise sum). Then, the idea is to design an iterative algorithm for minimizing D_φ(0, w) over Ω ∋ w. More precisely,

min_H ℓ_φ(H, S) = min_{w∈Ω} D_φ(0, w).   (7.133)
Such an iterative algorithm starts by setting w_1 ← ∇^{-1}φ(0)·1, that is, w_{1,i} = 1/2 ∀i, because ∇^{-1}φ(0) = 1/2. The initial leveraging coefficients are set to zero: α_1 ← 0. The most important part of the algorithm is how to update w_{j+1} given w_j:

w_{j+1,i} ← ∇^{-1}φ( M_i (α_j + δ_j) ),   (7.134)

where M_i is the ith row of M and the increment δ_j is chosen so that

Σ_{i=1}^N M_{it} ∇^{-1}φ( M_i (α_j + δ_j) ) = 0.   (7.135)
(Figure: the algorithm viewed as a sequence of Bregman projections onto Ker M^T, with w_1 = ∇^{-1}φ(0)·1, the iterates w_j, w_{j+1}, the optimal weights w_∞, and the decompositions D_φ(0, w) = D_φ(0, w_∞) + D_φ(w_∞, w) and D_φ(0, w_j) = D_φ(0, w_{j+1}) + D_φ(w_{j+1}, w_j).)
    solve for δ_j such that  Σ_{i=1}^N M_{it} ∇^{-1}φ( M_i (α_j + δ_j) ) = 0   (7.140)
    update the weights w_{j+1,i}
    set α_{j+1} = α_j + δ_j
end
Output: strong classifier H(x) = Σ_{t=1}^T α_t h_t(x)
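A minimal Python sketch of this leveraging scheme is given below. It is our illustration, under assumptions that are not in the text: decision stumps as weak classifiers on toy data, the logistic generator φ_Q (so that ∇^{-1}φ is the sigmoid and w_{1,i} = 1/2), greedy selection of the weak classifier with the largest weighted edge at each round, and a one-dimensional root search to enforce condition (7.135) on the chosen coordinate.

```python
import numpy as np
from scipy.optimize import brentq

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))        # inverse gradient of phi_Q

# Toy data and decision-stump weak classifiers (assumptions for illustration).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)    # labels in {-1, +1}

thresholds = np.linspace(-1.5, 1.5, 7)
H = np.column_stack([np.where(X[:, d] > th, 1, -1)
                     for d in range(2) for th in thresholds])   # h_t(x_i), N x T
M = -(y[:, None] * H)                                # M_{it} = -y_i h_t(x_i), Eq. (7.130)
N, T = M.shape

alpha = np.zeros(T)                                  # alpha_1 <- 0
for _ in range(30):
    w = sigmoid(M @ alpha)                           # w_i = inverse-gradient of phi at M_i alpha
    edges = M.T @ w                                  # weighted edges of the weak classifiers
    t = int(np.argmax(np.abs(edges)))                # pick the "most useful" weak classifier

    # Condition (7.135) restricted to coordinate t: find delta with f(delta) = 0.
    f = lambda d: M[:, t] @ sigmoid(M @ alpha + d * M[:, t])
    if f(-20.0) * f(20.0) > 0:                       # no sign change: skip this round
        continue
    delta = brentq(f, -20.0, 20.0)
    alpha[t] += delta                                # leverage the chosen weak classifier

H_strong = np.sign(H @ alpha)                        # H(x) = sum_t alpha_t h_t(x)
print("training accuracy:", (H_strong == y).mean())
```

Swapping the sigmoid for the inverse gradient of another permissible generator changes only the weight map, which is the point of the Bregman view of boosting.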
In a decision tree, each sequence of decisions (test results) yields a given label. As the tree induces a partition of S into subsets S_l (the examples arriving at each leaf l), the probability that the output class of the tree is +1 is the proportion between the samples of S_l labeled as +1 (S_l^+) and |S_l|, and the same rationale applies to the probability of class −1. This is the basis of the weak classifiers associated with the leaves of the tree, and this rationale leads to LDS or Linear Decision Tree. To apply the concept of ULS here, we must estimate the leveraging coefficients for each node. In this case, these coefficients can be obtained from

α_l = (1/h_l) ( ∇φ( |S_l^+| / |S_l| ) − Σ_{t∈P_l} α_t h_t ),   (7.141)
P_l being the set of classifiers belonging to the path between the root and leaf l. Then, for an observation x reaching leaf l, we have that H(x) = ∇φ(|S_l^+|/|S_l|). Therefore ∇^{-1}φ(H(x)) = |S_l^+|/|S_l|, and the surrogate is defined as follows:
R_φ(H, S) = Σ_{i=1}^N F_φ(y*_i H(x_i)) = Σ_{l∈∂H} |S_l| ( −φ( |S_l^+| / |S_l| ) ),   (7.142)
where ∂H denotes the leaves of the tree. When H is the current tree and H′ is the new one obtained after the expansion of a given leaf, the leveraging coefficients of the new leaves may be computed at each iteration given those of the interior nodes.
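The snippet below is illustrative: the Bernoulli-entropy generator φ(p) = p log p + (1 − p) log(1 − p) is our choice of a permissible φ, and the leaf counts are made up. It evaluates the tree surrogate of Eq. (7.142) from the per-leaf counts |S_l^+| and |S_l|, and shows that expanding a mixed leaf into purer children lowers it.

```python
import numpy as np

def phi_bernoulli(p):
    """Permissible generator phi(p) = p log p + (1 - p) log(1 - p) (<= 0)."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)          # avoid log(0) at pure leaves
    return p * np.log(p) + (1.0 - p) * np.log(1.0 - p)

def tree_surrogate(leaf_counts, phi=phi_bernoulli):
    """R_phi(H, S) = sum_l |S_l| * (-phi(|S_l^+| / |S_l|)), Eq. (7.142)."""
    return sum(n * (-phi(pos / n)) for pos, n in leaf_counts)

# Hypothetical leaves as (|S_l^+|, |S_l|) pairs: two fairly pure, one mixed.
leaves = [(48, 50), (3, 40), (22, 40)]
print("surrogate risk of the tree:", tree_surrogate(leaves))

# Expanding the mixed leaf into two purer children lowers the surrogate.
leaves_expanded = [(48, 50), (3, 40), (20, 24), (2, 16)]
print("after expanding the mixed leaf:", tree_surrogate(leaves_expanded))
```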
Problems
7.1 Rare classes and Twenty Questions
The purpose of “Twenty Questions” (TQ) is to find a testing strategy which
performs the minimum number of tests for finding the true class (value of
Y ). To do so, the basic idea is to find the test dividing the population into masses as equal as possible.
1. Given the sequence of questions Qt , check that this strategy is consistent
with finding the test Xt+1 maximizing H(Y |Qt , Xt+1 ).
2. What is the result of applying the latter strategy to the data in
Table 7.2?
7.2 Working with simple tags
1. Given the image examples and tags of Fig. 7.3, extract all the binary relationships (the tag number may be repeated in the relationship, for instance 5 ↑ 5) and obtain the classification tree.
2. Compare the results, in terms of complexity of the tree and classification
performance with the case of associating single tags to each test.
7.3 Randomization and multiple trees
1. Given the image examples and tags of Fig. 7.3, and having extracted B,
use randomization and the multiple-tree methodology to learn shallow trees.
2. Is the number of examples enough to learn all posteriors? If not, expand the training set conveniently.
7.4 Gini index vs. entropy
An alternative measure to entropy for quantifying node impurity is the Gini
index defined as

G(Y) = (1/2) [ 1 − Σ_{y∈Y} P²(Y = y) ].
This measure is used, for instance, during the growing of RFs. Considering the
case of two categories (classes), compare both analytically and experimentally
this measure with entropy.
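As a starting point for the experimental part, a small sketch is added here (ours, not part of the original problem) comparing the two impurity measures for a binary class variable.

```python
import numpy as np

p = np.linspace(1e-6, 1.0 - 1e-6, 501)                # P(Y = 1) for a binary class variable
gini = 0.5 * (1.0 - (p**2 + (1.0 - p)**2))            # G(Y) for two classes
entropy = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

# Both impurity measures are concave, vanish at p in {0, 1} and peak at p = 1/2.
print("max Gini:", gini.max(), "at p =", p[gini.argmax()])
print("max entropy:", entropy.max(), "at p =", p[entropy.argmax()])
```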
7.5 Breiman’s conjecture
When defining RFs, Breiman conjectures that Adaboost emulates RFs at
the later stages. Relate this conjecture with the probability of the kth set of
weights.
7.6 Weak dependence of multiple trees
In the text it is claimed that randomization yields weak statistical depen-
dency among the trees. Given the trees grown in the previous problem, check
the claim by computing both the conditional variances ν_c and the sum of conditional covariances for all classes γ_c, which are defined as follows:

ν_c = (1/K) Σ_{k=1}^K Σ_{d=1}^C Var( μ_{T_k}(d) | Y = c ),
γ_c = (1/K²) Σ_{k≠p} Σ_{d=1}^C Cov( μ_{T_k}(d), μ_{T_p}(d) | Y = c ).
Calculate again the probability that the vehicle which has gears, 4
wheels and 4 seats is a car, and the probability that it is a motorbike.
3. Has the new feature helped the model to improve the classification?
4. Why is λ8 = 0?
5. Is the Maximum Entropy classifier suitable for real-valued features?
How would you add the maximum speed of the vehicle as a feature?
References
15. A.J. Bell and T.J. Sejnowski, Edges are the independent components of natural
scenes, Advances on Neural Information Processing Systems (NIPS), 8, (1996),
831–837.
16. A.J. Bell, The co-information lattice, Proceedings of the International Work-
shop on Independent Component Analysis and Blind Signal Separation, Nara,
Japan, 2003, pp. 921–926.
17. A. Berger, The improved iterative scaling algorithm: a gentle introduction,
Technical report, Carnegie Mellon University, 1997.
18. A. Berger, Convexity, maximum likelihood and all that, Carnegie Mellon
University, Pittsburgh, PA, 1998.
19. A. Berger, S. Della Pietra, and V. Della Pietra, A maximum entropy approach to natural language processing, Computational Linguistics 22 (1996), 1, 39–71.
20. D. Bertsekas, Convex analysis and optimization, Athena Scientific, Nashua,
NH, 2003.
21. D.J. Bertsimas and G. Van Ryzin, An asymptotic determination of the min-
imum spanning tree and minimum matching constants in geometrical proba-
bility, Operations Research Letters 9 (1990), 1, 223–231.
22. P.J. Besl and N.D. McKay, A method for registration of 3D shapes, IEEE
Transactions on Pattern Analysis and Machine Intelligence 14 (1992), 2,
239–256.
23. A. Blake and M. Isard, Active contours, Springer, New York, 1998.
24. A. Blum and P. Langley, Selection of relevant features and examples in machine
learning, Artificial Intelligence 97 (1997), 1–2, 245–271.
25. B. Bonev, F. Escolano, and M. Cazorla, Feature selection, mutual informa-
tion, and the classification of high-dimensional patterns, Pattern Analysis and
Applications, 11 (2008) 309–319.
26. F.L. Bookstein, Principal warps: thin plate splines and the decomposition of
deformations, IEEE Transactions on Pattern Analysis and Machine Intelligence
11 (1989), 6, 567–585.
27. A. Bosch, A. Zisserman, and X. Muñoz, Image classification using random
forests and ferns, Proceedings of the International Conference on Computer
Vision, Rio de Janeiro, Brazil, 2007, pp. 1–8.
28. A. Bosch, A. Zisserman, and X. Muñoz, Scene classification using a hybrid
generative/discriminative approach, IEEE Transactions on Pattern Analysis
and Machine Intelligence 30 (2008), 4, 1–16.
29. L.M. Bregman, The relaxation method of finding the common point of convex
sets and its application to the solution of problems in convex programming,
USSR Computational Mathematics and Mathematical Physics 7 (1967), 200–217.
30. L. Breiman, Random forests, Machine Learning 1 (2001), 45, 5–32.
31. L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and regression
trees, Wadsworth, Belmont, CA, 1984.
32. J.E. Burr, Properties of cross-entropy minimization, IEEE Transactions on
Information Theory 35 (1989), 3, 695–698.
33. M. Bǎdoiu and K.-L. Clarkson, Optimal core-sets for balls, Computational
Geometry: Theory and Applications 40 (2008), 1, 14–22.
34. X. Calmet and J. Calmet, Dynamics of the Fisher information metric, Physical
Review E 71 (2005), 056109.
35. J.-F. Cardoso and A. Souloumiac, Blind beamforming for non Gaussian signals,
IEE Proceedings-F 140 (1993), 6, 362–370.
36. M.A. Cazorla, F. Escolano, D. Gallardo, and R. Rizo, Junction detection and
grouping with probabilistic edge models and Bayesian A∗ , Pattern Recognition
9 (2002), 35, 1869–1881.
37. T.F. Chan and L. Vese, An active contour model without edges, Proceedings
of International Conference Scale-Space Theories in Computer Vision, Corfu,
Greece, 1999, pp. 141–151.
38. X. Chen and A.L. Yuille, Adaboost learning for detecting and reading text in
city scenes, Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, Washington DC, USA, 2004.
39. X. Chen and A.L. Yuille, Time-efficient cascade for real time object detection,
1st International Workshop on Computer Vision Applications for the Visually
Impaired. Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, Washington DC, USA, 2004.
40. H. Chui and A. Rangarajan, A new point matching algorithm for nonrigid
registration, Computer Vision and Image Understanding 89 (2003), 114–141.
41. P. Comon, Independent component analysis, a new concept? Signal Processing
36 (1994), 287–314.
42. J.M. Coughlan and A. L. Yuille, Bayesian A∗ tree search with expected o(n)
node expansions: applications to road tracking, Neural Computation 14 (2002),
1929–1958.
43. T. Cover and J. Thomas, Elements of information theory, Wiley, New York
1991.
44. I. Csiszár, A geometric interpretation of Darroch and Ratcliff’s generalized
iterative scaling, The Annals of Statistics 17 (1989), 3, 1409–1413.
45. I. Csiszár, I-divergence geometry of probability distributions and minimization
problems, Annals of Probability 3 (1975), 1, 146–158.
46. S. Dalal and W. Hall, Approximating priors by mixtures of natural conjugate
priors, Journal of the Royal Statistical Society(B) 45 (1983), 1.
47. J.N. Darroch and D. Ratcliff, Generalized iterative scaling for log-linear models,
Annals of Mathematical Statistics 43 (1972), 1470–1480.
48. P. Dellaportas and I. Papageorgiou, Multivariate mixtures of normals with unknown number of components, Statistics and Computing 16 (2006), 1, 57–68.
49. A. Dempster, N. Laird, and D. Rubin, Maximum likelihood estimation from
incomplete data via the EM algorithm, Journal of the Royal Statistical Society
39 (1977), 1, 1–38.
50. G.L. Donato and S. Belongie, Approximate thin plate spline mappings, Proceed-
ings of the European Conference on Computer Vision, Copenhagen, Denmark,
vol. 2, 2002, pp. 531–542.
51. R. O. Duda and P. E. Hart, Pattern classification and scene analysis, Wiley,
New York, 1973.
52. D. Erdogmus, K.E. Hild II, Y.N. Rao, and J.C. Prı́ncipe, Minimax mutual in-
formation approach for independent component analysis, Neural Computation
16 (2004), 6, 1235–1252.
53. L. Fei-Fei and P. Perona, A Bayesian hierarchical model for learning natural
scene categories, Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, San Diego, USA, vol. 2, 2005, pp. 524–531.
54. D.J. Field, What is the goal of sensory coding? Neural Computation 6 (1994),
559–601.
55. M.A.T. Figueiredo and A.K. Jain, Unsupervised selection and estimation
of finite mixture models, International Conference on Pattern Recognition
(ICPR2000) (Barcelona, Spain), IEEE, 2000.
56. M.A.T. Figueiredo and A.K. Jain, Unsupervised learning of finite mixture
models, IEEE Transactions on Pattern Analysis and Machine Intelligence 24
(2002), 3, 381–399.
57. M.A.T. Figueiredo, J.M.N. Leitao, and A.K. Jain, Adaptive parametrically de-
formable contours, Proceedings of Energy Minimization Methods and Pattern
Recognition (EMMCVPR’97), Venice, Italy, 1997, pp. 35–50.
58. M.A.T. Figueiredo, J.M.N. Leitao, and A.K. Jain, Unsupervised contour rep-
resentation and estimation using B-splines and a minimum description length
criterion, IEEE Transactions on Image Processing 6 (2000), 9, 1075–1087.
59. M.A.T Figueiredo, J.M.N Leitao, and A.K. Jain, On fitting mixture models,
Energy Minimization Methods in Computer Vision and Pattern Recognition.
Lecture Notes in Computer Science 1654 (1999), 1, 54–69.
60. D.H. Fisher, Knowledge acquisition via incremental conceptual clustering,
Machine Learning (1987), 2, 139–172.
61. Y. Freund and R.E. Schapire, A decision-theoretic generalization of on-line
learning and an application to boosting, Journal of Computer and System
Sciences 1 (1997), 55, 119–139.
62. D. Geman and B. Jedynak, Model-based classification trees, IEEE Transactions
on Information Theory 3 (2001), 47, 1075–1082.
63. J. Goldberger, S. Gordon, and H. Greenspan, Unsupervised image-set cluster-
ing using an information theoretic framework, IEEE Transactions on Image
Processing 2 (2006), 449–458.
64. P.J. Green, Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination, Biometrika 4 (1995), 82, 711–732.
65. U. Grenander and M.I. Miller, Representation of knowledge in complex sys-
tems, Journal of the Royal Statistical Society Series B 4 (1994), 56, 569–603.
66. R. Gribonval, From projection pursuit and cart to adaptive discriminant anal-
ysis? IEEE Transactions on Neural Networks 16 (2005), 3, 522–532.
67. P.D. Grünwald, The minimum description length principle, MIT Press,
Cambridge, MA, 2007.
68. S. Gilles, Robust description and matching of images, Ph.D. thesis, University
of Oxford, 1998.
69. I. Guyon and A. Elisseeff, An introduction to variable and feature selection,
Journal of Machine Learning Research (2003), 3, 1157–1182.
70. A. Ben Hamza and H. Krim, Image registration and segmentation by maximiz-
ing the Jensen-Rényi divergence, Lecture Notes in Computer Science, EMM-
CVPR 2003, 2003, pp. 147–163.
71. J. Harris, Algebraic geometry, a first course, Springer-Verlag, New York, 1992.
72. T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Jour-
nal of the Royal Statistical Society(B) 58 (1996), 1, 155–176.
73. A.O. Hero and O. Michel, Asymptotic theory of greedy approximations to minimal k-point random graphs, IEEE Transactions on Information Theory 45
(1999), 6, 1921–1939.
74. A.O. Hero and O. Michel, Applications of spanning entropic graphs, IEEE
Signal Processing Magazine 19 (2002), 5, 85–95.
75. K. Huang, Y. Ma, and R. Vidal, Minimum effective dimension for mixtures
of subspaces: A robust GPCA algorithm and its applications, Computer Vision
and Pattern Recognition Conference (CVPR04), vol. 2, 2004, pp. 631–638.
76. X. Huang, S.Z. Li, and Y. Wang, Jensen-Shannon boosting learning for object
recognition, IEEE Conference on Computer Vision and Pattern Recognition 2
(2005), 144–149.
77. A. Hyvärinen, New approximations of differential entropy for independent com-
ponent analysis and projection pursuit, Technical report, Helsinki University of
Technology, 1997.
78. A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis, Wiley, New York, 2001.
79. A. Hyvärinen and E. Oja, Independent component analysis: algorithms and applications, Neural Networks 13 (2000), 4–5, 411–430.
80. A. Hyvärinen and E. Oja, A fast fixed-point algorithm for independent component analysis, Neural Computation 9 (1997), 7, 1483–1492.
81. A.K. Jain and R. Dubes, Algorithms for clustering data, Prentice Hall, Engle-
wood Cliffs, NJ, 1988.
82. A.K. Jain, R. Dubes, and J. Mao, Statistical pattern recognition: a review,
IEEE Transactions on Pattern Analysis Machine Intelligence 22 (2000), 1,
4–38.
83. E.T. Jaynes, Information theory and statistical mechanics, Physical Review
106 (1957), 4, 620–630.
84. B. Jedynak, H. Zheng, and M. Daoudi, Skin detection using pairwise models,
Image and Vision Computing 23 (2005), 13, 1122–1130.
85. B. Jedynak, H. Zheng, and Daoudi M., Statistical models for skin detection,
Proceedings of IEEE International Conference on Computer Vision and Pat-
tern Recognition (CVPRV’03), Madison, USA, vol. 8, 2003, pp. 92–92.
86. R. Jin, R. Yan, J. Zhang, and A.G. Hauptmann, A faster iterative scaling algo-
rithm for conditional exponential model, Proceedings of the 20th International
Conference on Machine Learning (ICML 2003), Washington, USA, 2003.
87. G.H. John, R. Kohavi, and K. Pfleger, Irrelevant features and the sub-
set selection problem, International Conference on Machine Learning (1994),
pp. 121–129.
88. M.C. Jones and R. Sibson, What is projection pursuit?, Journal of the Royal
Statistical Society. Series A (General) 150 (1987), 1, 1–37.
89. M.J. Jones and J.M. Rehg, Statistical color models with applications to skin
detection, Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition, Ft. Collins, USA, 1999, pp. 1–8.
90. T. Kadir and M. Brady, Estimating statistics in arbitrary regions of interest,
Proceedings of the 16th British Machine Vision Conference, Oxford, UK, Vol. 2,
2005, pp. 589–598.
91. T. Kadir and M. Brady, Scale, saliency and image description, International
Journal on Computer Vision 2 (2001), 45, 83–105.
92. K. Kanatani, Motion segmentation by subspace separation and model selection,
International Conference on Computer Vision (ICCV01), vol. 2, 2001, pp. 586–
591.
93. M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, In-
ternational Journal on Computer Vision (1987), 1, 259–268.
94. R.E. Kass and L. Wasserman, A reference Bayesian test for nested hy-
potheses and its relationship to the Schwarz criterion, Journal of the American
Statistical Association 90 (1995), 928–934.
95. M. Kearns and L. Valiant, Cryptographic limitations on learning Boolean for-
mulae and finite automata, Journal of the ACM 1 (1994), 41, 67–95.
96. R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial
Intelligence 97 (1997), 1–2, 273–324.
97. D. Koller and M. Sahami, Toward optimal feature selection, Proceedings of
International Conference in Machine Learning, 1996, pp. 284–292.
98. S. Konishi, A.L. Yuille, and J.M. Coughlan, A statistical approach to multi-
scale edge detection, Image and Vision Computing 21 (2003), 1, 37–48.
99. S. Konishi, A.L. Yuille, J.M. Coughlan, and S.C. Zhu, Statistical edge detec-
tion: learning and evaluating edge cues, IEEE Transactions on Pattern Analysis
and Machine Intelligence 1 (2003), 25, 57–74.
100. S. Kullback, Information theory and statistics, Wiley, New York, 1959.
101. M.H.C. Law, M.A.T. Figueiredo, and A.K. Jain, Simultaneous feature selection
and clustering using mixture models, IEEE Transactions on Pattern Analysis
Machine Intelligence 26 (2004), 9, 1154–1166.
102. S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories, Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, New York, USA,
vol. 2, 2006, pp. 2169–2178.
103. N. Leonenko, L. Pronzato, and V. Savani, A class of Rényi information esti-
mators for multidimensional densities, The Annals of Statistics 36 (2008), 5,
2153–2182.
104. J. Lin, Divergence measures based on the Shannon entropy, IEEE Transactions
on Information Theory 1 (1991), 37, 145–151.
105. R. Linsker, Self-organization in a perceptual network, Computer 3 (1988), 21,
105–117.
106. C. Liu and H.Y. Shum, Kullback–Leibler boosting, IEEE Conference on Com-
puter Vision and Pattern Recognition (2003), 407–411.
107. D. Lowe, Distinctive image features from scale-invariant keypoints, Interna-
tional Journal of Computer Vision 60 (2004), 2, 91–110.
108. S. Lyu, Infomax boosting, IEEE Conference on Computer Vision and Pattern
Recognition 1 (2005), 533–538.
109. Y. Ma, A.Y. Yang, H. Derksen, and R. Fossum, Estimation of subspace ar-
rangements with applications in modelling and segmenting mixed data, SIAM
Review 50 (2008), 3, 413–458.
110. J. Matas, O. Chum, U. Martin, and T. Pajdla, Robust wide baseline stereo from
maximally stable extremal regions, Proceedings of the British Machine Vision
Conference, Cardiff, Wales, UK, vol. 1, 2002, pp. 384–393.
111. G. McLachlan, Discriminant analysis and statistical pattern recognition, Wiley,
New York, 1992.
112. G. McLachlan and D. Peel, Finite mixture models, Wiley, 2000.
113. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaf-
falitzky, T. Kadir, and L. Van Gool, A comparison of affine region detectors,
International Journal of Computer Vision 65 (2005), 1–2, 43–72.
114. W. Mio and X. Liu, Proceedings of the IEEE International Conference on Image Processing, Atlanta, USA, 2006, pp. 2113–2116.
153. D.M.J. Tax and R.P.W Duin, Support vector data description, Machine Learn-
ing 54 (2004), 45–66.
154. Z. Tu, X. Chen, A.L. Yuille, and S. Zhu, Image parsing: unifying segmenta-
tion, detection, and recognition, International Journal of Computer Vision 63
(2005), 2, 113–140.
155. A. Torsello and D.L. Dowe, Learning a generative model for structural represen-
tations, Proceedings of the Australasian Conference on Artificial Intelligence,
Auckland, New Zealand, 2008, pp. 573–583.
156. A. Torsello and E.R. Hancock, Learning shape-classes using a mixture of tree-
unions, IEEE Transactions on Pattern Analysis and Machine Intelligence 28
(2006), 6, 954–967.
157. I.W. Tsang, A. Kocsor, and J.T. Kwok, Simpler core vector machines with
enclosing balls, International Conference on Machine Learning, ICML 2007,
2007, pp. 911–918.
158. I.W. Tsang, J.T. Kwok, and P.-M. Cheung, Core vector machines: fast SVM training on very large data sets, Journal of Machine Learning Research 6 (2005),
363–392.
159. Z. Tu and S.-C. Zhu, Image segmentation by data-driven Markov chain Monte
Carlo, IEEE Transactions on Pattern Analysis and Machine Intelligence 24
(2002), 5, 657–673.
160. M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive
Neuroscience 3 (1991), 1, 71–86.
161. N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, SMEM algorithm for
mixture models, Neural Computation 12 (2000), 1, 2109–2128.
162. G. Unal, H. Krim, and A. Yezzi, Fast incorporation of optical flow into active
polygons, IEEE Transactions on Image Processing 6 (2005), 14, 745–759.
163. G. Unal, A. Yezzi, and H. Krim, Information-theoretic active polygons for
unsupervised texture segmentation, International Journal on Computer Vision
3 (2005), 62, 199–220.
164. M.J. van der Laan, Statistical inference for variable importance, International
Journal of Biostatistics 2 (2006), 1, 1008–1008.
165. V. N. Vapnik, Statistical learning theory, Wiley, New York, 1998.
166. N. Vasconcelos and M. Vasconcelos, Scalable discriminant feature selection for
image retrieval and recognition, Computer Vision and Pattern Recognition
Conference (CVPR04), 2004, pp. 770–775.
167. M.A. Vicente, P.O. Hoyer, and A. Hyvarinen, Equivalence of some common
linear feature extraction techniques for appearance-based object recognition
tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 29
(2007), 5, 233–254.
168. R. Vidal, Y. Ma, and J. Piazzi, A new GPCA algorithm for clustering sub-
spaces by fitting, differentiating and dividing polynomials, Computer Vision
and Pattern Recognition Conference (CVPR04), 2004.
169. R. Vidal, Y. Ma, and S. Sastry, Generalized principal component analysis
(GPCA), Computer Vision and Pattern Recognition Conference (CVPR03),
vol. 1, 2003, pp. 621–628.
170. P. Viola and M.J. Jones, Robust real-time face detection, International Journal
of Computer Vision 2 (2004), 57, 137–154.
171. P. Viola and W.M. Wells III, Alignment by maximization of mutual information, International Journal of Computer Vision 24 (1997), 2, 137–154.
Fig. 3.13. The solution space in which the K-adventurers algorithm has selected K = 6 representative solutions, which are marked with rectangles. (Energy curves are shown for the pure generative strategy and for the data-driven strategies using edges, the Hough transform, and Hough and edges.)
Fig. 3.16. Skin detection results. Comparison between the baseline model (top-right), the tree approximation of MRFs with BP (bottom-left), and the tree approximation of the first-order model with BP instead of Alg. 5. Figure by B. Jedynak, H. Zheng and M. Daoudi (2003, © IEEE).
Fig. 3.17. A complex image parsing graph with many levels and types of patterns (e.g., a football match scene, a point process). (Figure by Tu et al., 2005, © Springer.)
Fig. 5.10. Color image segmentation results. Original images (first column) and
color image segmentation with different Gaussianity deficiency levels (second and
third columns). (Courtesy of A. Peñalver.)
Fig. 5.17. From left to right and from top to bottom. (a) Image representation. Each ellipsoid represents a Gaussian in the Gaussian Mixture Model of the image, with its support region, mean color and spatial layout in the image plane. (b) Loss of mutual information during the AIB clustering. The last steps are labeled with the number of clusters in each step. (c) Part of the cluster tree formed during AIB, starting from 19 clusters. Each cluster is represented with a representative image. The labeled nodes indicate the order of cluster merging, following the plot in (b). (d) I(T; X) vs. I(T; Y) plot for four different clustering methods (AIB with Color+XY GMM, Color GMM and Color histogram representations, and AHI with Color histogram). (e) Mutual information between images and image representations. (Figure by Goldberger et al., 2006, © IEEE.)
Fig. 6.1. A 3D reconstruction of the route followed during the acquisition of the data set, and examples of each one of the six classes. (Image obtained with 6-DOF SLAM. Figure by Juan Manuel Sáez, 2007, © IEEE.)
Fig. 6.12. Feature selection on the NCI DNA microarray data. The MD (left) and mRMR (right) criteria were used. Features (genes) selected by both criteria are marked with an arrow (genes 135, 2080 and 6145).
Fig. 6.22. Unsupervised segmentation with GPCA. (Figures by Huang et al., © IEEE 2004.)
Fig. 7.7. Left: appearance and shape histograms computed at different levels of the pyramid (the number of bins of each histogram depends on the level of the pyramid where it is computed). Right: several ROIs learnt for different categories in the Caltech-256 database (256 categories). (Figure by A. Bosch, A. Zisserman and X. Muñoz [27], 2007, © IEEE.)
Fig. 7.13. Top: SVDD using different polynomial kernels (degrees d = 1, 3, 6). Bottom: using different Gaussian kernels (variances σ = 1, 5, 15), for C = 25.0 and C = 0.1. In both cases, the circles denote the support vectors. Figure by D. M. J. Tax and R. P. W. Duin [153] (© Elsevier 2004).