
Francisco Escolano • Pablo Suau
Boyán Bonev

Information Theory
in Computer Vision
and Pattern Recognition
Foreword by
Alan Yuille

Francisco Escolano
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante
Spain
[email protected]

Boyán Bonev
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante
Spain
[email protected]

Pablo Suau
Universidad Alicante
Depto. Ciencia de la Computación e Inteligencia Artificial
Campus de San Vicente, s/n
03080 Alicante
Spain
[email protected]

ISBN 978-1-84882-296-2 e-ISBN 978-1-84882-297-9


DOI 10.1007/978-1-84882-297-9
Springer Dordrecht Heidelberg London New York

British Library Cataloguing in Publication Data


A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2009927707

© Springer-Verlag London Limited 2009
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as
permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced,
stored or transmitted, in any form or by any means, with the prior permission in writing of the publish-
ers, or in the case of reprographic reproduction in accordance with the terms of licenses issued by the
Copyright Licensing Agency. Enquiries concerning reproduction outside those terms should be sent to
the publishers.
The use of registered names, trademarks, etc., in this publication does not imply, even in the absence of a
specific statement, that such names are exempt from the relevant laws and regulations and therefore free
for general use.
The publisher makes no representation, express or implied, with regard to the accuracy of the information
contained in this book and cannot accept any legal responsibility or liability for any errors or omissions
that may be made.

Cover design: SPi Publisher Services

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


To my three joys: Irene, Ana, and Mamen.
Francisco

To my parents, grandparents, and brother. To Beatriz.


Pablo

To Niya, Elina, and Maria.


Boyan
Foreword

Computer vision and pattern recognition are extremely important research
fields with an enormous range of applications. They are also extremely diffi-
cult. This may seem paradoxical since humans can easily interpret images and
detect spatial patterns. But this apparent ease is misleading because neuro-
science shows that humans devote a large part of their brain, possibly up to
50% of the cortex, to processing images and interpreting them. The difficulties
of these problems have been appreciated over the last 30 years as researchers
have struggled to develop computer algorithms for performing vision and pat-
tern recognition tasks. Although these problems are not yet completely solved,
it is becoming clear that the final theory will depend heavily on probabilistic
techniques and the use of concepts from information theory.
The connections between information theory and computer vision have
long been appreciated. Vision can be considered to be a decoding problem
where the encoding of the information is performed by the physics of the
world – by light rays striking objects and being reflected to cameras or eyes.
Ideal observer theories were pioneered by scientists such as Horace Barlow
to compute the amount of information available in the visual stimuli, and to
see how efficient humans are at exploiting it. But despite the application of
information theory to specific visual tasks, there has been no attempt to bring
all this work together into a clear conceptual framework.
This book fills the gap by describing how probability and information
theory can be used to address computer vision and pattern recognition prob-
lems. The authors have developed information theory tools side by side with
vision and pattern recognition tasks. They have characterized these tools into
four classes: (i) measures, (ii) principles, (iii) theories, and (iv) algorithms.
The book is organized into chapters addressing computer vision and pattern
recognition tasks at increasing levels of complexity. The authors have de-
voted chapters to feature detection and spatial grouping, image segmentation,
matching, clustering, feature selection, and classifier design. As the authors
address these topics, they gradually introduce techniques from information
theory. These include (1) information theoretic measures, such as entropy
and Chernoff information, to evaluate image features; (2) mutual informa-
tion as a criterion for matching problems (Viola and Wells 1997); (3) minimal
description length ideas (Rissanen 1978) and their application to image seg-
mentation (Zhu and Yuille 1996); (4) independent component analysis (Bell
and Sejnowski 1995) and its use for feature extraction; (5) the use of rate
distortion theory for clustering algorithms; (6) the method of types (Cover
and Thomas 1991) and its application to analyze the convergence rates of
vision algorithms (Coughlan and Yuille 2002); and (7) how entropy and in-
fomax principles (Linsker 1988) can be used for classifier design. In addition,
the book covers alternative information theory measures, such as Rényi alpha-
entropy and Jensen–Shannon divergence, and advanced topics, such as data-
driven Markov chain Monte Carlo (Tu and Zhu 2002) and information geo-
metry (Amari 1985). The book describes these theories clearly, giving many
illustrations and specifying the code by flowcharts.
Overall, the book is a very worthwhile addition to the computer vision
and pattern recognition literature. The authors have given an advanced in-
troduction to techniques from probability and information theory and their
application to vision and pattern recognition tasks. More importantly, they
have described a novel perspective that will be of growing importance over
time. As computer vision and pattern recognition develop, the details of these
theories will change, but the underlying concepts will remain the same.

Alan Yuille
UCLA, Department of Statistics and Psychology
Los Angeles, CA
March 2009
Preface

Looking through the glasses of Information Theory (IT) has proved to be
effective both for formulating and designing algorithmic solutions to many
problems in computer vision and pattern recognition (CVPR): image match-
ing, clustering and segmentation, salient point detection, feature selection and
dimensionality reduction, projection pursuit, optimal classifier design, and
many others. Nowadays, researchers are widely bringing IT elements to the
CVPR arena. Among these elements, there are measures (entropy, mutual in-
formation, Kullback–Leibler divergence, Jensen–Shannon divergence...), prin-
ciples (maximum entropy, minimax entropy, minimum description length...)
and theories (rate distortion theory, coding, the method of types...).
This book introduces and explores the latter elements, together with
the one of entropy estimation, through an incremental complexity approach.
Simultaneously, the main CVPR problems are formulated and the most
representative algorithms, chosen according to the authors' preferences for
sketching the IT–CVPR field, are presented. Interesting connections between
IT elements when applied to different problems are highlighted, seeking a
basic/skeletal research roadmap. This roadmap is far from being comprehensive at present
due to time and space constraints, and also due to the current state of devel-
opment of the approach. The result is a novel tool, unique in its conception,
both for CVPR and IT researchers, which is intended to contribute, as much
as possible, to a cross-fertilization of both areas.
The motivation and origin of this manuscript is our awareness of the ex-
istence of many sparse sources of IT-based solutions to CVPR problems, and
the lack of a systematic text that focuses on the important question: How
useful is IT for CVPR? At the same time, we needed a research language,
common to all the members of the Robot Vision Group. Energy minimization,
graph theory, and Bayesian inference, among others, were adequate method-
ological tools during our daily research. Consequently, these tools were key to
design and build a solid background for our Ph.D. students. Soon we realized
that IT was a unifying component that flowed naturally among our rationales
for tackling CVPR problems. Thus, some of us enrolled in the task of writing
a text in which we could advance as much as possible in the fundamental links
between CVPR and IT. Readers (starters and senior researchers) will judge
to what extent we have both answered the above fundamental question and
reached our objectives.
Although the text is addressed to CVPR–IT researchers and students,
it is also open to an interdisciplinary audience. One of the most interesting
examples is the computational vision community, which includes people in-
terested both in biological vision and psychophysics. Other examples are the
roboticians and the people interested in developing wearable solutions for the
visually impaired (which is the subject of our active work in the research
group).
Under its basic conception, this text may be used for an IT-based one
semester course of CVPR. Only some rudiments of algebra and probability
are necessary. IT items will be introduced as the text flows from one computer
vision or pattern recognition problem to another. We have deliberately avoided
a succession of theorem–proof pairs for the sake of a smooth presentation.
Proofs, when needed, are embedded in the text, and they are usually excellent
pretexts for presenting or highlighting interesting properties of IT elements.
Numerical examples with toy settings of the problems are often included for
a better understanding of the IT-based solution. When formal elements of
other branches of mathematics like field theory, optimization, and so on, are
needed, we have briefly presented them and referred to excellent books fully
dedicated to their description.
Problems, questions and exercises are also proposed at the end of each
chapter. The purpose of the problems section is not only to consolidate what
is learnt, but also to go one step forward by testing the ability of generalizing
the concepts exposed in each chapter. That section is preceded by a brief
literature review that outlines the key papers for the CVPR topic that is
the subject of the chapter. These references, together with sketched
solutions to the problems, will be progressively made accessible on the Web site
http://www.rvg.ua.es/ITinCVPR.
We have started the book with a brief introduction (Chapter 1) regarding
the four axes of IT–CVPR interaction (measures, principles, theories, and en-
tropy estimators). We have also presented here the skeletal research roadmap
(the ITinCVPR tube). Then we walk along six chapters, each one tackling a
different problem under the IT perspective. Chapter 2 is devoted to interest
points, edge detection, and grouping; interest points allow us to introduce the
concept of entropy and its linking with Chernoff information, Sanov’s theo-
rem, phase transitions and the method of types. Chapter 3 covers contour
and region-based image segmentation mainly from the perspective of model
order selection through the minimum description length (MDL) principle, al-
though the Jensen–Shannon measure and the Jaynes principle of maximum
entropy are also introduced; the question of learning a segmentation model is
tackled through links with maximum entropy and belief propagation; and the
unification of generative and discriminative processes for segmentation and
recognition is explored through information divergence measures. Chapter 4
reviews registration, matching, and recognition by considering the following:
image registration through minimization of mutual information and re-
lated measures; alternative derivations of Jensen–Shannon divergence yield
deformable matching; shape comparison is encompassed through Fisher infor-
mation; and structural matching and learning are driven by MDL. Chapter 5
is devoted to image and pattern clustering and is mainly rooted in three IT ap-
proaches to clustering: Gaussian mixtures (incremental method for adequate
order selection), information bottleneck (agglomerative and robust with model
order selection) and mean-shift; IT is also present in initial proposals for en-
sembles clustering (consensus finding). Chapter 6 reviews the main approaches
to feature selection and transformation: simple wrappers and filters exploit-
ing IT for bypassing the curse of dimensionality; minimax entropy principle
for learning patterns using a generative approach; and ICA/gPCA methods
based on IT (ICA and neg-entropy, info-max and minimax ICA, generalized
PCA and effective dimension). Finally, Chapter 7, Classifier Design, analyzes
the main IT strategies for building classifiers. This obviously includes decision
trees, but also multiple trees and random forests, and how to improve boost-
ing algorithms by means of IT-based criteria. This final chapter ends with an
information projection analysis of maximum entropy classifiers and a careful
exploration of the links between Bregman divergences and classifiers.
We acknowledge the contribution of many people to this book. In the first
place, we thank many scientists for their guidance and support, and for their im-
portant contributions to the field. Researchers from different universities and
institutions such as Alan Yuille, Hamid Krim, Chen Ping-Feng, Gozde Unal,
Ajit Rajwadee, Anand Rangarajan, Edwin Hancock, Richard Nock, Shun-ichi
Amari, and Mario Figueiredo, among many others, contributed with their ad-
vice, deep knowledge, and highly qualified expertise. We also thank all the
colleagues of the Robot Vision Group of the University of Alicante, especially
Antonio Peñalver, Juan Manuel Sáez, and Miguel Cazorla, who contributed
with figures, algorithms, and important results from their research. Finally,
we thank the editorial board staff: Catherine Brett for her initial encourage-
ment and support, and Simon Rees and Wayne Wheeler for their guidance
and patience.

University of Alicante, Spain
Francisco Escolano
Pablo Suau
Boyan Bonev
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Measures, Principles, Theories, and More . . . . . . . . . . . . . . . . . . . 1
1.2 Detailed Organization of the Book . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 The ITinCVPR Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Interest Points, Edges, and Contour Grouping . . . . . . . . . . . . . 11


2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Entropy and Interest Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Kadir and Brady Scale Saliency Detector . . . . . . . . . . . . . 12
2.2.2 Point Filtering by Entropy Analysis Through
Scale Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Chernoff Information and Optimal Filtering . . . . . . . . . . 16
2.2.4 Bayesian Filtering of the Scale Saliency Feature
Extractor: The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Information Theory as Evaluation Tool: The Statistical
Edge Detection Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Statistical Edge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Edge Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.4 Finding Contours Among Clutter . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.2 A∗ Road Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.3 A∗ Convergence Proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Junction Detection and Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.1 Junction Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.2 Connecting and Filtering Junctions . . . . . . . . . . . . . . . . . . 35
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6 Key References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3 Contour and Region-Based Image Segmentation . . . . . . . . . . . 43


3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.2 Discriminative Segmentation with
Jensen–Shannon Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2.1 The Active Polygons Functional . . . . . . . . . . . . . . . . . . . . . 44


3.2.2 Jensen–Shannon Divergence and the Speed Function . . . 46
3.3 MDL in Contour-Based Segmentation . . . . . . . . . . . . . . . . . . . . . . 53
3.3.1 B-Spline Parameterization of Contours . . . . . . . . . . . . . . . 53
3.3.2 MDL for B-Spline Parameterization . . . . . . . . . . . . . . . . . 58
3.3.3 MDL Contour-based Segmentation . . . . . . . . . . . . . . . . . . 60
3.4 Model Order Selection in Region-Based Segmentation . . . . . . . . 63
3.4.1 Jump-Diffusion for Optimal Segmentation . . . . . . . . . . . . 63
3.4.2 Speeding-up the Jump-Diffusion Process . . . . . . . . . . . . . 71
3.4.3 K-adventurers Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5 Model-Based Segmentation Exploiting The Maximum
Entropy Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.5.1 Maximum Entropy and Markov Random Fields . . . . . . . 79
3.5.2 Efficient Learning with Belief Propagation . . . . . . . . . . . . 83
3.6 Integrating Segmentation, Detection and Recognition . . . . . . . . 86
3.6.1 Image Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.6.2 The Data-Driven Generative Model . . . . . . . . . . . . . . . . . . 91
3.6.3 The Power of Discriminative Processes . . . . . . . . . . . . . . . 96
3.6.4 The Usefulness of Combining Generative
and Discriminative . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.7 Key References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4 Registration, Matching, and Recognition . . . . . . . . . . . . . . . . . . 105


4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.2 Image Alignment and Mutual Information . . . . . . . . . . . . . . . . . . 106
4.2.1 Alignment and Image Statistics . . . . . . . . . . . . . . . . . . . . . 106
4.2.2 Entropy Estimation with Parzen’s Windows . . . . . . . . . . 108
4.2.3 The EMMA Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.2.4 Solving the Histogram-Binning Problem . . . . . . . . . . . . . . 111
4.3 Alternative Metrics for Image Alignment . . . . . . . . . . . . . . . . . . . 119
4.3.1 Normalizing Mutual Information . . . . . . . . . . . . . . . . . . . . 119
4.3.2 Conditional Entropies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3.3 Extension to the Multimodal Case . . . . . . . . . . . . . . . . . . . 121
4.3.4 Affine Alignment of Multiple Images . . . . . . . . . . . . . . . . . 122
4.3.5 The Rényi Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.3.6 Rényi’s Entropy and Entropic Spanning Graphs . . . . . . . 126
4.3.7 The Jensen–Rényi Divergence and Its Applications . . . . 128
4.3.8 Other Measures Related to Rényi Entropy . . . . . . . . . . . . 129
4.3.9 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.4 Deformable Matching with Jensen Divergence
and Fisher Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
4.4.1 The Distributional Shape Model . . . . . . . . . . . . . . . . . . . . . 132
4.4.2 Multiple Registration and Jensen–Shannon
Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

4.4.3 Information Geometry and Fisher–Rao Information . . . . 140


4.4.4 Dynamics of the Fisher Information Metric . . . . . . . . . . . 143
4.5 Structural Learning with MDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
4.5.1 The Usefulness of Shock Trees . . . . . . . . . . . . . . . . . . . . . . 146
4.5.2 A Generative Tree Model Based on Mixtures . . . . . . . . . . 147
4.5.3 Learning the Mixture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
4.5.4 Tree Edit-Distance and MDL . . . . . . . . . . . . . . . . . . . . . . . 151
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
4.6 Key References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

5 Image and Pattern Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.2 Gaussian Mixtures and Model Selection . . . . . . . . . . . . . . . . . . . . 157
5.2.1 Gaussian Mixtures Methods . . . . . . . . . . . . . . . . . . . . . . . . 157
5.2.2 Defining Gaussian Mixtures . . . . . . . . . . . . . . . . . . . . . . . . 158
5.2.3 EM Algorithm and Its Drawbacks . . . . . . . . . . . . . . . . . . . 159
5.2.4 Model Order Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
5.3 EBEM Algorithm: Exploiting Entropic Graphs . . . . . . . . . . . . . . 162
5.3.1 The Gaussianity Criterion and Entropy Estimation . . . . 162
5.3.2 Shannon Entropy from Rényi Entropy Estimation . . . . . 163
5.3.3 Minimum Description Length for EBEM . . . . . . . . . . . . . 166
5.3.4 Kernel-Splitting Equations . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.3.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
5.4 Information Bottleneck and Rate Distortion Theory . . . . . . . . . 170
5.4.1 Rate Distortion Theory Based Clustering . . . . . . . . . . . . . 170
5.4.2 The Information Bottleneck Principle . . . . . . . . . . . . . . . . 173
5.5 Agglomerative IB Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.5.1 Jensen–Shannon Divergence and Bayesian
Classification Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.5.2 The AIB Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
5.5.3 Unsupervised Clustering of Images . . . . . . . . . . . . . . . . . . 181
5.6 Robust Information Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
5.7 IT-Based Mean Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.7.1 The Mean Shift Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 189
5.7.2 Mean Shift Stop Criterion and Examples . . . . . . . . . . . . . 191
5.7.3 Rényi Quadratic and Cross Entropy from Parzen
Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
5.7.4 Mean Shift from an IT Perspective . . . . . . . . . . . . . . . . . . 196
5.8 Unsupervised Classification and Clustering Ensembles . . . . . . . . 197
5.8.1 Representation of Multiple Partitions . . . . . . . . . . . . . . . 198
5.8.2 Consensus Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
5.9 Key References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209

6 Feature Selection and Transformation . . . . . . . . . . . . . . . . . . . . . 211


6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
6.2 Wrapper and the Cross Validation Criterion . . . . . . . . . . . . . . . . 212
6.2.1 Wrapper for Classifier Evaluation . . . . . . . . . . . . . . . . . . . . 212
6.2.2 Cross Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.2.3 Image Classification Example . . . . . . . . . . . . . . . . . . . . . . . 215
6.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.3 Filters Based on Mutual Information . . . . . . . . . . . . . . . . . . . . . . . 220
6.3.1 Criteria for Filter Feature Selection . . . . . . . . . . . . . . . . . . 220
6.3.2 Mutual Information for Feature Selection . . . . . . . . . . . . . 222
6.3.3 Individual Features Evaluation, Dependence
and Redundancy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
6.3.4 The min-Redundancy Max-Relevance Criterion . . . . . . . 225
6.3.5 The Max-Dependency Criterion . . . . . . . . . . . . . . . . . . . . . 227
6.3.6 Limitations of the Greedy Search . . . . . . . . . . . . . . . . . . . . 228
6.3.7 Greedy Backward Search . . . . . . . . . . . . . . . . . . . . . . . . . . 231
6.3.8 Markov Blankets for Feature Selection . . . . . . . . . . . . . . . 234
6.3.9 Applications and Experiments . . . . . . . . . . . . . . . . . . . . . . 236
6.4 Minimax Feature Selection for Generative Models . . . . . . . . . . . 238
6.4.1 Filters and the Maximum Entropy Principle . . . . . . . . . . 238
6.4.2 Filter Pursuit through Minimax Entropy . . . . . . . . . . . . . 242
6.5 From PCA to gPCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
6.5.1 PCA, FastICA, and Infomax . . . . . . . . . . . . . . . . . . . . . . . . 244
6.5.2 Minimax Mutual Information ICA . . . . . . . . . . . . . . . . . . . 250
6.5.3 Generalized PCA (gPCA) and Effective Dimension . . . . 254
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
6.6 Key References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

7 Classifier Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271


7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
7.2 Model-Based Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
7.2.1 Reviewing Information Gain . . . . . . . . . . . . . . . . . . . . . . . . 272
7.2.2 The Global Criterion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
7.2.3 Rare Classes with the Greedy Approach . . . . . . . . . . . . . . 275
7.2.4 Rare Classes with Global Optimization . . . . . . . . . . . . . . . 280
7.3 Shape Quantization and Multiple Randomized Trees . . . . . . . . . 284
7.3.1 Simple Tags and Their Arrangements . . . . . . . . . . . . . . . . 284
7.3.2 Algorithm for the Simple Tree . . . . . . . . . . . . . . . . . . . . . . 285
7.3.3 More Complex Tags and Arrangements . . . . . . . . . . . . . . 287
7.3.4 Randomizing and Multiple Trees . . . . . . . . . . . . . . . . . . . . 289
7.4 Random Forests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.4.1 The Basic Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
7.4.2 The Generalization Error of the RF Ensemble . . . . . . . . . 291
7.4.3 Out-of-the-Bag Estimates of the Error Bound . . . . . . . . . 294
7.4.4 Variable Selection: Forest RI vs. Forest-RC . . . . . . . . . . . 295

7.5 Infomax and Jensen–Shannon Boosting . . . . . . . . . . . . . . . . . . . . . 298


7.5.1 The Infomax Boosting Algorithm . . . . . . . . . . . . . . . . . . . . 299
7.5.2 Jensen–Shannon Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . 305
7.6 Maximum Entropy Principle for Classification . . . . . . . . . . . . . . . 308
7.6.1 Improved Iterative Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . 308
7.6.2 Maximum Entropy and Information Projection . . . . . . . . 313
7.7 Bregman Divergences and Classification . . . . . . . . . . . . . . . . . . . . 324
7.7.1 Concept and Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
7.7.2 Bregman Balls and Core Vector Machines . . . . . . . . . . . . 326
7.7.3 Unifying Classification: Bregman Divergences
and Surrogates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
7.8 Key References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353

Color Plates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357


1 Introduction

Shannon’s definition of channel capacity, information and redundancy


was a landmark [...]. Since these measurable quantities were obviously
important to anyone who wanted to understand sensory coding and
perception, I eagerly stepped on the boat.

Horace Basil Barlow1

1.1 Measures, Principles, Theories, and More

The Information Theory (IT) approach has proved to be effective in solving
many Computer Vision and Pattern Recognition (CVPR) problems (image
matching, clustering and segmentation, extraction of invariant interest points,
feature selection, optimal classifier design, model selection, PCA/ICA,2
Projection Pursuit, and many others). The computational analysis of images
and more abstract patterns is a complex and challenging task, which demands
interdisciplinary efforts. Thus, the confluence of Bayesian/Statistical Learn-
ing, Optimization (Energy Minimization), Gibbs/Random Fields, and other
methodologies yields a valuable cross-fertilization that is helpful both for
formulating and solving problems.
Nowadays, researchers are widely exploiting IT elements to formulate and
solve CVPR problems. Among these elements, we find measures, principles,
and theories. Entropy, Mutual Information, and Kullback–Leibler divergence
are well known measures, which are typically used as metrics or as optimiza-
tion criteria. For instance, in ICA it is interesting to find the projection di-
rections maximizing the independence of the outputs, and this is equivalent
to minimizing their mutual information.

1 Redundancy reduction revisited, Network: Comput. Neural Syst. 12 (2001) 241–253.
2 Principal Component Analysis, Independent Component Analysis.

On the other hand, some examples
of IT principles are Minimum Description Length (MDL), and the Minimax
Entropy principles. The first one, enunciated by Rissanen, deals with the selec-
tion of the simplest model in terms of choosing the shortest code (or number
of parameters), explaining the data well enough. For example, when facing
the region segmentation problem, it is convenient to partition the image in as
few regions as possible, provided that the statistics of each of them may be
explained by a reduced set of parameters.
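For reference, the measures and the principle just mentioned can be stated in their standard discrete forms (these are textbook definitions, recalled here only as a quick reminder; the book develops and generalizes them in later chapters):

\[
H(X) = -\sum_{x} p(x)\log p(x), \qquad
I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} = H(X)-H(X\mid Y),
\]
\[
D_{\mathrm{KL}}(p\,\|\,q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}, \qquad
M^{*}_{\mathrm{MDL}} = \arg\min_{M}\ \big\{ L(M) + L(D\mid M) \big\},
\]

where, in the MDL criterion, $L(M)$ is the code length of the model (e.g., its number of parameters) and $L(D\mid M)$ is the code length of the data encoded with that model.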
Minimax Entropy, formulated by Christensen, has more to do with per-
ceptual learning. Inferring the probability distribution characterizing a set of
images may be posed in terms of selecting, among all distributions match-
ing the statistics of the learning examples, the one with maximum entropy
(ME) (Jaynes’ maximum entropy principle). This ensures that the learned
distribution contains no more information than the examples. However, the
latter statistics depend on the features selected for computing them, and it
is desirable to select the ones yielding the maximum information (minimal
entropy).
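In symbols (a standard statement of the maximum entropy program; the feature functions $\phi_i$ and target statistics $\mu_i$ are generic placeholders, not the specific filters used later in the book):

\[
p^{*} = \arg\max_{p} H(p) \quad \text{s.t.} \quad \mathbb{E}_{p}[\phi_i(x)] = \mu_i,\ i=1,\dots,K
\qquad\Longrightarrow\qquad
p^{*}(x) = \frac{1}{Z(\lambda)}\exp\Big(\sum_{i=1}^{K}\lambda_i\,\phi_i(x)\Big),
\]

where the $\lambda_i$ are Lagrange multipliers and $Z(\lambda)$ is the normalizing constant; the "min" half of minimax entropy then selects the features $\phi_i$ whose inclusion most reduces the entropy of $p^{*}$.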
Regarding IT theories, we refer to mathematical developments. Two
examples are Rate Distortion Theory and the Method of Types. Rate Dis-
tortion Theory formalizes the question of what is the minimum expected
distortion yielded by lowering the bit-rate, for instance, in lossy compression.
From this point of view, clustering provides a compact representation (the
prototypes) that must preserve the information about the example patterns
or images. As classification implies some loss of individual identity in favor of
the prototypes, a unique class for all patterns yields the maximal loss of in-
formation (minimizes mutual information between examples and prototypes).
However, the compactness may be relaxed so that the average distortion is
constrained to be under a given threshold. A more elaborated criterion is the
Information-Bottleneck (IB) method in which the distortion upper-bound con-
straint is replaced by a lower bound constraint over the relevant information.
The key idea is to find a trade-off between maximizing information between
relevant features and prototypes (relevance), and minimizing the mutual in-
formation between the examples and prototypes (compactness).
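Both formulations admit a compact statement (standard forms, with $X$ the examples, $\tilde{X}$ the prototypes, $Y$ the relevance variable, and $d(\cdot,\cdot)$ a distortion function; $D$ and $\beta$ are user-chosen trade-off parameters):

\[
R(D) = \min_{p(\tilde{x}\mid x)\,:\ \mathbb{E}[d(x,\tilde{x})]\le D} I(X;\tilde{X}),
\qquad
\mathcal{L}_{\mathrm{IB}}\big[p(\tilde{x}\mid x)\big] = I(X;\tilde{X}) - \beta\, I(\tilde{X};Y),
\]

so rate distortion seeks maximal compression under a distortion budget, while the IB functional replaces the distortion budget by the requirement of keeping the relevant information $I(\tilde{X};Y)$ high.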
The Method of Types, developed by Csiszár and Körner, provides theo-
retical insights into calculating the probability of rare events. Types are asso-
ciated with empirical histograms, and Sanov's theorem yields bounds on the
probability that a given type lies within a certain set of types (for instance
those which indicate that the texture characterized by the histograms belongs
to a given class instead of to another one). As a consequence of Sanov's
theorem, the expected number of misclassified textures depends on the
Bhattacharyya distance, which, in turn, is bounded by Chernoff information.
Chernoff information quantifies the overlapping between the distributions as-
sociated to different classes of patterns (textures, for instance). Bhattacharyya
distance yields order parameters whose sign determines whether the dis-
crimination problem may be solved or not (in this latter case, when the two
pattern classes are too close). For the zero value, there is a phase transition.
These results come from the analysis of other tasks like detecting a path (edge)
among clutter, and the existence of order parameters, which allow us to quantify
the degree of success of related tasks like contour linking.
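For two class-conditional distributions $p$ and $q$ over the same alphabet, the measures named above have the standard expressions:

\[
C(p,q) = -\min_{0\le\lambda\le 1}\ \log\sum_{x} p(x)^{\lambda} q(x)^{1-\lambda},
\qquad
B(p,q) = -\log\sum_{x}\sqrt{p(x)\,q(x)},
\]

that is, the Bhattacharyya distance is the $\lambda = 1/2$ evaluation of the Chernoff exponent; both vanish when $p=q$ and grow as the two classes separate, which is why they govern the error exponents and order parameters discussed above.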
In addition to measures, principles and theories, there is a fourth dimen-
sion, orthogonal to the other three, to explore: the problem of estimating
entropy, the fundamental quantity in IT. Many other measures (mutual in-
formation, Kullback–Leibler divergence, and so on) are derived from entropy.
Consequently, in many cases, the practical application of the latter elements
relies on a consistent estimation of entropy. The two extreme cases are the
plug-in methods, in which the estimation of the probability density precedes
the computing of entropy, and the bypass methods where entropy is estimated
directly. For example, as the Gaussian distribution is the maximum entropy
one among all the distributions with the same variance, the latter consid-
eration is key in Gaussian Mixture Models, typically used as
classifiers, and also in ICA methods, which usually rely on the departure from
Gaussianity. Thus, entropy estimation will be a recurrent topic throughout the book.
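As a minimal illustration of the plug-in route (a toy sketch, not an estimator used in the book; the bypass estimators discussed later, such as entropic spanning graphs, avoid the explicit density step altogether), the snippet below estimates the differential entropy of one-dimensional samples from a normalized histogram; the number of bins is an arbitrary choice:

```python
import numpy as np

def plugin_entropy(samples, bins=32):
    """Plug-in entropy (in nats) of 1-D samples: first estimate the
    density with a normalized histogram, then evaluate -sum p log p
    over the non-empty bins (density = mass / bin width)."""
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()           # probability mass per bin
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / widths[nz]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, 10_000)
    # For a standard Gaussian the true value is 0.5*log(2*pi*e) ≈ 1.419 nats.
    print(plugin_entropy(x))
```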

1.2 Detailed Organization of the Book


Given the above introductory considerations, we now proceed to re-
view the organization of the book and the exposition of both the IT elements
and entropy estimation approaches within the context of specific computer
vision and pattern recognition problems. The first important consideration
is that we will follow an increasing-complexity problem driven exposition: in-
terest points and edges detection, contour grouping and segmentation, im-
age and point matching, image and pattern clustering, feature selection and
transformation, and finally, classifier design. The outcome is a total of six more
chapters. The proposed increasing-complexity exposition allows us to avoid
devoting this introductory chapter to mathematical formalisms. We sincerely
believe in introducing the mathematical elements at the same time that each
problem is formulated and tackled. IT beginners will understand the mathe-
matical elements through the examples provided by CVPR problems, and re-
searchers with an IT background will go ahead to the details of each problem.
Anyway, basic notions of the Theory of Probability and Bayesian Analysis are
needed, and additional formal details and demonstrations, lying beyond the
understanding of each problem, will be embedded in text at the level necessary
to follow the exposition. Let us give a panoramic view of each chapter.
The first topic dealt with in Chapter 2 is interest points and edges de-
tection. A recent definition of saliency implies both the computation of the
intensity information content in the neighborhood of a pixel, and how such in-
formation evolves along different scales. Information is quantified by entropy.
However, the bottleneck of this method is the scale-space analysis, and also
the huge amount of computation needed to extend it to the affine-invariant
case or the multidimensional case. This is critical when these features are used
for image matching. However, when information about the class of images is
available, it is possible to reduce significantly the computational cost. In order
to do so, we introduce concepts like Chernoff Information and Bhattacharyya
coefficient, coming from the formulation of edge detection as a statistical in-
ference problem, which is subsequently presented and discussed in detail. We
emphasize the interest of this methodology for: (i) evaluating models for edges
(and, in general, for other image features); (ii) evaluating and combining cues
(filters); and (iii) adaptation between data sets (classes of images). Finally,
we introduce the problem of edge linking when significant clutter is present,
and the approach derived from applying the Method of Types.
Chapter 3 is devoted both to contour-based and region-based segmentation.
We first present a modern approach to apply active contours (active polygons)
to textured regions. This leads us to introduce the Jensen–Shannon (JS) di-
vergence, as a cost function, because it can efficiently account for high-order
statistics. However, as such divergence relies on entropy and this, in turn,
relies on estimating the density function, the Jaynes’ maximum entropy (ME)
principle may be applied for estimating the shape of the density functions,
though practical reasons (efficiency) impose the use of bypass approximations
of entropy. Next key IT issue is how to measure the adequacy of the contour-
based segmentation. The Minimum Description Length (MDL) principle is
one of the answers. We firstly consider the problem of fitting a deformable
contour to a shape in terms of finding the simplest contour that best explains
its localization. Simplicity precludes a high number of descriptive parameters,
and good localization implies enough contrast between the image statistics of
inside and outside the shape (region model). From the algorithmic point of
view, this can be solved by trying with all the number of parameters within
a given range and finding, for each of them, the optimal placement of the
contour and the statistics of the region model. Alternatively, one may use
a jump-diffusion scheme to find the optimal number of regions in the im-
age, as well as their descriptive parameters. In this case, MDL is implicit in
the problem formulation and this optimal number of regions must be found.
However, the algorithmic solution presented in this chapter (DDMCMC3 ) is
a combination of high-level (top-down) hypothesis generator (Markov Chain
Monte Carlo simulation), which suggests changing of the number and type
of models (jumps), and a discriminative (bottom-up) computation that adds
image-level knowledge to the jumps probabilities and accelerates convergence.
The next interesting concept is the use of ME principle for learning a given
intensity model. In this regard, our choice is a method with an interesting con-
nection with belief propagation. Finally, in this chapter we look again at the
relationship between discriminative and generative approaches, their possible
integration, and the connection between segmentation and object recognition.
In this regard, an interesting fact is the quantification of convergence in terms
of Kullback–Leibler divergences.

3 Data driven Markov Chain Monte Carlo.
Chapter 4 addresses registration, matching, and recognition mainly rooted
in the elements: (i) alignment using mutual information; (ii) point-sets regis-
tration with JS divergence; (iii) deformable shape matching and comparison
with Fisher information; and (iv) structural learning with MDL. The first
topic, alignment using mutual information, refers to a large body of work
devoted to exploiting statistics, through mutual information, for finding the op-
timal 2D transformation between two images, or for even aligning 3D data
and model. Firstly, we present the classical approach consisting of a stochastic
maximization algorithm. In this case, the proposed solution relies on density
estimation in order to measure entropy, and thus, it is interesting to previously
introduce the Parzen windows method. On the other hand, when histograms
are used, it is important to analyze the impact of histogram binning in the
final result. This is not a negligible problem, especially when noise and more
general transformation arise. In this latter context, new ideas for representing
joint distributions are very interesting. Another problem when using mutual
information is the nature of the measure itself. It is far from being a met-
ric in the formal sense. Thus, it is interesting to review both derived and
alternative better conditioned measures, for instance, when affine deforma-
tions between model and target image arise. A good example is the sum of
conditional entropies, which has been proved to work well in the multimodal
case (matching between multiple images). It is also interesting to redefine the
Jensen divergence in terms of the Rényi entropy. Furthermore, the estimation
of such entropy is the target of the research in entropic graphs. Consequently,
the redefinition of mutual information in the same terms yields a novel ap-
proach to image registration. Next, Jensen divergence is considered again, in
this case as a metric for learning shape prototypes. The application of Fisher
information to shape matching and comparison is the following topic. To con-
clude this chapter, we have focused on structural matching, mainly in trees
(the simplest case) and, more specifically, in structural learning. This is one
of the open problems in structural pattern recognition, and we present here a
solution that considers the MDL principle as a driving force for shape learning
and comparison.
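To make the alignment-by-mutual-information idea concrete, here is a minimal sketch (not the EMMA algorithm described in Chapter 4; just a plug-in estimate of $I(A;B)$ from the joint intensity histogram of two equally sized grayscale images, with an arbitrary bin count). A registration loop would evaluate this score for each candidate transformation of the second image and keep the transformation that maximizes it:

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Plug-in mutual information (in nats) between two equally sized
    grayscale images, estimated from their joint intensity histogram."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                 # joint probability table
    px = pxy.sum(axis=1, keepdims=True)       # marginal of image A
    py = pxy.sum(axis=0, keepdims=True)       # marginal of image B
    nz = pxy > 0
    # I(A;B) = sum p(a,b) log( p(a,b) / (p(a) p(b)) )
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))
```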
Chapter 5 covers pattern and image clustering through many IT ap-
proaches to the clustering problem: (i) mixture models; (ii) the
Information Bottleneck method; (iii) the recent adaptation of Mean-Shift to
IT; and (iv) consensus in clustering ensembles driven by IT. Mixtures are
well-known models for estimation of probability density functions (pdf), as
well as for clustering vectorized data. The main IT concern here is the model
selection problem, that is, finding the minimal number of kernels which ad-
equately fit the data. Connection with segmentation (and with MDL) is im-
mediate. The classic EM algorithm for mixtures is initialization dependent
and prone to local maxima. We show that such problems can be avoided by
starting with a unique kernel and decomposing it when needed. This is the
entropy-based EM (EBEM) approach, and it consists of splitting a Gaussian
kernel when bi-modality in the underlying data is suspected. As Gaussian dis-
tributions are the maximum-entropy ones among all distributions with equal
variance, it seems reasonable to measure non-Gaussianity as the ratio between
the entropy of the underlying distribution and the theoretical entropy of the
kernel, when it is assumed to be Gaussian. As the latter criterion implies
measuring entropy, we may use either a plug-in method or a bypass one (e.g.
entropic graphs). Another trend in this chapter is the Information Bottleneck
(IB) method, whose general idea has been already introduced. Here, we start
by introducing the measure of distortion between examples and prototypes.
Exploiting Rate Distortion Theory in order to constrain the distortion (oth-
erwise information about examples is lost – maximum compression), a varia-
tional formulation yields an iterative algorithm (Blahut–Arimoto) for finding
the optimal partition of the data given the prototypes, but not for obtain-
ing the prototypes themselves. IB comes from another fundamental question:
which distortion measure to use. It turns out that trying to preserve the relevant
information in the prototype about another variable, which is easier than
finding a good distortion measure, leads to a new variational problem and
the Kullback–Leibler divergence emerges as natural distortion measure. The
new algorithm relies on deterministic annealing. It starts with one cluster and
progresses by splitting. An interesting variation of the basic algorithm is its
agglomerative version (start from as many clusters as patterns and build a
tree through different levels of abstraction). There is also recent work (RIC4 )
that addresses the problem of model-order selection through learning theory
(VC-dimension) and eliminates outliers by following a channel-capacity cri-
terion. Next item is to analyze how the yet classic and efficient mean-shift
clustering may be posed in IT terms. More precisely, the key is to minimize
the Rényi’s quadratic entropy. The last topic of the chapter, clustering ensem-
bles, seeks to obtain combined clusterings/partitions in an unsupervised
manner, so that the resulting clustering yields better quality than individ-
ual ones. Here, combination means some kind of consensus. There are several
definitions of consensus, and one of them, median consensus, can be found
through maximizing the information sharing between several partitions.
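The kernel-splitting test of EBEM outlined above can be illustrated as follows (a simplified, one-dimensional toy version, not the book's implementation: it compares a plug-in entropy estimate of the samples assigned to a kernel against the closed-form entropy of a Gaussian with the same variance; the 0.95 threshold is purely illustrative):

```python
import numpy as np

def gaussianity_ratio(samples, bins=32):
    """Ratio between the estimated (plug-in) entropy of 1-D samples and
    the entropy of a Gaussian with the same variance; values well below 1
    suggest a non-Gaussian (e.g., bimodal) kernel that could be split."""
    counts, edges = np.histogram(samples, bins=bins)
    widths = np.diff(edges)
    p = counts / counts.sum()
    nz = p > 0
    h_est = -np.sum(p[nz] * np.log(p[nz] / widths[nz]))
    h_gauss = 0.5 * np.log(2.0 * np.pi * np.e * samples.var())
    # Assumes h_gauss > 0 (data at unit-like scale); the multivariate
    # case handled by EBEM uses the covariance determinant instead.
    return h_est / h_gauss

def should_split(samples, threshold=0.95):
    return gaussianity_ratio(samples) < threshold
```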
After reviewing IT solutions to the clustering problem, Chapter 6 deals
with a fundamental question, feature selection, which has deep impact both
in clustering and in classifier design (which will be tackled in Chapter 7). We
review filters and wrappers for feature selection from the IT perspective. Con-
sidering wrappers, where the selection of a group of features conforms to the
performance induced in a supervised classifier (good generalization for unseen
patterns), mutual information plays an interesting role. For instance, max-
imizing mutual information between the unknown true labels associated to
a subset of features and the labels predicted by the classifier seems a good
criterion.

4 Robust Information Clustering.

However, the feature selection process (local search) is complex, and
going far from a greedy solution gets more and more impractical (exponential
cost), though it can be tackled through genetic algorithms. On the other hand,
filters rely on statistical tests for predicting the goodness of the future classi-
fier for a given subset of features. Recently, mutual information has emerged
as the source of more complex filters. However, the curse of dimensionality
precludes an extended use of such a criterion (maximal dependence between fea-
tures and classes), unless a fast (bypass) method for entropy estimation is
used. Alternatively, it is possible to formulate first-order approximations via
the combination of simple criteria like maximal relevance and minimal redun-
dancy. Maximal relevance consists of maximizing mutual information between
isolated features and target classes. However, when this is done in a greedy
manner it may yield redundant features that should be removed in the quasi-
optimal subset (minimal redundancy). The combination is dubbed mRMR.
A good theoretical issue is the connection between incremental feature selec-
tion and the maximum dependency criterion. It is also interesting to combine
these criteria with wrappers, and also to explore their impact on classifica-
tion errors. The next step in Chapter 6 is to tackle feature selection for generative
models. More precisely, texture models presented in Chapter 3 (segmentation)
may be learned through the application of the minimax principle. Maximum
entropy has been introduced and discussed in Chapter 3 as a basis for model
learning. Here we present how to specialize this principle to the case of learning
textures from examples. This may be accomplished by associating features to
filter responses histograms and exploiting Markov random fields (the FRAME
approach: Filters, Random Fields and Maximum Entropy). Maximum entropy
imposes matching between filter statistics of both the texture samples and the
generated textures through Gibbs sampling. On the other hand, filter selection
attending minimal entropy should minimize the Kullback–Leibler divergence
between the obtained density and the unknown density. Such minimization
may be implemented by a greedy process focused on selecting the next feature
inducing maximum decrease of Kullback–Leibler divergence with respect to
the existing feature set. However, as the latter divergence is complex to com-
pute, the L1 norm between the observed statistics and those of the synthesized
texture, both for the new feature, is finally maximized. Finally in Chapter 6,
we cover the essential elements necessary for finding an adequate projection
basis for vectorial data and, specially, images. The main concern with respect
to IT is the choice of the measures for quantifying the interest of a given pro-
jection direction. As we have referred to at the beginning of this introduction,
projection bases whose components are as much independent as possible seem
more interesting, in terms of pattern recognition, than those whose compo-
nents are simply decorrelated (PCA). Independence may be maximized by
maximizing departure from Gaussianity, and this is what many ICA algo-
rithms do. Thus, the concept of neg-entropy (difference between entropy of a
Gaussian and that of the current outputs/components distribution) arises. For
instance, the well-known FastICA algorithm is a fixed-point method driven by
neg-entropy maximization. The key point is how neg-entropy is approximated.
Another classical approach, Infomax, relies on minimizing the mutual infor-
mation between components, which can be done by maximizing the mutual
information between inputs to neural network processors and their associated
outputs. This is followed by the presentation of an improvement consisting of
minimizing the sum of marginal entropies, provided that a density estimation
is available. As we have seen in Chapter 3, the shape of the density function
is determined by the maximum entropy principle. A simple gradient descent
with respect to the parameterized version of the unknown matrix yields the
best projection basis. Finally, we explore how generalized PCA (gPCA) leads
to the simultaneous finding of the optimal subspaces fitting the data and their
classification. In this regard, we focus on the key concept of effective dimen-
sion, which is the basic element of model-order selection in gPCA. This is a
particularly interesting way to end the chapter for two reasons. First, it leaves
some open problems, and second, it connects both with the topic of feature
selection (Chapter 6) and the topic of classifier design (Chapter 7).
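A greedy mRMR pass, as outlined above, can be sketched in a few lines (a schematic version, not the exact formulation of Chapter 6; it assumes that a pairwise mutual-information routine `mi(a, b)` for discretized variables is supplied by the caller):

```python
def mrmr_select(features, labels, k, mi):
    """Greedy min-Redundancy Max-Relevance selection.
    features: list of 1-D arrays (one per candidate feature),
    labels:   class labels, mi(a, b): mutual-information estimator.
    Returns the indices of the k selected features."""
    selected = []
    candidates = list(range(len(features)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = mi(features[i], labels)
            redundancy = (sum(mi(features[i], features[j]) for j in selected)
                          / len(selected)) if selected else 0.0
            return relevance - redundancy
        best = max(candidates, key=score)   # most relevant, least redundant
        selected.append(best)
        candidates.remove(best)
    return selected
```

Each step adds the candidate that maximizes relevance to the class labels minus its average redundancy with the features already chosen.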
The last content-exposition chapter of the book is Chapter 7, and cov-
ers classifier design. We devote it to discuss the implications of IT in the
design of classifiers. We will cover three main topics: (i) from decision trees
to random forests; (ii) random forests in detail; and (iii) boosting and IT.
Decision trees are well-known methods for learning classifiers, and their foun-
dations emanate from IT. There has been a huge amount of research devoted
to achieve remarkable improvements. In this book, we are not pursuing an
exhaustive coverage of all of these improvements, but focusing on a couple of
significant contributions: model-based trees and random forests. Model-based
trees arise from changing the classical bottom-up process for building the trees
and considering a top-down approach that relies on a statistical model. Such
model is generated through global optimization, and the optimization crite-
rion might yield a trade-off between accuracy and efficiency. The impact of the
model-based approach is quite interesting. On the other hand, random forests
are tree ensembles yielding a performance comparable to AdaBoost and also
high robustness. We will review both Forest-RI (random input selection) and
Forest-RC (linear combination of inputs). Although random forests are good
ensembles, they may be improved, for instance, by decreasing the correlation
between individual trees in the forest. It is, thus, interesting to consider dif-
ferent improving strategies. Boosting methods constitute another class of clas-
sifier ensembles. We will present recent developments that include IT elements.
The first one to be presented is infomax boosting, which relies on integrating
good base classifiers, that is, very informative ones in the general AdaBoost
process. Infomax features maximize mutual information with respect to class
labels. For simplifying the process, the quadratic mutual information is cho-
sen. On the other hand, in the JSBoost approach, the most discriminating
features are selected on the basis of Jensen divergence. In the latter case, the
classification function is modified in order to enforce discriminability.
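Since the whole chapter builds on the entropy-based splitting criterion of decision trees, it is worth recalling its standard form (a textbook statement, not a result specific to this book): a candidate split of the samples $S$ at a node on attribute $A$ is scored by the information gain

\[
IG(S,A) = H(S) - \sum_{v\in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v),
\]

where $H(S)$ is the entropy of the class labels in $S$ and $S_v$ is the subset of samples taking value $v$ on $A$; the model-based trees, random forests, and boosting variants above change how, where, and with which measure this greedy criterion is applied.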
[Fig. 1.1. The ITinCVPR tube/underground (lines) communicating several problems (quarters) and stopping at several stations. See Color Plates.]

We finish this chapter, and the book, with an in-depth review of maximum
entropy classification, the exponential family of distributions, and their links with
information projection, and, finally, with the recent results on the implications
of Bregman divergences in classification.

1.3 The ITinCVPR Roadmap

The main idea of this section is to graphically describe the book as the map of
a tube/underground communicating several CVPR problems (quarters). The
central quarter is recognition and matching, which is adjacent to all the others.
The rest of the adjacency relations (↔) are: interest points and edges ↔ segmenta-
tion ↔ clustering ↔ classifier design ↔ feature selection and transformation.
Tube lines are associated with measures, principles, or theories. Stations
are associated with significant concepts for each line. The case of transfer stations
is especially interesting: here one can change from one line to another, carrying
along the concept acquired in the former line; note also the special stations associated
with entropy estimation. This idea is illustrated in Fig. 1.1, which is a
good map to revisit as the following chapters are read.
2 Interest Points, Edges, and Contour Grouping

2.1 Introduction
This chapter introduces the application of information theory to the field of
feature extraction in computer vision. Feature extraction is a low-level step in
many computer vision applications. Its aim is to detect visual cues in images
that help to improve the results or speed of such algorithms. The first
part of the chapter is devoted to the Kadir and Brady scale saliency algo-
rithm. This algorithm searches for the most informative regions of an image, that
is, the regions that are salient with respect to their local neighborhood. The
algorithm is based on the concept of Shannon's entropy. This section ends
with a modification of the original algorithm, based on two information-
theoretic measures: Chernoff information and the Kullback-Leibler divergence.
The next section is devoted to the statistical edge detection work by Kon-
ishi et al. In this work, Chernoff information and conditional entropy, two
measures that will be applied several times throughout this book, are
used to evaluate classifier performance. Alternative uses of information
theory include the theoretical study of some properties of algorithms. Specif-
ically, Sanov's theorem and the theory of types show the validity of the road-
tracking-among-clutter approach of Coughlan et al. Finally, the present chap-
ter introduces an algorithm by Cazorla et al., which is aimed at detecting
another type of image feature: junctions. This algorithm builds on the
principles explained earlier in the chapter.

2.2 Entropy and Interest Points


Part of the current research on image analysis relies on a process called inter-
est point extraction, also known as feature extraction. This is the first step of
several vision applications, like object recognition or detection, robot localiza-
tion, simultaneous localization and mapping, and so on. All these applications
are based on the processing of a set of images (usually the query and database
images) in order to produce a result. Repeating the same operations for all
pixels in a whole set of images may produce an extremely high computational
burden, and as a result, these applications may not be able to operate
in real time. Feature extraction in image analysis may be understood as a
preprocessing step, in which the objective is to provide a set of regions of the
image that are informative enough to successfully complete the previously
mentioned tasks. In order to be valid, the extracted regions should be invari-
ant to common transformations, such as translation and scale, and also to
more complex transformations. This is useful, for instance, when trying to
recognize an object from different views of the scene, or when a robot must
compare the image it has just taken with the ones in its database in order to
determine its localization on a map.
An extensive number of different feature extraction algorithms have been
developed during the last few years, the best known being the Harris corner
detector and its multiscale generalization [113],
or the recent Maximally Stable Extremal Regions algorithm [110], a fast, ele-
gant, and accurate approach. However, if we must design an algorithm to search for
informative regions of an image, then it is clear that an information theory-
based solution may be considered. In this section we explain how Gilles's first
attempt based on entropy [68] was first generalized by Kadir and Brady [91]
to be invariant to scale transformations, and then generalized again to
be invariant to affine transformations. We will also discuss the compu-
tational cost of this algorithm, its main drawback, and how it can be reduced by
means of the analysis of the entropy measure.

2.2.1 Kadir and Brady Scale Saliency Detector

Feature extraction may be understood as the process of looking for visually


salient regions of an image. Visual saliency can be defined as visual unpre-
dictability; an image region is a salient region if it is locally uncommon. Think
of a completely white image containing a black spot, or of an overview of a high-
way containing a lonely car; both the black spot and the car are locally uncom-
mon parts of their respective images. In information theory, unpredictability is
measured by means of Shannon's entropy. Given a discrete random variable X
that can take on possible values {x_1, . . . , x_N}, Shannon's entropy is calculated
as

H(X) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)    (2.1)

and it is measured in bits. Low entropy values correspond to predictable, less
informative random variables, that is, random variables in which the
probability of a given random value is much higher than the probability
of the rest of the values. On the contrary, higher entropy values correspond to
unpredictable random variables, in which the probability of all their possible
random values is similar. If we translate the feature extraction problem to the
domain of information theory, it is obvious that a feature extraction algorithm
must search for the highest entropy regions of the image.
This is the main idea of Gilles' feature extraction algorithm, which
formulates local saliency in terms of local intensity. Given a point x, a
local neighborhood R_x, and a descriptor D that takes values in {d_1, . . . , d_L}
(for instance, D = {0, . . . , 255} for an 8-bit gray level image), the local entropy
is defined as

H_{D,R_x} = -\sum_{i=1}^{L} P_{D,R_x}(d_i) \log_2 P_{D,R_x}(d_i)    (2.2)

where P_{D,R_x}(d_i) is the probability of descriptor D taking the value d_i in
the local region R_x. The highest entropy regions are considered salient regions
and returned as the output of the algorithm. The most evident drawback of
this approach is that the scale of the extracted regions, given by |R_x|, is a
prefixed parameter, and as a consequence, only image features that lie in a
small range of scales can be detected. Furthermore, Gilles' approach is very
sensitive to noise and small changes of the image, and the extracted features
are rarely stable over time.
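To make Eq. 2.2 concrete, the following Python sketch (ours, not part of the original algorithm description; the function name and structure are illustrative) estimates H_{D,R_x} from the gray-level histogram of a circular neighborhood, assuming an 8-bit grayscale image stored as a 2D numpy array:

import numpy as np

def local_entropy(image, x, y, radius):
    # H_{D,R_x} (Eq. 2.2): entropy of the gray-level histogram of the
    # circular neighborhood R_x of the given radius around pixel (x, y).
    h, w = image.shape
    ys, xs = np.ogrid[:h, :w]
    mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    counts = np.bincount(image[mask].astype(np.uint8).ravel(), minlength=256)
    p = counts / counts.sum()          # P_{D,R_x}(d_i)
    p = p[p > 0]                       # 0 log 0 = 0 by convention
    return -np.sum(p * np.log2(p))     # entropy in bits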
The Gilles algorithm was extended by Kadir and Brady. The scale saliency
algorithm considers salient regions not only in image space, but also in scale
space, achieving scale invariance. A region is considered salient if it is salient
in a narrow range of scales. A summary of this method is shown in Alg. 1.
The result of this algorithm is a sparse three-dimensional matrix containing
weighted local entropies for all pixels at those scales where entropy is peaked.
The highest values are selected as the most salient regions of the image, and
then a clustering (non-maximum suppression) process is launched in order to
merge highly overlapping regions. Figure 2.1 shows a comparison between the two
methods.
This algorithm detects isotropic salient regions, meaning that it is not
invariant to affine transformations like out of the plane rotations. However,

Fig. 2.1. Comparison of Gilles and scale saliency algorithms. Left: original synthetic
image. Center: results for Gilles algorithm, using |Rx | = 9, and showing only one
extracted region for each black circle. Right: scale saliency output, without clustering
of overlapping results.
Algorithm 1: Kadir and Brady scale saliency algorithm

Input: Input image I, initial scale s_min, final scale s_max
for each pixel x do
    for each scale s between s_min and s_max do
        Calculate local entropy H_D(s, x) = -\sum_{i=1}^{L} P_{s,x}(d_i) \log_2 P_{s,x}(d_i)
    end
    Choose the set of scales at which entropy is a local maximum:
        S_p = {s : H_D(s-1, x) < H_D(s, x) > H_D(s+1, x)}
    for each scale s between s_min and s_max do
        if s ∈ S_p then
            Entropy weight calculation by means of a self-dissimilarity measure in scale space:
                W_D(s, x) = \frac{s^2}{2s-1} \sum_{i=1}^{L} | P_{s,x}(d_i) - P_{s-1,x}(d_i) |
            Entropy weighting: Y_D(s, x) = H_D(s, x) W_D(s, x)
        end
    end
end
Output: A sparse three-dimensional matrix containing weighted local entropies for all pixels at those scales where entropy is peaked

this restriction is easily relaxed if we consider elliptical regions (anisotropic)
rather than circular regions (isotropic). This change is achieved by replacing the
scale parameter s with a set of parameters (s, φ, θ), where θ is the orientation of
the ellipse and φ the axis ratio. This way, the major and minor axes are calculated
as s/\sqrt{φ} and s\sqrt{φ}, respectively. Indeed, this modification increases time
complexity exponentially in the case of an exhaustive search, but an iterative
approach may be applied. Starting from the regions detected by the original
isotropic algorithm as seeds, their ellipse parameters are refined. First W_D is
maximized by modifying the ratio and orientation, and then H_D is maximized
by modifying the scale. These two steps are repeated until there is no change. An
example is shown in Fig. 2.2.
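Returning to the isotropic case, the following Python sketch (a naive, unoptimized rendering of Alg. 1 under the same assumptions as the previous sketch; helper names are ours) computes the weighted entropy Y_D(s, x) at entropy-peaked scales:

import numpy as np

def _histogram(image, x, y, radius):
    # Normalized gray-level histogram P_{s,x} of the circular neighborhood.
    h, w = image.shape
    ys, xs = np.ogrid[:h, :w]
    mask = (xs - x) ** 2 + (ys - y) ** 2 <= radius ** 2
    counts = np.bincount(image[mask].astype(np.uint8).ravel(), minlength=256)
    return counts / counts.sum()

def _entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def scale_saliency(image, s_min, s_max):
    # Alg. 1 sketch: map (x, y, s) -> Y_D(s, x) at entropy-peaked scales.
    h, w = image.shape
    scales = list(range(s_min, s_max + 1))
    Y = {}
    for y in range(h):
        for x in range(w):
            P = [_histogram(image, x, y, s) for s in scales]
            H = [_entropy(p) for p in P]
            for k in range(1, len(scales) - 1):
                if H[k - 1] < H[k] > H[k + 1]:               # s in S_p
                    s = scales[k]
                    W = (s * s / (2.0 * s - 1.0)) * np.abs(P[k] - P[k - 1]).sum()
                    Y[(x, y, s)] = H[k] * W                  # Y_D = H_D * W_D
    return Y

The highest Y_D values would then be selected and clustered (non-maximum suppression), as described above.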

2.2.2 Point Filtering by Entropy Analysis Through Scale Space

The main drawback of this method is its time complexity, due to the fact
that entropy must be computed for each pixel at each scale. As a consequence, the
Kadir and Brady scale saliency feature extractor is the slowest of all modern feature
extraction algorithms, as recent surveys suggest. However, complexity may
be remarkably decreased if we consider a simple idea: given a pixel x and two
different neighborhoods R_x and R'_x, if R_x is homogeneous and |R_x| > |R'_x|,
Fig. 2.2. Left: original synthetic image, formed by anisotropic regions. Center:
output of the scale saliency algorithm. Right: output of the affine invariant scale
saliency algorithm.


Fig. 2.3. Evolution of saliency through scale space. The graph on the right shows
how the entropy value (z axis) of all pixels in the row highlighted on the left image
(x axis) varies from s_min = 3 to s_max = 20 (y axis). As can be seen, there are no
abrupt changes for any pixel in the scale range.

then the probability of R'_x also being homogeneous is high. Figure 2.3
shows that this idea is supported by the evolution of entropy through scale
space; the entropy value of a pixel varies smoothly through different scales.
Therefore, a preprocessing step may be added to the original algorithm in
order to discard several pixels from the image, based on the detection of
homogeneous regions at s_max. This previous stage is summarized as follows (a
sketch is given at the end of this section):
1. Calculate the local entropy H_D for each pixel at scale s_max.
2. Select an entropy threshold σ ∈ [0, 1].
3. X = {x | H_D(x, s_max) / max_x{H_D(x, s_max)} > σ}.
4. Apply the scale saliency algorithm only to those pixels x ∈ X.
It must be noted that the algorithm considers the relative entropy with re-
spect to the maximum entropy value at s_max for the image in step 3. This way,
a unique threshold may be applied to a set of images, regardless of their ab-
solute entropy values. But a question arises: how should this threshold σ be
chosen? A low threshold may discard only a small amount of points, and a higher
one may discard points that are part of the most salient regions of the image.
The threshold may be easily chosen a posteriori for an image if its most salient
regions are detected and their minimum relative entropy at scale s_max is then
selected. However, the optimal threshold for another image will be completely
different. The solution is to obtain an appropriate threshold for a set of several
similar images by means of statistical learning and Bayesian analysis.
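As announced above, here is a minimal sketch of the preprocessing stage (steps 1-4), assuming an 8-bit grayscale numpy image and a user-chosen σ; the function name and structure are ours:

import numpy as np

def entropy_prefilter(image, s_max, sigma):
    # Steps 1-4: compute the local entropy at s_max for every pixel and
    # keep only those whose relative entropy exceeds sigma.
    h, w = image.shape
    ys, xs = np.ogrid[:h, :w]
    H = np.zeros((h, w))
    for y in range(h):
        for x in range(w):
            mask = (xs - x) ** 2 + (ys - y) ** 2 <= s_max ** 2
            counts = np.bincount(image[mask].astype(np.uint8).ravel(), minlength=256)
            p = counts / counts.sum()
            p = p[p > 0]
            H[y, x] = -np.sum(p * np.log2(p))      # step 1: H_D at s_max
    keep = (H / H.max()) > sigma                   # steps 2-3: relative entropy threshold
    return keep                                    # step 4: run scale saliency on True pixels only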

2.2.3 Chernoff Information and Optimal Filtering

Images belonging to the same image category or environment share similar


intensity and texture distributions, so it seems reasonable to think that the
entropy values of their most salient regions will lie in the same range. A proper
identification of this range for a set of images may be performed by means
of the study of two probability density functions known as pon (θ) and poff (θ).
The pon(θ) pdf defines the probability of a region being part of the most salient
regions of the image given that its relative entropy value is θ, while poff(θ)
defines the probability of a region not being part of the most salient regions
of the image. This way, a previous learning stage should involve a training set
of images from an image category; from these images, pon(θ) and poff(θ) are
calculated. Then, the minimum relative entropy σ for which pon(σ) > 0 may be
chosen as an entropy threshold for that image category. This way we avoid
selecting a threshold that would discard any of the most salient regions in training
images. When a new image belonging to that category must be processed, all
pixels whose relative entropy at smax is lower than σ are discarded before
applying scale saliency algorithm. However, Fig. 2.4 shows that this approach
may be improved.
As illustrated in Fig. 2.4, more pixels may be discarded by selecting a higher
threshold if we accept the loss of some true salient regions, resulting in a de-
crease of computation time. It would be desirable to reach a trade-off between
a high amount of discarded points and a low number of actual salient regions
discarded. Looking at the examples in the previous figure, it is clear that the less
the two distributions overlap, the easier it is to find a threshold that
discards many nonsalient regions without removing too many salient
ones; if the selected training images are homogeneous enough, the overlap of
pon (θ) and poff (θ) is low. On the contrary, a heterogeneous set of images will
result in similar pon (θ) and poff (θ) distributions, and it will be difficult to
select an adequate threshold.
Thus, the splitting of training images into several homogeneous categories is
a crucial step. Although addressing the point of selecting an optimal partition
of the training images is out of the scope of this chapter, we should highlight a
statistical measure that may help to decide if the partition is optimal: Chernoff

Fig. 2.4. pon(θ) (solid plot) and poff(θ) (dashed plot) distributions estimated from
all pixels of a set of images belonging to the same image category. Threshold θ1 ensures
that the filter does not remove any pixel of the training images that is part of
the most salient regions; however, in the case of new input images, we may find
salient regions whose relative entropy is lower than this threshold. Choosing
threshold θ2 increases the amount of filtered points in exchange for increasing the
probability of filtering salient regions of the image. Finally, threshold θ3 assumes a
higher risk, as even more salient and nonsalient regions of the image will be filtered.
However, the probability of a nonsalient pixel being removed from the image is still
higher than in the case of a salient one.

Information. The expected error rate of a likelihood test based on pon (φ)
and poff (φ) decreases exponentially with respect to C(pon (φ), poff (φ)), where
C(p, q) is the Chernoff Information between two probability distributions p
and q, and is defined by
C(p, q) = -\min_{0 \le \lambda \le 1} \log\left( \sum_{j=1}^{J} p^{\lambda}(y_j)\, q^{1-\lambda}(y_j) \right)    (2.3)

where {y_j : j = 1, . . . , J} are the values that the distributions are de-
fined over (in this case, the possible relative entropy values in [0, 1]).
Chernoff Information quantifies how easy it is to know from which of the two
distributions a set of values came. This measure may be used as a homogene-
ity estimator for an image class during training. If the Chernoff Information
of a training set is low, then the images in that image class are not homogeneous
enough and it must be split into two or more classes. A related measure
is Bhattacharyya Distance. Bhattacharyya Distance is a particular case of
Chernoff Information in which λ = 1/2:
BC(p, q) = -\log\left( \sum_{j=1}^{J} p^{1/2}(y_j)\, q^{1/2}(y_j) \right)    (2.4)
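For illustration, a small Python sketch (ours) that evaluates Eqs. 2.3 and 2.4 for two discrete distributions given as normalized histograms over the same bins, using a simple grid search over λ:

import numpy as np

def chernoff_information(p, q, n_lambda=101):
    # Eq. 2.3: C(p, q) = -min_lambda log sum_j p^lambda(y_j) q^(1-lambda)(y_j),
    # approximated by a grid search over lambda in [0, 1].
    lambdas = np.linspace(0.0, 1.0, n_lambda)
    vals = [np.log(np.sum(p ** lam * q ** (1.0 - lam))) for lam in lambdas]
    return -min(vals)

def bhattacharyya_distance(p, q):
    # Eq. 2.4: the lambda = 1/2 particular case of Chernoff Information.
    return -np.log(np.sum(np.sqrt(p * q)))

For identical distributions the inner sum equals 1 for every λ, so C(p, q) = 0; the less the distributions overlap, the larger the value becomes.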
Figure 2.4 is again useful to understand how pixels of an image should
be discarded. The log-likelihood ratio log(pon(θ)/poff(θ)) is zero when
pon(θ) = poff(θ), that is, when the probability of a pixel with relative entropy
θ at s_max being part of the most salient regions is equal to the probability of
not being part of them. On the other hand, positive values correspond to relative
entropies that are more likely to be associated with the most salient regions.
Thus, the log-likelihood ratio of these two densities may be used to discard
points of an image. A threshold T must be chosen for an image class, so that
any pixel from an image belonging to that image class is discarded
if log(pon(θ)/poff(θ)) < T.
Once again, information theory provides a tool capable of estimating a
valid range in which this threshold should be chosen. The Kullback–Leibler
divergence or relative entropy between two distributions p and q, given by

D(p||q) = \sum_{j=1}^{J} p(y_j) \log \frac{p(y_j)}{q(y_j)}    (2.5)

measures the dissimilarity between p and q. It estimates the coding efficiency
loss of assuming that the distribution is q when the real distribution is p. The
range of valid T values is given by (see Section 2.4.3)

-D(poff(θ)||pon(θ)) < T < D(pon(θ)||poff(θ))    (2.6)

Selecting the minimum T value in this range ensures a good trade-off
between a relatively high amount of discarded points and a low error. More
pixels can be filtered by increasing T in this range, assuming that the error of
the final results will increase depending on the Chernoff Information between
pon(θ) and poff(θ). In fact, the Kullback-Leibler divergence and the Chernoff
Information between these two distributions are related. If the Chernoff value is
low, the distributions are similar and it is difficult to extract a good threshold to
split points into homogeneous and non-homogeneous; as a consequence, the
value of T must be selected from a narrower range.

2.2.4 Bayesian Filtering of the Scale Saliency Feature Extractor: The Algorithm
After introducing all the factors that take part in this approach to decrease
the computation time of the Kadir and Brady scale saliency algorithm, let us put
them together to explain how it can be applied. First, we summarize how to
extract a valid threshold from a set of training images belonging to the same
image class or category:
1. Calculate the pon(θ) and poff(θ) probability distributions from all points in a
set of training images, considering whether these points are part (on) or not (off)
of the final set of most salient features of their corresponding image, θ
being the relative entropy value of a pixel at s_max with respect to the
maximum entropy value of any pixel of its image at the same scale.
2. Evaluate C(pon(θ), poff(θ)). A low value means that the image class is not
homogeneous enough, and it is not possible to learn a good threshold. In
this case, split the image class into new subclasses and repeat the process
for each of them.
3. Estimate the range limits -D(poff(θ)||pon(θ)) and D(pon(θ)||poff(θ)).
4. Select a threshold in the range given by the Kullback-Leibler divergences,
-D(poff(θ)||pon(θ)) < T < D(pon(θ)||poff(θ)). The minimum T value in this
range is a conservative but good trade-off between efficiency and a low error rate.
Higher T values will increase the error rate according to C(pon(θ), poff(θ)).

Then, new images belonging to the same image category can be filtered
before applying the scale saliency algorithm, discarding points that probably
are not part of the most salient features:

1. Calculate the local relative entropy θx = HDx /Hmax at smax for each pixel
x, where Hmax is the maximum entropy value for any pixel at smax .
2. X = {x | log(pon(θ_x)/poff(θ_x)) > T}, where T is the learned threshold for the image
class that the input image belongs to.
3. Apply the scale saliency algorithm only to pixels x ∈ X.
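A compact sketch of the whole training/filtering pipeline under the assumptions above (relative entropies and on/off labels are assumed to be available for the training pixels; the histogram estimation, the bin count, and the grid search for the Chernoff value are our own choices):

import numpy as np

def learn_threshold(train_rel_entropies, train_is_salient, n_bins=50):
    # Training-stage sketch: histogram estimates of p_on/p_off, Chernoff
    # information as a homogeneity check, and the conservative threshold
    # T = -D(p_off||p_on), i.e., the lower end of the range in Eq. 2.6.
    theta = np.asarray(train_rel_entropies, dtype=float)
    on = np.asarray(train_is_salient, dtype=bool)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    eps = 1e-12
    p_on = np.histogram(theta[on], bins=bins)[0] + eps
    p_off = np.histogram(theta[~on], bins=bins)[0] + eps
    p_on, p_off = p_on / p_on.sum(), p_off / p_off.sum()
    lambdas = np.linspace(0.0, 1.0, 101)
    chernoff = -min(np.log(np.sum(p_on ** l * p_off ** (1 - l))) for l in lambdas)
    T = -np.sum(p_off * np.log(p_off / p_on))            # -D(p_off || p_on)
    return T, chernoff, bins, p_on, p_off

def filter_pixels(rel_entropies, T, bins, p_on, p_off):
    # Filtering-stage sketch: keep pixels with log(p_on/p_off) > T.
    idx = np.clip(np.digitize(rel_entropies, bins) - 1, 0, len(p_on) - 1)
    return np.log(p_on[idx] / p_off[idx]) > T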

In order to demonstrate the validity of this method, Table 2.1 shows some
experimental results, based on the well-known Object Categories dataset from
the Visual Geometry Group, freely available on the web. This dataset is composed
of several sets of images representing different image categories. These results
were extracted following these steps: first, the training process was applied to
each image category, selecting a random 10% of the images as the training set.
The result of the first step is a range of valid thresholds for each image cate-
gory. Chernoff Information was also estimated. Then, using the rest of the images
from each category as the test set, we applied the filtering algorithm, using two
different thresholds in each case: the minimum valid value for each category
and T = 0. Table 2.1 shows the results for each image category, including
the mean amount of points (% points) filtered and the mean amount of time
(% time) saved for each image category, depending on the threshold used. The
last column shows the mean localization error ε of the extracted features:

\epsilon = \frac{1}{N} \sum_{i=1}^{N} \frac{d(A_i, B_i) + d(B_i, A_i)}{2}    (2.7)

where N is the number of images in the test set, Ai represents the clustered
most salient regions obtained after applying the original scale saliency algo-
rithm to image i, Bi represents the clustered most salient regions obtained
from the filtered scale saliency algorithm applied to image i, and

d(A, B) = \sum_{a \in A} \min_{b \in B} \| a - b \|    (2.8)
Table 2.1. Application of the filtered scale saliency to the Visual Geometry Group
image categories database.
Test set           Chernoff    T       % Points   % Time    ε
Airplanes side 0.415 −4.98 30.79 42.12 0.0943
0 60.11 72.61 2.9271
Background 0.208 −2.33 15.89 24.00 0.6438
0 43.91 54.39 5.0290
Bottles 0.184 −2.80 9.50 20.50 0.4447
0 23.56 35.47 1.9482
Camel 0.138 −2.06 10.06 20.94 0.2556
0 40.10 52.43 4.2110
Cars brad 0.236 −2.63 24.84 36.57 0.4293
0 48.26 61.14 3.4547
Cars brad bg 0.327 −3.24 22.90 34.06 0.2091
0 57.18 70.02 4.1999
Faces 0.278 −3.37 25.31 37.21 0.9057
0 54.76 67.92 8.3791
Google things 0.160 −2.15 14.58 25.48 0.7444
0 40.49 52.81 5.7128
Guitars 0.252 −3.11 15.34 26.35 0.2339
0 37.94 50.11 2.3745
Houses 0.218 −2.62 16.09 27.16 0.2511
0 44.51 56.88 3.4209
Leaves 0.470 −6.08 29.43 41.44 0.8699
0 46.60 59.28 3.0674
Motorbikes side 0.181 −2.34 15.63 27.64 0.2947
0 38.62 51.64 3.7305

is a Euclidean distance based measure between the most salient clustered
regions. It must be noted that this distance is not only calculated in image
space, but also in scale space; thus, errors in both localization and scale are
considered. As can be seen, using the more conservative threshold yields a
good trade-off between saved time and error, but using T = 0 produces no-
ticeably improved results while keeping a low error rate. Generally, better results
are obtained when the Chernoff Information value is high, with more points dis-
carded or a lower error rate. Some examples of application are shown in
Fig. 2.5.

2.3 Information Theory as Evaluation Tool: The Statistical Edge Detection Case
Edges are another kind of feature extracted in the early steps of several higher
level computer vision algorithms. An edge is a part of the image where a sharp
intensity discontinuity is present. These features are useful in the sense that
they give an idea of the shape or boundaries of the objects in the image,
Fig. 2.5. Examples of filtering previous to scale saliency algorithm, applied to


three images belonging to different image categories (cars brad bg, background and
airplanes side). From left to right: original image, results of the original scale saliency
algorithm, results using the minimum valid threshold of the image category, results
using T = 0. In all cases, black regions represent filtered points.

Fig. 2.6. Left: an example of natural scene containing textured regions. Center:
output of the Canny algorithm applied to the left image. A high amount of edges
appear as a consequence of texture and clutter; these edges do not correspond to
object boundaries. Right: an example of ideal edge detection, in which most of the
edges are part of actual object boundaries.

information that may be used in applications like object detection, medical
imaging, or written character recognition. The most common edge detection
algorithms used in the literature are based on convolution masks, Canny,
Sobel, and Prewitt being some of these methods; however, their main drawback is
that they do not provide an accurate response in the case of natural scenes
with considerable background clutter, as can be seen in Fig. 2.6.
We can deal with this problem by means of a statistical learning approach,
quite similar to the method explained above to remove pixels from an image
before applying the Kadir and Brady feature extractor. The work by Konishi
et al. [99] is a good example. Their aim was to prove that it is possible to
robustly extract edges from an image using simple filters at different scales.
22 2 Interest Points, Edges, and Contour Grouping

But more interesting is that they use Chernoff Information and conditional
entropy to evaluate the effect on edge extraction performance of several as-
pects of their approach: the use of a set of scales rather than only one scale,
the quantization of histograms used to represent probability densities, and the
effect of representing the scale space as an image pyramid, for instance. Thus,
information theory is introduced as a valid tool for obtaining information
about statistical learning processes.
Conditional entropy H(Y|X) is a measure of the remaining entropy or
uncertainty of a random variable Y, given another random variable X whose
value is known. A low value of H(Y|X) means that the variable X
yields a high amount of information about variable Y, making it easier to predict
its value. In the discrete case, it can be estimated as


H(Y|X) = \sum_{i=1}^{N} p(x_i) H(Y|X = x_i) = -\sum_{i=1}^{N} p(x_i) \sum_{j=1}^{M} p(y_j|x_i) \log p(y_j|x_i)    (2.9)

We will explain below how this measure may be applied to the context of
classification evaluation.
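As a concrete reference (ours), conditional entropy can be computed from a joint count table as follows; we use base-2 logarithms for consistency with Eq. 2.1, although Eq. 2.9 leaves the base generic:

import numpy as np

def conditional_entropy(joint_counts):
    # Eq. 2.9: H(Y|X) in bits from a count table of shape (N, M), where
    # joint_counts[i, j] is the number of co-occurrences of X = x_i and Y = y_j.
    joint = np.asarray(joint_counts, dtype=float)
    p_xy = joint / joint.sum()                        # p(x_i, y_j)
    p_x = p_xy.sum(axis=1, keepdims=True)             # p(x_i)
    p_y_given_x = np.divide(p_xy, p_x, out=np.zeros_like(p_xy), where=p_x > 0)
    logs = np.zeros_like(p_y_given_x)
    nz = p_y_given_x > 0
    logs[nz] = np.log2(p_y_given_x[nz])
    return -np.sum(p_xy * logs)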

2.3.1 Statistical Edge Detection

The Kadir and Brady feature extractor filter described above is inspired by
statistical edge detection. It implies learning two probability distributions,
named pon(φ) and poff(φ), of the response of a certain filter φ for pixels that
are part of an edge and for pixels that are not, respectively. In the work by
Konishi et al., the set of filters used for edge detection was a simple one (gradi-
ent magnitude, Nitzberg, Laplacian) applied to different data obtained from
the images (gray-scale intensity, full color intensities, and chrominance).
Then, the log-likelihood ratio may be used to categorize any pixel I(x) on
an image I, classifying it as an edge pixel if this log-likelihood ratio is over a
given threshold:
\log \frac{p_{on}(\phi(I(x)))}{p_{off}(\phi(I(x)))} > T    (2.10)
These distributions may be learned from an image dataset that includes a
groundtruth edge segmentation indicating the real boundaries of the objects
in a sequence of images (like, for instance, the right image in Fig. 2.6). Two
well-known examples are the South Florida and the Sowerby datasets, which,
unfortunately, are not freely available. The presence of poff (φ) is an improve-
ment over traditional edge detection methods. The only information used by
edge detection algorithms based on convolution masks is local information
close to the edges; poff (φ) may help to remove edges produced by background
clutter and textures.

Fig. 2.7. pon (φ) (solid line) and poff (φ) (dashed line) for two example filter based
classifiers. The overlap of the distributions on the left is lower than in the case on
the right. In the first case, Chernoff Information value is 0.508, while in the second
case, the value is 0.193. It can be clearly seen that the first classifier will distinguish
better between on-edge and off-edge pixels.

As explained in previous sections, Chernoff Information is a feasible indi-
cator of the performance of a classifier based on the log-likelihood ratio (see
Eq. 2.10). When Chernoff Information values are high, pon(φ) and poff(φ) are
clearly separable, that is, given the response of a filter for a pixel x, it is easy
to know whether this pixel is part of an edge or not. On the other hand, when the
Chernoff Information value is low, the overlap of these two distributions
is too high and it is difficult to characterize positive and negative samples.
An example applied to the edge detection problem is shown in Fig. 2.7. Several
examples of classifier evaluation results by Konishi et al., based on Chernoff
Information, can be seen in Fig. 2.8.

2.3.2 Edge Localization

The edge localization problem can be posed as the classification of all pixels of
an image as being an edge or not depending on their distance to the nearest
edge. This problem may be seen as a binary or a multiple class classification.
Binary classification means that pixels are split into two categories: pixels
whose nearest edge is below or above a certain distance. In the case of multiple
class classification, each pixel is assigned a category depending on the distance
to its nearest edge.
Let us first deal with binary classification. Given the function w(x), which
assigns to each pixel the distance to its nearest edge, and a threshold w, pixels
can be split into two groups or classes:

α_1 = {x : w(x) ≤ w},   α_2 = {x : w(x) > w}    (2.11)

Fig. 2.8. Evolution of Chernoff Information for an edge classifier based on different
filters when applied to full color, gray scale and chrominance information extracted
from the images of the South Florida dataset, when using scale σ = 1, two scales
σ = {1, 2}, and three scales σ = {1, 2, 4}. Chernoff Information increases as more
scales are included; the conclusion is that a multiscale edge detection approach will
perform better than a monoscale one. Furthermore, color information yields even
higher Chernoff Information values, so this information is useful for this task. Further
experiments of Chernoff Information evolution depending on scale in [98] show that
if only one scale can be used, it would be better to choose an intermediate one.
(Figure by Konishi et al. [99] © 2003 IEEE.)

Using the training groundtruth edge information, the p(φ|α1) and p(φ|α2)
conditional distributions, as well as the prior distributions p(α1) and p(α2), must
be estimated. The classification task now is simple: given a pixel x and the
response of the filter φ for that pixel, φ(x) = y, Bayes' rule yields p(α1|φ(x) = y)
and p(α2|φ(x) = y), allowing the classification algorithm to decide on the class
of pixel x. Chernoff Information may also be used to evaluate the performance
of the binary edge localization. A summary of several experiments of Konishi
et al. [98] supports the coarse-to-fine edge localization idea: the coarse step consists
of looking for the approximate localization of the edges on the image using
only an optimal scale σ*. In this case, w = σ*. Then, in the fine step, filters
based on lower scales are applied in order to refine the search. However, the
parameter σ* depends on the dataset.
Multiclass classification differs from the previous task in the number of
classes considered, which is now greater than two. For instance, we
could split the pixels into five different classes:


α_1 = {x : w(x) = 0}
α_2 = {x : w(x) = 1}
α_3 = {x : w(x) = 2}    (2.12)
α_4 = {x : 2 < w(x) ≤ 4}
α_5 = {x : w(x) > 4}

Once again, using the training groundtruth, we can estimate p(φ|α_i)
and p(α_i) for i = 1, . . . , C, C being the number of classes (5 in our example).
Then, any pixel x is assigned the class α*, where

\alpha^* = \arg\max_{i=1,\ldots,C} p(\phi(x) = y | \alpha_i)\, p(\alpha_i)    (2.13)

In this case, conditional entropy is a useful tool to measure the performance
of a filter φ:

H(\phi|y) = -\sum_{y} \sum_{i=1}^{C} p(\alpha_i|\phi = y)\, p(y) \log p(\alpha_i|\phi = y)    (2.14)

The conditional entropy quantifies the decrease of uncertainty about the
class α_i a pixel belongs to after an observation φ = y has been made. The per-
formance of a filter φ_1 is better than the performance of a filter φ_2 if its
conditional entropy is lower (in contrast to the Chernoff Information case,
for which higher values mean better performance); lower conditional entropy
yields lower uncertainty, and as a consequence, it is easier to categorize a pixel.
It must be noted that H(φ|y) is, in any case, lower than or equal to H(α_i), that
is, the response of a filter will decrease the a priori uncertainty that we have
about the category of a pixel, or at least it will not increase it. A compari-
son of H(φ|y) with respect to H(α_i) may be a useful relative measure of the
goodness of a filter. Several experiments carried out by Konishi et al. are based on
conditional entropy as an estimator of the performance of different filters in
the multiclass edge localization problem [98].
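The following sketch (ours) evaluates a scalar filter in the spirit of Eq. 2.14, quantizing its responses into regular bins rather than the decision-tree quantization used by Konishi et al.; `responses` and `labels` are assumed to be the filter outputs and the groundtruth class indices for a set of training pixels:

import numpy as np

def filter_conditional_entropy(responses, labels, n_classes, n_bins=32):
    # H(alpha | phi) with responses quantized into n_bins regular bins;
    # lower values mean a more informative filter.
    responses = np.asarray(responses, dtype=float)
    labels = np.asarray(labels, dtype=int)
    edges = np.linspace(responses.min(), responses.max(), n_bins + 1)
    y = np.clip(np.digitize(responses, edges) - 1, 0, n_bins - 1)
    joint = np.zeros((n_bins, n_classes))
    for yi, ci in zip(y, labels):
        joint[yi, ci] += 1
    p_ya = joint / joint.sum()                         # p(y, alpha_i)
    p_y = p_ya.sum(axis=1, keepdims=True)              # p(y)
    p_a_given_y = np.divide(p_ya, p_y, out=np.zeros_like(p_ya), where=p_y > 0)
    logs = np.zeros_like(p_a_given_y)
    nz = p_a_given_y > 0
    logs[nz] = np.log(p_a_given_y[nz])
    return -np.sum(p_ya * logs)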
Filter evaluation based on Chernoff Information and conditional entropy
implies that the conditional distributions shown throughout this section must
be represented as histograms. In the work by Konishi et al., decision trees
are the basis of the probability distribution quantization. Since decision trees are
introduced in Chapter 7, we do not give a detailed explanation of this part
of the algorithm. However, we mention it in order to highlight an im-
portant aspect of Chernoff Information as a classifier evaluation tool: its ability to
detect overlearning.
A brief summary of how to perform quantization to build the histograms
follows: starting from the whole filter space (the space representing all the
possible filter responses; it could be a multidimensional space), a greedy algo-
rithm looks for the cut in any axis that maximizes Chernoff Information. This
cut produces two subspaces; for each of them, a new cut must be found that
also maximizes Chernoff Information. This step is repeated with all the new
subspaces until the desired number of cuts is reached. An advantage of this
process is that histogram bins will be adapted to data: the parts of the filter
space that need more discretization will be represented by a higher amount of
bins. Another advantage is that a multidimensional filter can be represented
using a notably lower number of bins than in the case of a regular quantiza-
tion. As we can see in Fig. 2.9 (left), only a small number of cuts are needed to

Fig. 2.9. Left: two examples of how Chernoff Information evolves depending on
the number of cuts of the decision tree that defines the probability distribution
quantization. An asymptote is reached soon; most of the information can be obtained
using a low amount of histogram bins. Right: example of overlearning. As the
number of cuts increases an asymptote is reached, but, at a certain point, the value starts to
increase again. This is an evident sign of overlearning. (Figure by Konishi et al. [99]
© 2003 IEEE.)

reach an asymptote, that is, only a relatively low amount of bins is needed to
obtain a good quantization. On the other hand, in Fig. 2.9 (right), we can see
the overlearning effect. If the number of cuts is high enough, an abrupt increase of
the Chernoff Information value appears. This abrupt increase is produced by the
fact that the chosen quantization is overfitted to the training data, for which
the classification task will be remarkably easier; however, the obtained classi-
fier will not be valid for examples different from those in the training dataset.

2.4 Finding Contours Among Clutter


In this section, we move to a related topic to introduce Sanov's theorem
and the method of types [43] in the context of contour finding. Sanov's theorem
deals with rare events and their chance of occurring. Given a set x^N of N i.i.d.
samples obtained from an underlying distribution P_s, a histogram or type
can be built from the sample. By the law of large numbers, as N → ∞,
this histogram will converge to the underlying distribution. The set of typical
types T_{P_s} for a distribution P_s is
T_{P_s} = \{x^N : D(P_{x^N} \| P_s) \le \epsilon\}    (2.15)


The probability that x^N ∈ T_{P_s} tends to 1 as N → ∞. On the other
hand, the probability of a nontypical type x^N ∉ T_{P_s} tends to 0 as N → ∞.
Sanov's theorem puts a bound on the probability of these nontypical types
(rare events).
Let x_1, x_2, . . . , x_N be an i.i.d. sample sequence with type φ drawn from a distribution P_s(x)
with alphabet size J and let E be any closed set of probability distributions.
Then

\frac{2^{-N D(\phi^* \| P_s)}}{(N+1)^J} \le p(\phi \in E) \le (N+1)^J\, 2^{-N D(\phi^* \| P_s)}    (2.16)

Fig. 2.10. The triangle represents the set of probability distributions, E being a
subset within this set. Ps is the distribution which generates the samples. Sanov’s
theorem states that the probability that a type lies within E is determined by
the distribution P* in E which is closest to P_s.

where φ* = arg min_{φ∈E} D(φ||P_s) is the distribution in E closest to P_s. Sanov's
theorem yields two important implications. It states that when estimating the
probability of a set of rare events, we only need to consider the most likely of
these rare events (Fig. 2.10). It also states that the probability of rare events
decreases exponentially with the divergence between the rare event (its type)
and the true distribution.
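A small Monte Carlo illustration (ours, with arbitrarily chosen parameters) of the exponential behavior predicted by Sanov's theorem for the rare set E = {q : q(a_0) ≥ t}, assuming P_s(a_0) < t:

import numpy as np

def _kl_bits(p, q):
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

def sanov_illustration(p_s, threshold=0.4, N=25, trials=200000, seed=0):
    # Returns (empirical probability that the type of N samples falls in E,
    # the Sanov exponent 2^(-N D(phi*||P_s)), ignoring polynomial factors).
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        counts = rng.multinomial(N, p_s)
        if counts[0] / N >= threshold:
            hits += 1
    # phi*: closest point of E to P_s -- put mass `threshold` on symbol 0 and
    # rescale the remaining symbols proportionally to P_s.
    phi_star = np.array(p_s, dtype=float)
    phi_star[1:] *= (1.0 - threshold) / phi_star[1:].sum()
    phi_star[0] = threshold
    return hits / trials, 2.0 ** (-N * _kl_bits(phi_star, p_s))

For instance, sanov_illustration(np.array([0.2, 0.3, 0.5])) returns both quantities; increasing N makes both shrink exponentially, which is the point of the theorem.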

2.4.1 Problem Statement

We base our discussion on the road tracking work of Coughlan et al. [42], which,
in turn, is based on a previous work by Geman and Jedynak. The underlying
problem to solve is not the key question here, since the main objective
of their work is to propose a Bayesian framework and, from this framework,
analyze when this kind of problem may be solved. They also focus on studying
the probability that the A∗ algorithm yields an incorrect solution.
Geman and Jedynak's road tracking is a Bayesian inference problem
in which a single road must be detected in the presence of clutter. Rather than
detecting the whole road at once, it is split into a set of equally long segments.
From an initial point and direction, and if the road's length is N segments, all the
Q^N possible routes are represented as a search tree (see Fig. 2.11), where Q is the
tree's branching factor. There is only one target route, and thus in the worst
case, the search complexity is exponential. The rest of the paths in the tree are
considered distractor paths.
We now briefly introduce the problem's notation. Each route in the tree is
represented by a set of movements {t_i}, where t_i ∈ {b_v}. The set {b_v} forms
an alphabet of size Q corresponding to the Q possible alternatives at each segment's

Fig. 2.11. Left: search tree with Q = 3 and N = 3. This search tree represents all
the possible routes from initial point (at the bottom of the tree) when three types of
movement can be chosen at the end of each segment: turning 15◦ to the left, turning
15◦ to the right and going straight. Right: Search tree divided into different sets: the
target path (in bold) and N subsets F1 , . . . , FN . Paths in F1 do not overlap with the
target path, paths in F2 overlap with one segment, and so on.

end (turn 15° left, turn 15° right, and so on). Each route has an associated
prior probability given by

p(\{t_i\}) = \prod_{i=1}^{N} p_G(t_i)    (2.17)

where p_G is the prior probability of each transition. From now on, we assume
that all transitions are equiprobable. A set of movements {t_i} is represented
by a set of tree segments X = {x_1, . . . , x_N}. Considering \mathcal{X} the set of all
Q^N tree segments, it is clear that X ∈ \mathcal{X}. Moreover, an observation y_x is
made for each x ∈ \mathcal{X}, with Y = {y_x : x ∈ \mathcal{X}}. In road tracking systems,
observation values are obtained from a filter previously trained with road and
non-road segments. Consequently, the distributions p_{on}(y_x) and p_{off}(y_x) give
the probability of y_x being obtained from a road or a non-road segment. Each
route {t_i} with segments {x_i} is associated with a set of observations {y_{x_i}}
that take values from an alphabet {a_μ} of size J.
Geman and Jedynak formulate road tracking in Bayesian Maximum a
Posteriori (MAP) terms:

p(X|Y) = \frac{p(Y|X)\, p(X)}{p(Y)}    (2.18)
where the prior is given by

p(X) = \prod_{i=1}^{N} p_G(t_i)    (2.19)
and

p(Y|X) = \prod_{x \in X} p_{on}(y_x) \prod_{x \in \mathcal{X}/X} p_{off}(y_x)
       = \prod_{i=1..N} \frac{p_{on}(y_{x_i})}{p_{off}(y_{x_i})} \prod_{x \in \mathcal{X}} p_{off}(y_x)
       = F(Y) \prod_{i=1..N} \frac{p_{on}(y_{x_i})}{p_{off}(y_{x_i})}

In the latter equation, F(Y) = \prod_{x \in \mathcal{X}} p_{off}(y_x) is independent of X and can
be ignored. In order to find the target route, p(Y|X)p(X) must be maximized.

2.4.2 A∗ Road Tracking

Coughlan et al. redesigned the algorithm to be solved by means of an A∗
search. The aim of the algorithm is to search for the route {t_i} with associated
observations {y_i} that maximizes a reward function. This route may
be different from the one given by the MAP estimate. We define the reward function as

r(\{t_i\}, \{y_i\}) = \sum_{i=1}^{N} \log\frac{p_{on}(y_i)}{p_{off}(y_i)} + \sum_{i=1}^{N} \log\frac{p_G(t_i)}{U(t_i)}    (2.20)

where y_i = y_{x_i} and U is the uniform distribution, so

\sum_{i=1}^{N} \log U(t_i) = -N \log Q    (2.21)

The reward function may also be expressed as

r(\{t_i\}, \{y_i\}) = N\, \vec{\phi} \cdot \vec{\alpha} + N\, \vec{\psi} \cdot \vec{\beta}    (2.22)

where the components of the vectors \vec{\alpha} and \vec{\beta} are

\alpha_\mu = \log \frac{p_{on}(a_\mu)}{p_{off}(a_\mu)}, \quad \mu = 1, \ldots, J    (2.23)

\beta_v = \log \frac{p_G(b_v)}{U(b_v)}, \quad v = 1, \ldots, Q    (2.24)

and \vec{\phi} and \vec{\psi} are normalized histograms:

\phi_\mu = \frac{1}{N} \sum_{i=1}^{N} \delta_{y_i, a_\mu}, \quad \mu = 1, \ldots, J    (2.25)

\psi_v = \frac{1}{N} \sum_{i=1}^{N} \delta_{t_i, b_v}, \quad v = 1, \ldots, Q    (2.26)

It must be noted that the Kronecker delta function \delta_{i,j} is used in the
latter equations.
Coughlan et al. address this question: given a specific road tracking prob-
lem, may the target road be found by means of MAP? Answering this question
amounts to estimating the probability that the reward of some distractor path
is higher than that of the target path; if this probability is high, the problem cannot be
solved by any algorithm. For instance, let us consider the probability distri-
bution p_{1,max}(r_{max}/N) of the maximum normalized reward of all paths in F_1


Fig. 2.12. Three cases of p1,max (rmax /N ) (solid line) vs. p̂T (rT /N ) (dashed line).
In the first case, the reward of the target path is higher than the largest distractor
reward, and as a consequence the task is straightforward. In the second case, the
task is more difficult due to the overlapping of both distributions. In the last case,
the reward of the target path is lower than the largest distractor rewards. Thus, it
is impossible to find the target path.

Fig. 2.13. Road tracking (black pixels) from the initial point (at the top of each graph)
in the presence of increasing clutter. The value of the parameter K decreases from
left to right, the first example being a simple problem and the last one an almost
impossible one.

(see Fig. 2.11) and the probability distribution p̂_T(r_T/N) of the normalized
reward of the target path. Figure 2.12 compares both. Using a method similar
to Sanov's theorem, it is possible to obtain a parameter K given by

K = D(p_{on}||p_{off}) + D(p_G||U) - \log Q    (2.27)

that determines whether the task can be solved. When K > 0, p̂_T(r_T/N) lies to the
right of p_{1,max}(r_{max}/N); thus, finding the target path is straightforward. If
K ≈ 0, both distributions overlap and detecting the target path is not as
simple. Finally, when K < 0, p̂_T(r_T/N) lies to the left of p_{1,max}(r_{max}/N), and
it is impossible to find the target path. An example is shown in Fig. 2.13.
The parameter K intuitively describes the difficulty of the task. For in-
stance, the term D(p_{on}||p_{off}) is directly related to the filter quality. The higher
this divergence is, the easier it is to discriminate road and non-road segments.
Clearly a better filter facilitates the road tracking task. The term D(p_G||U),
on the other hand, refers to the a priori information that is known about
the shape of the road. If p_G = U, there is no prior information. Finally, Q
measures the number of distractor paths present. Therefore, K increases in
the case of a low-error road detection filter, high a priori information,
and few distractor paths.
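A direct transcription of Eq. 2.27 (ours; natural logarithms are used, any base works if applied consistently):

import numpy as np

def order_parameter_K(p_on, p_off, p_G, Q):
    # Eq. 2.27: K = D(p_on||p_off) + D(p_G||U) - log Q.
    # K > 0: the target path should be detectable; K < 0: it is not.
    def kl(p, q):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        m = p > 0
        return np.sum(p[m] * np.log(p[m] / q[m]))
    U = np.full(Q, 1.0 / Q)                    # uniform geometric prior
    return kl(p_on, p_off) + kl(p_G, U) - np.log(Q)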

2.4.3 A∗ Convergence Proof

Coughlan et al. adopt an inadmissible heuristic that converges faster than
admissible heuristics, but may lead to incorrect results. This heuristic is
based on two constants, H_L, related to the likelihood, and H_P, related to the
geometric prior. Thus, a length-M subpath is assigned a heuristic value of
(N - M)(H_L + H_P). Due to the fact that the N(H_L + H_P) term does not vary
for any subpath, it is dropped. This makes -M(H_L + H_P) the heuristic in this problem.
In order to demonstrate the validity of the A∗ algorithm results, two reward
functions must be defined: a reward for length M < N subpaths in the tree
that completely overlap with the target path (S_{on}(M)), and a reward for
length N paths in the tree that do not overlap at all with the target path
(S_{off}(N)):

S_{on}(M) = M\{\vec{\phi}_{on} \cdot \vec{\alpha} - H_L\} + M\{\vec{\psi}_{on} \cdot \vec{\beta} - H_P\}    (2.28)

S_{off}(N) = N\{\vec{\phi}_{off} \cdot \vec{\alpha} - H_L\} + N\{\vec{\psi}_{off} \cdot \vec{\beta} - H_P\}    (2.29)

The following demonstration adopts the Bhattacharyya heuristic for the
A∗ algorithm. It simplifies the analysis and helps prove stronger convergence
results. The heuristic should be lower than the normalized reward obtained
from the target path and higher than the normalized reward obtained from
any distractor path. If we consider H_L (the analysis is similar in the case
of H_P), the expected reward for the target path is D(p_{on}||p_{off}), and it is
-D(p_{off}||p_{on}) for distractor paths, and as a consequence:

-D(p_{off}||p_{on}) < H_L < D(p_{on}||p_{off})    (2.30)

If we intend to use Sanov's theorem, the heuristic must be expressed as
the expected reward of data distributed according to the geometric mixture
of p_{on}, p_{off} given by p_\lambda(y) = p_{on}^{1-\lambda}(y)\, p_{off}^{\lambda}(y)/Z[\lambda] (Z[\lambda] being a \lambda-dependent
normalization constant). Taking \lambda = 1/2 (that is, the Bhattacharyya heuris-
tic), we are choosing a heuristic H_L^* = \sum_y p_{\lambda=1/2}(y) \log \frac{p_{on}(y)}{p_{off}(y)} that is midway
between the target and the distractors. Expressing the Bhattacharyya heuristic in
reward notation (see Eq. 2.22) yields

HL∗ = φBh α, HP∗ = ψ Bh β (2.31)

where
1 1 1 1
{pon (y)} 2 {poff (y)} 2 {pG (t)} 2 {U (t)} 2
φBh (y) = , ψBh (t) = (2.32)
Zφ Zψ

and Zφ , Zψ are normalization constants.
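A minimal sketch (ours) of the Bhattacharyya histogram and the resulting heuristic H_L^*, assuming strictly positive histograms p_on and p_off:

import numpy as np

def bhattacharyya_heuristic(p_on, p_off):
    # Eqs. 2.31-2.32: phi_Bh(y) proportional to sqrt(p_on(y) p_off(y)), and
    # H_L* = sum_y phi_Bh(y) log(p_on(y)/p_off(y)).
    p_on, p_off = np.asarray(p_on, dtype=float), np.asarray(p_off, dtype=float)
    phi_bh = np.sqrt(p_on * p_off)
    phi_bh /= phi_bh.sum()                     # normalization by Z_phi
    return float(np.sum(phi_bh * np.log(p_on / p_off)))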


We need to summarize two theorems, the proofs of which may be read
elsewhere, in order to come to further conclusions in this section. The first one
puts an upper bound on the probability that any false segment is searched, and
32 2 Interest Points, Edges, and Contour Grouping

is based on the fact that A∗ algorithm searches the segment with the highest
reward. This first theorem states that the probability that A∗ searches the
last segment of a particular subpath An,i is less or equal to P {∃m : Soff (n) ≥
Son (m)}. The second theorem bounds this probability by something that can
be evaluated, and states that:


P {∃m : Soff (n) ≥ Son (m)} ≤ P {Soff ≥ Son (m)} (2.33)
m=0

The probability that A∗ yields incorrect results is given by P\{S_{off}(n) \ge
S_{on}(m)\}. The correctness of the algorithm can be analyzed by searching for a bound
on this probability by means of Sanov's theorem. The bound is given by

P\{S_{off}(n) \ge S_{on}(m)\} \le \{(n+1)(m+1)\}^{J^2 Q^2}\, 2^{-(n\Psi_1 + m\Psi_2)}    (2.34)

where \Psi_1 = D(\phi_{Bh}||p_{off}) + D(\psi_{Bh}||U) and \Psi_2 = D(\phi_{Bh}||p_{on}) +
D(\psi_{Bh}||p_G). The first step to prove this assertion is to define E, the
set of histograms corresponding to partial off paths with higher reward than
the partial on path:

E = \{(\phi_{off}, \psi_{off}, \phi_{on}, \psi_{on}) : n\{\phi_{off} \cdot \alpha - H_L^* + \psi_{off} \cdot \beta - H_P^*\}    (2.35)
    \ge m\{\phi_{on} \cdot \alpha - H_L^* + \psi_{on} \cdot \beta - H_P^*\}\}    (2.36)

By Sanov's theorem, we can give a bound in terms of the \phi_{off}, \psi_{off}, \phi_{on}, and
\psi_{on} that minimize

f(\phi_{off}, \psi_{off}, \phi_{on}, \psi_{on}) = n D(\phi_{off}||p_{off}) + n D(\psi_{off}||U)    (2.37)
    + m D(\phi_{on}||p_{on}) + m D(\psi_{on}||p_G)    (2.38)
    + \tau_1 \{\sum_y \phi_{off}(y) - 1\} + \tau_2 \{\sum_y \phi_{on}(y) - 1\}    (2.39)
    + \tau_3 \{\sum_t \psi_{off}(t) - 1\} + \tau_4 \{\sum_t \psi_{on}(t) - 1\}    (2.40)
    + \gamma \{ m\{\phi_{on} \cdot \alpha - H_L^* + \psi_{on} \cdot \beta - H_P^*\}    (2.41)
    - n\{\phi_{off} \cdot \alpha - H_L^* + \psi_{off} \cdot \beta - H_P^*\} \}    (2.42)

where the \tau's and \gamma are Lagrange multipliers. The function f is convex, and
so it has a unique minimum. It must be noted that f can be split into
four terms of the form n D(\phi_{off}||p_{off}) + \tau_1 \{\sum_y \phi_{off}(y) - 1\} - n\gamma\, \phi_{off} \cdot \alpha, coupled by
shared constants. These terms can be minimized separately:

\phi_{off}^* = \frac{p_{on}^{\gamma} p_{off}^{1-\gamma}}{Z[1-\gamma]}, \quad \phi_{on}^* = \frac{p_{on}^{1-\gamma} p_{off}^{\gamma}}{Z[\gamma]}, \quad \psi_{off}^* = \frac{p_G^{\gamma} U^{1-\gamma}}{Z_2[1-\gamma]}, \quad \psi_{on}^* = \frac{p_G^{1-\gamma} U^{\gamma}}{Z_2[\gamma]}    (2.43)

subject to the constraint given by Eq. 2.36. The value \gamma = 1/2 yields the
unique solution. Moreover, \gamma = 1/2 yields

\phi_{off}^* \cdot \alpha = H_L^* = \phi_{on}^* \cdot \alpha, \quad \psi_{off}^* \cdot \beta = H_P^* = \psi_{on}^* \cdot \beta    (2.44)

The solution occurs at \phi_{on}^* = \phi_{off}^* = \phi_{Bh} and \psi_{on}^* = \psi_{off}^* = \psi_{Bh}.
Thus, substituting into the Sanov bound gives Eq. 2.34. Another conclusion
that can be extracted from Eq. 2.34 is that the probability of exploring a
distractor path to depth n falls off exponentially with n:


P {∃m : Soff (n) ≥ Son (m)} ≤ P {Soff (n) ≥ Son (m)} (2.45)
m=0

C2 (Ψ2 )2−nΨ1
2
Q2
≤ (n + 1)J (2.46)

where

 2
Q2 −mΨ2
C2 (Ψ2 ) = (m + 1)J 2 (2.47)
m=0

2.5 Junction Detection and Grouping

This section introduces a feature extraction algorithm by Cazorla et al. [36],


which is based on the concepts and algorithms described in the two previous
sections. The algorithm integrates junction detection and junction grouping.
The aim of junction grouping is to remove false positives and detect false
negatives. It may also be applied to obtain a mid-level representation of the
scene.

2.5.1 Junction Detection

Based on the Kona junction representation, junctions are modeled as piecewise
constant regions called wedges. A junction may be described by means of a
parametric model (see Fig. 2.14) θ = (x_c, y_c, r, M, {θ_i}, {T_i}), where (x_c, y_c)
and r are the center and the radius of the junction, respectively, M is the
number of wedges incident with the center, {θ_i} are the wedge limits, and {T_i}
are the intensity distributions of the wedges. The SUSAN algorithm detects center
candidates (x_c, y_c): it selects all candidates for which the number of close
neighbors sharing their intensity is below a given threshold. The radius r is left
as a free parameter that must be chosen by the user. In order to obtain the set
of wedge limits {θ_i}, the algorithm computes the averaged accumulated contrast
for each θ ∈ [0, 2π] along the radius in that direction:

\tilde{I}_\theta = \frac{1}{r} \sum_{i=1}^{N} l_i I_i    (2.48)


Fig. 2.14. Left: junction parametric model. Right: radius along direction θi
discretized as a set of segments l_i. (Figure by Cazorla and Escolano © 2003 IEEE.)

Fig. 2.15. Top left: an example image. Top right: value of the log-likelihood ratio
log(pon /poff ) for all pixels. Bottom: magnitude and orientation of the gradient. In
the case of orientation, gray is value 0, white is π and black is −π. (Courtesy of
Cazorla.)

where N is the number of segments l_i needed to discretize the radius in di-
rection θ_i and I_i is the intensity of segment l_i (see Fig. 2.14).
Wedge limits may be found by looking for peaks in Eq. 2.48. However, a better
approach to wedge detection is given by the log-likelihood test seen in Section
2.3.1:

\tilde{I}_\theta = \frac{1}{r} \sum_{i=1}^{N} l_i \log \frac{p_{on}(I_i)}{p_{off}(I_i)}    (2.49)
Examples shown in Fig. 2.15 demonstrate that the filter can be improved by
adding gradient information. Thus, each pixel is represented as \vec{I}_i = (I_i, θ_i),
where I_i is the gradient magnitude and θ_i is a local estimation of the real
orientation θ* in which the pixel lies. Now, the orientations for which Eq.
2.50 is peaked and over a given threshold are selected as wedge limits:

\tilde{I}_\theta = \frac{1}{r} \sum_{i=1}^{N} l_i \log \frac{p_{on}(\vec{I}_i|\theta^*)}{p_{off}(\vec{I}_i)}    (2.50)

The p_{on} and p_{off} distributions in Eq. 2.50 are given by

p_{on}(\vec{I}_i|\theta^*) = p_{on}(I_i)\, p_{ang}(\theta_i - \theta^*)    (2.51)
p_{off}(\vec{I}_i) = p_{off}(I_i)\, U(\theta_i)    (2.52)

where U is the uniform distribution and p_{ang}(\theta_i - \theta^*) is the probability that
θ_i is the correct orientation. Although p_{ang} may be empirically estimated, in
this case it is simply assumed to peak when the difference is 0 or π. Finally, junctions given
by Eq. 2.50 are pruned if M < 2, or if M = 2 and the relative orientation of the two
wedges is close to 0 or π. Figure 2.16 shows some examples of application.
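A rough sketch (ours) of the accumulated log-likelihood contrast of Eq. 2.49 along each direction, assuming unit-length segments, a circle lying fully inside the image, and p_on/p_off given as 256-entry lookup arrays over gray levels:

import numpy as np

def wedge_profile(image, xc, yc, radius, p_on, p_off, n_angles=360):
    # Eq. 2.49: averaged accumulated log-likelihood contrast along the radius
    # for each direction theta, with unit-length segments (l_i = 1).
    profile = np.zeros(n_angles)
    for k in range(n_angles):
        theta = 2.0 * np.pi * k / n_angles
        acc = 0.0
        for i in range(1, int(radius) + 1):
            px = int(round(xc + i * np.cos(theta)))
            py = int(round(yc + i * np.sin(theta)))
            g = int(image[py, px])
            acc += np.log(p_on[g] / p_off[g])
        profile[k] = acc / radius
    return profile    # wedge limits: peaks of this profile above a threshold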

2.5.2 Connecting and Filtering Junctions

The junction detection method introduced in the previous section is completed
with a connecting-path search step. These paths connect junction pairs along
the edges between them. A connecting path P of length L, with initial point at
(x_c, y_c) and wedge limit θ, is a set of segments p_1, p_2, . . . , p_L that may have
variable or fixed length. Path curvature should be smooth, and thus we also
define orientations α_1, α_2, . . . , α_{L-1}, where α_j = θ_{j+1} - θ_j is the angle between
segments p_{j+1} and p_j.
The algorithm that searches for paths is based on the Bayesian A∗ explained in
Section 2.4. From a junction center (x_c^0, y_c^0) and given a wedge orientation θ_0,
the algorithm follows a search tree in which each segment p_j can expand Q
branches, and thus there are Q^N possible paths. We recall the fact that the
method described in Section 2.4 does not search for the best path, but for a unique
path in clutter. The optimal path P* is the one that maximizes

E(\{p_j, \alpha_j\}) = \sum_{j=1}^{L} \log \frac{p_{on}(p_j)}{p_{off}(p_j)} + \sum_{j=1}^{L-1} \log \frac{p_G(\alpha_{j+1} - \alpha_j)}{U(\alpha_{j+1} - \alpha_j)}    (2.53)

the first term being the intensity reward and the second term the geometric
reward. The log-likelihood in the first term for a fixed-length segment of length F is
given by

\log \frac{p_{on}(p_j)}{p_{off}(p_j)} = \frac{1}{F} \sum_{i=1}^{N} l_i \log \frac{p_{on}(\vec{I}_i|\theta^*)}{p_{off}(\vec{I}_i)}    (2.54)
36 2 Interest Points, Edges, and Contour Grouping

Fig. 2.16. Examples of junction detection. (Courtesy of Cazorla.)

Regarding the second term, p_G(\alpha_{j+1} - \alpha_j) models a first-order Markov
chain of orientation variables \alpha_j:

p_G(\alpha_{j+1} - \alpha_j) \propto \exp\left( -\frac{C}{2A} |\alpha_{j+1} - \alpha_j| \right)    (2.55)

where A is the maximum angle between two consecutive segments and C
models the path curvature. Finally, the uniform distribution is included in the geometric
reward in order to express both terms of Eq. 2.53 in the same range of values.
The algorithm evaluates the last L_0 segments of the path and prunes the path
when the corresponding average rewards fall below given thresholds:
\frac{1}{L_0} \sum_{j=zL_0}^{(z+1)L_0 - 1} \log \frac{p_{on}(p_j)}{p_{off}(p_j)} < T    (2.56)

\frac{1}{L_0} \sum_{j=zL_0}^{(z+1)L_0 - 1} \log \frac{p_G(\alpha_{j+1} - \alpha_j)}{U(\alpha_{j+1} - \alpha_j)} < \hat{T}    (2.57)

given the threshold constraints (see Section 2.4):

-D(p_{off}||p_{on}) < T < D(p_{on}||p_{off})    (2.58)
-D(U||p_G) < \hat{T} < D(p_G||U)    (2.59)

The thresholds may only be selected in a range of valid values. We recall
here some of the properties given in Section 2.2.3. First of all, higher threshold
values prune more segments than lower, more conservative thresholds. On the
other hand, the divergence between the p_{on} and p_{off} distributions (and similarly,
the divergence between U and p_G) also affects the amount of pruned seg-
ments. If these distributions are similar, the range of valid values is narrower,
and as a consequence, the amount of pruned segments decreases.
In order to decrease the computational burden, an additional, inadmissible
rule is introduced. This rule prioritizes the stability of longer paths, which have
survived more prunings, over shorter paths. Let L_best be the length of the highest-reward
partial path found. All paths of length L_j for which L_best - L_j > zL_0 are
pruned. The parameter z models the minimum allowed length difference between
the best partial path and the rest. Lower z values remove more paths, increas-
ing the risk of discarding the true target path.
The junction connection algorithm searches a path from each junction
center following each unvisited wedge limit, marking them as visited. The
path exploration step finishes when the center (xfc , ycf ) of another junction is
in the neighborhood of the last segment of the best partial path. This last
segment is linked to the wedge limit θf of the destination junction, and is
marked as visited. Marking a wedge limit as visited prevents the algorithm
to look for a path from it. If the A∗ finishes and a junction is not found, two
possible outcomes are defined, depending on the length of the last explored
path. If it is below L0 , the algorithm discards it. Otherwise, the last position
is stored and marked as terminal point, and this terminal point may link with
another path in future steps of the algorithm. The algorithm finishes when
there are no unvisited wedge limits. Unconnected junctions are considered
false positives and removed. Similar images to the ones shown in Fig. 2.17
produce a 50% of false junctions; 55% of these false positives are removed
due to junction grouping. As stated before, junction grouping also provides a
mid-level visual representation of the scene that may be useful for high-level
vision tasks.
38 2 Interest Points, Edges, and Contour Grouping

Fig. 2.17. Examples of application of the connecting path search algorithm. (Cour-
tesy of Cazorla.)

Problems
2.1 Understanding entropy
The algorithm by Kadir and Brady (Alg. 1) relies on a self-dissimilarity mea-
sure between scales. This measure is aimed at making possible the direct
comparison of entropy values, in the case of different image region sizes. In
general, and given a pixel on an image, how does its entropy vary with re-
spect to the pixel neighborhood size? In spite of this general trend, think of
an example in which the entropy remains almost constant in the range of
2.5 Junction Detection and Grouping 39

scales between smin and smax . In this last case, would the Kadir and Brady
algorithm select this pixel to be part of the most salient points on the image?
2.2 Symmetry property
Symmetry is one of the properties of entropy. This property states that entropy
remains stable if data ordering is modified. This property strongly affects
the Kadir and Brady feature extractor, due to the fact that it may assign
the same saliency value to two visually different regions, if they share the
same intensity distribution. Give an example of two visually different and
equally sized isotropic regions, for which p(0) = p(255) = 0.5. For instance,
a totally noisy region and a region splitted into two homogeneous regions.
Although Eq. 2.2 assigns the same saliency to both regions, which one may
be considered more visually informative? Think of a modification of Kadir
and Brady algorithm that considers the symmetry property.
2.3 Entropy limits
Given Eq. 2.2, show why homogeneous image regions have minimum saliency.
In which cases will the entropy value reach its maximum?
2.4 Color saliency
Modify Alg. 1 in order to adapt it to color images. How does this modification
affect to the algorithm complexity? Several entropy estimation methods are
presented through next chapters. In some of these methods, entropy may be
estimated without any knowledge about the underlying data distribution (see
Chapter 5). Is it possible to base Alg. 1 on any of these methods? And how
this modification affects complexity?
2.5 Saliency numeric example
The following table shows the pixel intensities in a region extracted from
an image. Apply Eq. 2.2 to estimate the entropy of the three square shape
regions (diameters 3, 5 and 7) centered on the highlighted pixel. Did you find
any entropy peak? In this case, apply self-dissimilarity to weight this value
(see Alg. 1).

0 200 200 200 200 200 0


0 0 200 0 0 0 0
0 0 200 200 0 0 200
0 0 0 0 0 0 200
0 0 200 200 0 0 200
200 0 0 0 0 0 200
0 0 0 0 0 200 200
2.6 Conditional entropy as filter evaluator
Let φ1 and φ2 be two local filters applied to edge detection. The output of
both filters for any pixel is a discrete value in the set {0, 1, 2}. Each pixel is
labeled using an alphabet α = {0, 1}; pixels belonging to an edge are labeled
as 1, and pixels that are not part of any edge are labeled as 0. The following
40 2 Interest Points, Edges, and Contour Grouping

table shows the real label of six pixels, and the output of both filters for each
of them. Evaluate φ1 and φ2 by means of conditional entropy. Which filter
discriminates better between on edge and off edge pixels?

α φ1 φ2
0 0 0
0 1 2
0 1 2
1 1 1
1 2 0
1 2 1

2.7 Kullback–Leibler divergence as filter evaluator


Plot the pon and poff distributions corresponding to the two filters φ1 and φ2
from Prob. 2.6. Perform an evaluation of both filters, based on Kullback–
Leibler divergence and Bhattacharyya distance. Is any of these two mea-
sures more informative than the other one with respect to the quality of
the filters? Check that Kullback–Leibler divergence is not symmetric, that is,
D(pon ||poff ) = D(poff ||pon ). Think of a simple divergence measure, based on
Kullback–Leibler, which satisfies the symmetry property.

2.8 Conditional entropy and not informative classifiers


Conditional entropy may also be expressed as

H(Y |X) = H(Y, X) − H(X) (2.60)

where the joint entropy is



H(X, Y ) = − p(x, y) log p(x, y) (2.61)
x y

Conditional entropy, H(Y |X), is a measure of the remaining unpredictabil-


ity of Y given X. Thus, H(Y |X) = H(Y ) is true only if X and Y are indepen-
dent variables. In the case of edge detection, conditional entropy is a measure
of the unpredictability of a pixel class (edge or non-edge), given a filter. Know-
ing all these facts, design the worst possible filter, a filter that does not give
any information about the class of any pixel.

2.9 Sanov’s theorem


Apply Sanov’s theorem to estimate the possibility of obtaining more than 800
sixes when throwing 1,000 fair dices.

2.10 MAP and road tracking


Check, using Eq. 2.27, if the road tracking problem can be solved using any
of the two filters shown in Prob. 2.6. Suppose a branching factor of 3 and
pG = U .
2.6 Key References 41

2.6 Key References

• T. Kadir and M. Brady, “Scale, Saliency and Image Description”, Inter-


national Journal of Computer Vision, 45(2): 83–105 (2001)
• S. Konishi, A. Yuille and J. Coughlan, “A Statistical Approach to Multi-
scale Edge Detection”, Image and Vision Computing, 21(1): 37–48 (2003)
• J. M. Coughlan and A. Yuille, “Bayesian A∗ Tree Search with Ex-
pected O(N) Node Expansions for Road Tracking”, Neural Computation,
14(8):1929–1958 (2002)
• M. Cazorla, F. Escolano, D. Gallardo and R. Rizo, “Junction Detection
and Grouping with Probabilistic Edge Models and Bayesian A∗ ”, Pattern
Recognition, 35(9):1869–1881 (2002)
• M. Cazorla and F. Escolano, “Two Bayesian Methods for Junction Detec-
tion”, IEEE Transaction on Image Processing, 12(3):317–327 (2003)
3
Contour and Region-Based Image
Segmentation

3.1 Introduction
One of the most complex tasks in computer vision is segmentation. Seg-
mentation can be roughly defined as optimally segregating the foreground
from the background, or by finding the optimal partition of the image into
its constituent parts. Here optimal segregation means that pixels (or blocks
in the case of textures) in the foreground region share common statistics.
These statistics should be significantly different from those corresponding to
the background. In this context, active polygons models provide a discrimina-
tive mechanism for the segregation task. We will show that Jensen–Shannon
(JS) divergence can efficiently drive such mechanism. Also, the maximum en-
tropy (ME) principle is involved in the estimation of the intensity distribution
of the foreground.
It is desirable that the segmentation process achieves good results (com-
pared to the ones obtained by humans) without any supervision. However,
such unsupervision only works in limited settings. For instance, in medical
image segmentation, it is possible to find the contour that separates a given
organ in which the physician is interested. This can be done with a low de-
gree of supervision if one exploits the IT principle of minimum description
length (MDL). It is then possible to find the best contour, both in terms of
organ fitting and minimal contour complexity. IT inspires methods for finding
the best contour both in terms of segregation and minimal complexity (the
minimum description length principle).
There is a consensus in the computer vision community about the fact
that the maximum degree of unsupervision of a segmentation algorithm is
limited in the purely discriminative approach. To overcome these limitations,
some researchers have adopted a mixed discriminative–generative approach to
segmentation. The generative aspect of the approach makes hypotheses about
intensity or texture models, but such hypotheses are contrasted with discrimi-
native (bottom-up) processes. Such approaches are also extended to integrate

F. Escolano et al., Information Theory in Computer Vision and Pattern Recognition, 43



c Springer-Verlag London Limited 2009
44 3 Contour and Region-Based Image Segmentation

segmentation with recognition. In both cases (bottom-up segmentation and/or


recognition), information theory provides a formal framework to make these
tasks feasible with real images.

3.2 Discriminative Segmentation with Jensen–Shannon


Divergence
3.2.1 The Active Polygons Functional
Concerning the problem of capturing the contour of a region by means of an
active contour [93], it is necessary to (i) devise a proper discrete represen-
tation of the contour, (ii) design differential equations that implement the
motion flow of the contour points toward their optimal placement (the one
minimizing a given energy function), and (iii) remove, as much as possible,
the initialization-dependent behavior of the process. Region competition [184]
unifies several region-based segmentation approaches by integrating statistics-
based motion flows, smoothness-preserving flows, dynamics for region growing,
and merging criteria.1 Focusing here on the use of statistical models within
active contours, a more evolved treatment of the question, besides the use
of a simplistic polygonal discretization of the contour, consists in the Active
Polygons [162, 163] computational model. This new model is driven by an
information theory measure: the Jensen divergence.
In continuous terms, a contour is Γ : [a, b] ⊂ R → R2 defined around
some region R ⊂ R2 . Considering there exists a function f : R2 → R2 , the
divergence theorem for the plane states that the integral of f inside R is
equivalent to the integral of F · n, being f = ∇ · F = ∂F1 /∂x + ∂F2 /∂y the
divergence, with F = (F1 , F2 )T , and n the unit normal to Γ , that is
  
f (x, y)dxdy = F · nds (3.1)
R Γ =∂R

where ds is the Euclidean arc-length element, and the circle in the right-
hand integral indicates that the curve is closed (the usual assumption), that
is, if we define p as a parameterization of the curve, then p ∈ [a, b] and
Γ (a) = Γ (b). Being t = (dx, dy)T a vector pointing tangential along the
curve, and assuming
 that the curve is positively-oriented (counterclockwise),
we have ds = dx + dy 2 . As the unit normal vector is orthogonal to the
2

tangential one, we have n = (dy, −dx)T and ||n|| = 1 = ds.


Alternatively, we may reformulate Eq. 3.1 in terms of the parameterization
with p, and denote it E(Γ ):
 b
E(Γ ) = (F · n) ||Γp ||dp (3.2)
a   
ds
1
Region competition contemplates also the MDL (Minimum Description Length)
information-theoretic principle, which will be tackled later in this chapter.
3.2 Discriminative Segmentation with Jensen–Shannon Divergence 45

where ||Γp || = ||∂Γ/∂p|| = x2p + yp2 . As the unit tangent t may be defined in
terms of Γp /||Γp ||, the unit normal n may be similarly defined in the following
terms:  
0 1
n = Jt with J = (3.3)
−1 0
and consequently, we have
n||Γp || = JΓp (3.4)
and, thus, we may express E(Γ ) in the following terms:
 b
E(Γ ) = (F · JΓp )dp (3.5)
a

However, we need an expression for the gradient-flow of E(.), that is, the
partial derivatives indicating how the contour is changing. In order to do so,
we need to compute the derivative of E(.), denoted by Et (.), with respect to
the contour. This can be usually done by exploiting the Green’s theorem (see
Prob. 3.2.). Setting a = 0, b = 1:
 1
1
Et (Γ ) = (Ft · JΓp + F · JΓpt )dp (3.6)
2 0

where Γpt is the derivative of Γp . Integrating by parts the second term and
applying the chain rule on the derivatives of F, we remove Γpt :
 1
1
Et (Γ ) = ((DF)Γt · JΓp − (DF)Γp · JΓp )dp (3.7)
2 0

DF being the 2 × 2 Jacobian of F. Then, isolating Γt in the left side of the


dot product, we have:

1 1
Et (Γ ) = (Γt · [JT (DF)T − JT (DF)]Γp )dp (3.8)
2 0

and exploiting the fact that the matrix [.] is antisymmetric (A is antisym-
metric when A = −AT ), its form must be ωJ due to the antisymmetric
operator J:
 
1 1 1
Et (Γ ) = (Γt · ωJΓp )dp = (Γt · ωJΓp )ds (3.9)
2 0 2 Γ

Finally, setting ω = ∇ · F = 2f we obtain



Et (Γ ) = (Γt · ωn)ds (3.10)
Γ
46 3 Contour and Region-Based Image Segmentation

which implies that the gradient flow (contour motion equation) is given by

Γt = f n (3.11)

Given the latter gradient flow, it is now necessary to translate it to a closed


polygon defined by n vertices V1 , . . . , Vn belonging to R2 . First of all, we must
parameterize the contour Γ , which is associated to the polygon, by p ∈ [0, 1].
Then, Γ (p) = L(p − p, V p , V p +1 ), being L(t, A, B) = (1 − t)A + tB
a parameterized line between vectors A and B (vertices in this case). The
indices for the vertices are module n, which means that V0 = Vn . Also,
Γp (p) = V p +1 − V p . Now it is possible to have a polygonal version of
Eq. 3.5:  1
Et (Γ ) = (f (Γ (p))(Γt · J(Γp (p))dp (3.12)
0
where the gradient flow for each vertex Vk is given by
 1
∂Vk
= pf (L(p, Vk−1 , Vk ))dp nk,k−1
∂t 0
 1
+ (1 − p)f (L(p, Vk , Vk+1 ))dp nk+1,k
0
(3.13)

nk,k−1 and nk+1,k being the outward normals to edges Vk−1 , Vk and
Vk , Vk+1 , respectively. These equations allow to move the polygon as shown
in Fig. 3.1 (top). What is quite interesting is that the choice of a speed function
f adds flexibility to the model, as we will see in the following sections.

3.2.2 Jensen–Shannon Divergence and the Speed Function

The key element in the vertex dynamics described above is the definition of
the speed function f (.). Here, it is important to stress that such dynamics
are not influenced by the degree of visibility of an attractive potential like
the gradient in the classical snakes. On the contrary, to avoid such myopic
behavior, region-based active contours are usually driven by statistical forces.
More precisely, the contour Γ encloses a region RI (inside) of image I whose
complement is RO = I \ RI (outside), and the optimal placement Γ ∗ is the
one yielding homogeneous/coherent intensity statistics for both regions. For
instance, assume that we are able to measure m statistics Gj (.) for each region,
and let uj and vj , j = 1, . . . , m be respectively the expectations E(Gj (.))
of such statistics for the inside and outside regions. Then, the well-known
functional of Chan and Vese [37] quantifies this rationale:
m  
1 
E(Γ ) = (Gj (I(x, y) − uj )2 dxdy + (Gj (I(x, y) − vj )2 dxdy
2|I| j=1 RI RO
(3.14)
3.2 Discriminative Segmentation with Jensen–Shannon Divergence 47
0

-50

-100

-150

-200

0 50 100 150

Fig. 3.1. Segmentation with active polygons. Top: simple, un-textured object.
Bottom: complex textured object. Figure by G. Unal, A. Yezzi and H. Krim (2005
c
Springer).

which is only zero when both terms are zero. The first term is nonzero when
the background (outer) intensity/texture dominates the interior of the con-
tour; the second term is nonzero when the foreground (inner) texture domi-
nates the interior of the contour; and both terms are nonzero when there is
an intermediate domination. Thus, the latter functional is zero only in the
optimal placement of Γ (no-domination).
From the information-theory point of view, there is a similar way of for-
mulating the problem. Given N data populations (in our case N = 2 inten-
sity/texture populations: in and out), their disparity may be quantified by the
generalized Jensen–Shannon divergence [104]:
N 
 N
JS = H ai pi (ξ) − ai H(pi (ξ)) (3.15)
i=1 i=1
N
H(.) being the entropy, ai the prior probabilities of each class, i=1 ai = 1,
pi (ξ) the ith pdf (corresponding to the ith region) of the random variable ξ
(in this case pixel intensity I). Considering the Jensen’s inequality
48 3 Contour and Region-Based Image Segmentation
  
ax a γ(xi )
γ i i ≤ i (3.16)
ai ai

γ(.) being a convex function and the ai positive weights, it turns out that, if
X is a random variable, γ(E(X)) ≤ E(γ(X)). In the two latter inequalities,
changing to ≥ turns
 convexity into concavity. Let then P be the mixture of
distributions P = i ai pi . Then, we have

H(P ) = − P log P dξ
   


=− ai pi log ai pi dξ
i i
  
1
= ai pi dξ log
i
P
  1

= ai pi log dξ
i
P
   
1 1
= ai pi log dξ + pi log dξ
i
pi P

= ai (H(pi ) + D(pi ||P )) (3.17)
i

where D(.||.) denotes, as usual, the Kullback–Leibler divergence. Therefore


   
   
JS = H ai pi − ai H(pi ) = ai D pi || ai pi (3.18)
i i i i

As the KL divergence is nonnegative, this is also the case of JS. Therefore we


have  
 
H ai pi ≥ ai H(pi ) (3.19)
i i

and exploiting Jensen’s inequality (Eq. 3.16), we obtain the concavity of the
entropy. In addition, the KL-based definition of JS provides an interpreta-
tion of the Jensen–Shannon divergence as a weighted sum of KL divergences
between individual distributions and the mixture of them.
For our particular case with N = 2 (foreground and background distri-
butions), it is straightforward that JS has many similarities with Eq. 3.14
in terms of providing a useful divergence for contour-driving purposes. Fur-
thermore, JS goes theoretically beyond Chan and Vese functional because
the entropy quantifies higher order statistical interactions. The main prob-
lem here is that the densities pi (ξ) must be estimated or, at least, properly
3.2 Discriminative Segmentation with Jensen–Shannon Divergence 49

approximated. Thus, the theoretical bridge between JS and the Chen and
Vese functional is provided by the following optimization problem:


p (ξ) = arg max − p(ξ) log p(ξ)dξ
p(ξ)

s.t. p(ξ)Gj (ξ)dξ = E(Gj (ξ)), j = 1, . . . , m

p(ξ)dξ = 1 (3.20)

where the density estimation p∗ (ξ) is the maximum entropy constrained to the
verification of the expectation equations by each of the statistics. Thus, this
is the first application in the book of the principle of maximum entropy [83]:
given some information about the expectations of a set of statistics character-
izing a distribution (this is ensured by the satisfaction of the first constraint),
our choice of the pdf corresponding to such distribution (the second constraint
ensures that the choice is a pdf) must be the most neutral one among all the
p(ξ) satisfying the constraints, that is, the more uninformed one that is equiv-
alent to the maximum entropy distribution. Otherwise, our choice will be a
pdf with more information than we have actually specified in the expectations
constraints, which indicates all what is available for making the inference: our
measurements E(Gj (ξ)).
The shape of the optimal pdf is obtained by constructing and deriving the
Lagrangian (after assuming the natural logarithm):
  
L(p, Λ) = − p(ξ) log p(ξ)dξ + λ0 p(ξ)dξ − 1


m  
+ λj p(ξ)Gj (ξ)dξ − E(Gj (ξ))
j=1

∂L m
= − log p(ξ) − 1 + λ0 + λj Gj (ξ) (3.21)
∂p j=1

Λ = (λ0 , . . . , λm ) being the m + 1 Lagrange multipliers. Then, seeking for the


maximum implies

m
0 = − log p(ξ) − 1 + λ0 + λj Gj (ξ)
j=1

m
log p(ξ) = −1 + λ0 + λj Gj (ξ)
j=1
m
p∗ (ξ) = e−1+λ0 + j=1 λj Gj (ξ)
(3.22)
where the multipliers are chosen in such a way that the constraints are satis-
fied. For instance, if we know λ1 , . . . , λm , it is straightforward to find λ0 if we
apply its constraint, and again considering the natural logarithm, we obtain
50 3 Contour and Region-Based Image Segmentation

1= p(ξ)∗ dξ
 m
1= e−1+λ0 + j=1 λj Gj (ξ) dx
 
m
1 = (e−1 eλ0 ) e j=1 λj Gj (ξ) dx
  
Z(Λ,ξ)

log 1 = (λ0 − 1) + log Z(Λ, ξ)


1 − λ0 = log Z(Λ, ξ) (3.23)
Z(Λ, ξ) being the so-called partition function, which is used to normalize the
maximum entropy distribution. Therefore, we obtain the typical expression of
such distribution: m
1
p∗ (ξ) = e j=1 λj Gj (ξ) (3.24)
Z(Λ, ξ)
The main problem is how to obtain the multipliers λ1 , . . . , λm . This is usually
done through iterative methods (there is as many nonlinear equations as con-
straints) as we will see along the book. Alternatively, some researchers have
addressed the approximation of the maximum entropy (ME) distribution and
the entropy itself. For instance, Hyvärinen [77] assumes that the ME distri-
bution is not very far from a Gaussian distribution with the same mean and
variance. Actually, the second Gibbs theorem states that the ME distribu-
tion among all the ones with the same mean and variance is the Gaussian
one. Consequently, assuming that the random variable ξ has been standard-
ized (zero mean and unit variance),
√ we may assume that the ME distribution
is close to ϕ(ξ) = exp(−ξ 2 /2)/ 2π. This implicitly means that we must add
two additional functions Gm+1 (ξ) = ξ and Gm+2 (ξ) = ξ 2 , and their respective
constraints. Furthermore, in order to reduce the nonlinear equations derived
from the constraints to linear ones, one may also choose Gj (.) that are or-
thogonal among them and also to all polynomials of second degree. Thus, the
new set of constraints, for j, i = 1, . . . , m, are the following:

1 if i = j
ϕ(ξ)Gj (ξ)Gi (ξ)dξ =
0 otherwise

ϕ(ξ)Gj (ξ)ξ k dξ = 0 for k = 0, 1, 2 . (3.25)

Thus, assuming near-Gaussianity leads to


2 m
p∗ (ξ) = Z̃ −1 ϕ(ξ)eλm+1 ξ+(λm+2 + 2 )ξ + j=1 λj Gj (ξ)
1
(3.26)

where Z̃ −1 = 2πZ −1 . The latter equation may be simplified a little bit more
if we take the first-order Taylor expansion ex ≈ 1 + x
⎛ ⎞
  m
1
p∗ (ξ) ≈ Z̃ −1 ϕ(ξ) ⎝1 + λm+1 ξ + λm+2 + ξ2 + λj Gj (ξ)⎠ (3.27)
2 j=1
3.2 Discriminative Segmentation with Jensen–Shannon Divergence 51

The orthogonalization
 assumption yields the linearization of the constraints.
For instance, for p∗ (ξ)dξ = 1, we have

 

−1 ⎜
1 = Z̃ ⎝ ϕ(ξ)dξ +λm+1 ϕ(ξ)ξdξ +
     
1 0

   
1
m

+ λm+2 + ϕ(ξ)ξ 2 dξ + λj ϕ(ξ)Gj (ξ)dξ ⎟ ⎠ (3.28)
2
   j=1   
1 0

therefore
  
1
1 = Z̃ −1 1 + λm+2 + (3.29)
2
and, similarly, we also obtain

p∗ (ξ)ξdξ = Z̃ −1 λm+1 = 0

p∗ (ξ)ξ 2 dξ = Z̃ −1 (1 + 3(λm+2 + 1/2)) = 1

p∗ (ξ)Gj (ξ)dξ = Z̃ −1 λj = E(Gj (ξ)), j = 1, . . . , m (3.30)

Then, we obtain: Z̃ −1 = 1,λm+1 = 0,λm+2 = −1/2, and λj = E(Gj (ξ)) (that


is, the λj are estimated from the samples). Consequently, the final expression
for the pdf is ⎛ ⎞

m
p∗ (ξ) = ϕ(ξ) ⎝1 + uj Gj (ξ)⎠ (3.31)
j=1

where uj = E(Gj (ξ)).


In addition, the entropy H(p∗ (ξ)) can be approximated by (i) exploiting
Eq. 3.31, (ii) taking into account the Taylor expansion (1 + x) log(1 + x) =
x + x2 /2, (iii) grouping terms for expressing the entropy of ϕ(ξ) in the first
term, and (iv) applying the orthogonality constraints. Such approximation is

1 2
m

H(p (ξ)) = − p∗ (ξ) log p∗ (ξ)dξ ≈ H(ν) − u (3.32)
2 j=1 j

where ν = ϕ(ξ). At this point of the section, we have established the math-
ematical basis for understanding how to compute efficiently the Jensen–
Shannon (JS) divergence between two regions (N = 2 in Eq. 3.15):

JS = H(a1 p∗1 (ξ) + a2 p∗2 (ξ)) − a1 H(p∗1 (ξ)) − a2 H(p∗2 (ξ)) (3.33)
52 3 Contour and Region-Based Image Segmentation

Here is the key to clarify the following approximation:

P = a1 p∗1 (ξ) + a2 p∗2 (ξ)


⎛ ⎛ ⎞ ⎛ ⎞⎞
m 
m
≈ ⎝a1 ϕ(ξ) ⎝1 + uj Gj (ξ)⎠ + a2 ϕ(ξ) ⎝1 + vj Gj (ξ)⎠⎠
j=1 j=1
⎛ ⎛ ⎞ ⎛ ⎞⎞

m 
m
= ϕ(ξ) ⎝a1 ⎝1 + uj Gj (ξ)⎠ + a2 ⎝1 + vj Gj (ξ)⎠⎠
j=1 j=1
⎛⎛ ⎞ ⎛ ⎞⎞

m 
m
= ϕ(ξ) ⎝⎝a1 + a1 uj Gj (ξ)⎠ + ⎝a2 + a2 vj Gj (ξ)⎠⎠
j=1 j=1
⎛ ⎞

m
= ϕ(ξ) ⎝(a1 + a2 ) + [Gj (ξ)(a1 uj + a2 vj )]⎠ (3.34)
  
j=1
1
 
where uj = RI Gj (I(x, y))/|RI |dxdy and vj = RO Gj (I(x, y))/|RO |dxdy.
Then, exploiting the same rationale for deriving Eq. 3.32, we obtain

1
m
H(P ) ≈ H(ν) − (a1 uj + a2 vj )2 (3.35)
2 i=1

Finally, we obtain the approximated JS:


 m
ˆ = H(ν) − 1
JS (a1 uj + a2 vj )2
2 i=1
⎛ ⎞ ⎛ ⎞
1 m
1 m
−a1 ⎝H(ν) − u2 ⎠ − a2 ⎝H(ν) − v2 ⎠
2 j=1 j 2 j=1 j

1 m
= a1 a2 (uj − vj )2 (3.36)
2 j=1

The negative of JSˆ is denoted as an energy function E to minimize. Setting


a1 = |RI |/|I| and a2 = |RO |/|I|, we have that incorporating also the variations
ˆ with respect to the sizes of each area, we obtain
of JS

1  |RI ||RO |
m
ˆ = ∂Γ
∇JS = (2(uj − vj )(∇uj − ∇vj ))n
∂t 2 j=1 |I|2

1  |RI | 1  |RO |
m m
+ (uj − vj )2 n − (uj − vj )2 n, (3.37)
2 j=1 |I|2 2 j=1 |I|2
  
m |RI |−|RO |
1
2 j=1 |I|2
(uj −vj )2 n
3.3 MDL in Contour-Based Segmentation 53

being the partial derivatives of ui and uj with respect to the contour

Gj (I(x, y) − uj )
∇uj = n
|RI |
Gj (I(x, y) − vj )
∇vj = − n, (3.38)
|RO |

and n the outward unit normal. Then, replacing the latter partial derivatives
in Eq. 3.37, taking into account that |I| = |RI | + |RO |, and rearranging terms,
we finally obtain
∇JSˆ = ∂Γ = f n (3.39)
∂t
where

1 
m
f= (uj − vj )((Gj (I(x, y)) − uj ) + (Gj (I(x, y)) − vj )) (3.40)
2|I| j=1

f being the gradient flow of the Chan and Vese functional (Eq. 3.14). Thus f is
defined in the terms described above, and the connection between JS and the
contour dynamics is established. This f is the one used in Fig. 3.1 (bottom)
when we use as generator functions such as G1 (ξ) = ξe−ξ /2 , G2 (ξ) = e−ξ /2
2 2

and G3 (ξ) = |ξ|.

3.3 MDL in Contour-Based Segmentation

3.3.1 B-Spline Parameterization of Contours

Considering the snake-inspired definition of a closed contour: Γ (t) =


(x(t), y(t)), being, for instance, t ∈ [0, 2π], in practice, the latter arc-length
range is discretized and we understand Γ = {(x(t), y(t))T : t = (Ni2π −1) , i =
0, . . . , N − 1} as an ordered sequence of, say N , 2D points. What is more
interesting, from the point of view of the complexity analysis of the contour,
is that given the latter N points (samples), its continuous version Γ (t) may
be inferred from the samples by means of either Fourier analysis [149] or
spline methods [6, 23]. These methods allow to parameterize the contours by
implicitly introducing a smoothing constraint when a small number of pa-
rameters are selected (terms or control points, respectively). The important
question here is to infer the optimal number of parameters at the same time
that the image is segmented, that is, the contour is optimally placed.
For the sake of simplicity, let us discuss here one of the latter parame-
terizations, for instance, the B-spline one [58] (although both methods are
explored in [57]). Let B M = {BM k (t) : k = 0, . . . , K − M − 1} a set of NB
54 3 Contour and Region-Based Image Segmentation

B-splines, where NB = K − M , and let {t0 ≤ t1 ≤ . . . ≤ tK } be a set of knots.


Each BMk (t) is a polynomial of typically low order M ≥ 2 (degree M − 1):

M −1 t − tk −1 tk+M − t
BM
k (t) = Bk (t) + BM
k+1 (t)
tk+M −1 − tk tk+M − tk+1
1 if tk ≤ t ≤ tk+1
B1k (t) = (3.41)
0 otherwise

whose support is [tk , tk+M ] and are smoothly joined at knots, that is, M − 2
continuous derivatives exist at the joints, and the (M − 2)th derivative must
be equal in the
 joint (C M −2 continuity). B-splines satisfy BM
k (t) ≥ 0 (non-
negativity), k Bk (t) = 1 for t ∈ [tM , tNB ] (partition of unity), and period-
M

icity. Finally, the NB B-splines define a nonorthogonal basis for a linear space.
With regard to the latter, one-dimensional functions are uniquely defined as
B −1
N
y(t) = k (t) t ∈ [tM −1 , tNB ]
ck BM (3.42)
k=0

where ck ∈ R are the so-called control points, where the kth control point
influences the function only for tk < t < tk+M . If we have NB = K −M control
points and K knots, then the order of the polynomials of the basis is exactly
K −NB = M , that is, we have B M (for instance M = 4 yields cubic B-splines).
For instance, in Fig. 3.2 (left), we have a B-spline composed from NB = 7
cubic (M = 4) basis functions and K = 11 knots (K − NB = M ), where
t0 = t1 = t2 = t3 = 0, t4 = 1.5, t5 = 2.3, t6 = 4, and t7 = t8 = t9 = t10 = 5.
All the control points are 0 except c3 = 1. Therefore, we have a plot of B43 (t)
Furthermore, as the multiplicity of 0 and 5 is both 4, in practice, we have

0.7 9
8
0.6 P2
7
0.5 6
P3
5
0.4
4
y

0.3 3
2
0.2
1 Bspline
controls
0.1 P1 0 polygon
P4 samples
−1
0 −4 −3 −2 −1 0 1 2 3 4 5
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 x

Fig. 3.2. Left: a degenerated one-dimensional example y(t) = B43 (t) with the four
cubic polynomials P1 . . . P4 describing it. Right: a 2D B-spline basis with M = 4
(cubic) for describing a contour (in bold). Such continuous contour is given by the
interpolation of N = 11 samples (first and last coincide) that are indicated by
∗ markers. There are NB = 11 control points, although two points of them are
(0, 0)T , indicated by circular markers and we also draw the control polygon.
3.3 MDL in Contour-Based Segmentation 55

only NB = 4 basis functions. If we do not use the B-form, we can also see
the four polynomials P1 . . . P4 used to build the function and what part of
them is considered within each interval. As stated above, the value of y(t) is
0 for t < t0 = 0 and t > t10 = 5, and the function is defined in the interval
[tM −1 = t3 = 0, tNB = t7 = 5].
Following [58], we make the cubic assumption and drop M for the sake
of clarity. Thus, given a sequence of N 2D points Γ = {(xi , yi )T : i =
0, 1, . . . , N − 1}, and imposing several conditions, such as the periodicity cited
above, the contour Γ (t) = (x(t), y(t))T is inferred through cubic-spline inter-
polation, which yields both the knots and the 2D control points ck = (cxk , cyk )T .
Consequently, the basis functions can be obtained by applying Eq. 3.41. There-
fore, we have
  B −1
N NB −1  x 
x(t) ck
Γ (t) = = ck Bk (t) = Bk (t) t ∈ [tM −1 , tNB ]
y(t) cyk
k=0 k=0
(3.43)
In Fig. 3.2 (right), we show the continuous 2D contour obtained by interpolat-
ing N = 11 samples with a cubic B-spline. The control polygon defined by the
control points is also showed. The curve follows the control polygon closely.
Actually, the convex hull of the control points contains the contour. In this
case, we obtain 15 nonuniformly spaced knots, where t0 = . . . = t3 = 0 and
t11 = . . . = t15 = 25.4043.
In the latter example of 2D contour, we have simulated the period-
icity of the (closed) contour by adding a last point equal to the first.
Periodicity in B-splines is ensured by defining bases satisfying BM k (t) =
+∞
B M
j=−∞ k+j(tK −t0 ) (t) : j ∈ Z, being t K − t 0 the period. If we have K+1
knots, we may build a B-splines basis of K functions B0 , . . . , BK−1 simply by
constructing B0 and shifting this function assuming a periodic knot sequence,
that is, tj becomes tjmodK . Therefore, tK = t0 , so we have K distinct knots.
Here, we follow the simplistic assumption that knots are uniformly spaced.
The periodic basis functions for M = 2 and M = 4 are showed in Fig. 3.3. In
the periodic case, if we set K, a closed contour can be expressed as
  
K−1
x(t)
Γ (t) = = ck Bk (t) t ∈ R (3.44)
y(t)
k=0

As in practice what we have is a discrete description of the contour, our


departure point is something like two columns with the x and y coordinates
of the, say N points:
⎛ ⎞ ⎛ ⎞
x0 y0 x(s0 ) y(s0 )
⎜ .. ⎟ = ⎜ ⎟
Γ = (x y) = ⎝ ... . ⎠ ⎝
..
.
..
. ⎠ (3.45)
xN −1 yN −1 x(sN −1 ) y(sN −1 )
56 3 Contour and Region-Based Image Segmentation
1 0.7
2 2 2 2
B0 B1 B2 B3 4 2 4 4
0.9 B3 B0 B1 B2
0.6
0.8
0.7 0.5

0.6
0.4
B(t)

B(t)
0.5
0.4 0.3

0.3 0.2
0.2
0.1
0.1
0 0
0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
tk tk

Fig. 3.3. Periodic basic functions for M = 2 (left) and for M = 4 (right). In both
cases, the K = 5, distinct knots are t0 = 1 . . . t4 = 4, and we show the K − 1
functions; those with a nonzero periodic fragment are represented in dashed lines.

where, under the assumption of periodical basis and uniformly spaced K


distinct knots, we may set tj = j, j = 0, . . . , K − 1, and thus then, si =
iK/N, i = 0, . . . , N − 1 are the mapped positions of the discrete points in the
contour parameterization. Therefore, we have the following two sets of associ-
ation pairs: X = {(si , xi )} and Y = {(si , yi )}, with i = 0, . . . , N − 1 for both
cases. How to obtain the control points θ (K) = (cT0 . . . cTK−1 )T = (θ x(K) θ y(K) )?
It is key to build the N × K matrix B(K) , which will satisfy

Γ = B(K) θ (K) ≡ {x = B(K) θ x(K) , y = B(K) θ y(K) } (3.46)

The columns of B(K) conceptually consist in the discretization, accordingly


to the si , of each of the K (cubic) basic functions: Bij = Bj (si ). This matrix
is independent both on x and y, and needs only to be calculated once. For
instance, the least squares solution to x = B(K) θ x(K) consists of
x
θ̂ (K) = arg min ||x − B(K) θ (K) ||2 = (BT(K) B(K) )−1 BT(K) x (3.47)
θ (K)   
B†(K)

B†(K) being the pseudo-inverse, and the inverse (BT(K) B(K) )−1 always exists.
Furthermore, an approximation of the original x can be obtained by
x x
x̂(K) = B(K) θ̂ (K) = B(K) B†(K) θ̂ (K) (3.48)
  
B⊥
(K)

where B⊥(K) is the so-called projection matrix because x̂(K) is the projection
of x onto the range space of K dimensions of B(K) : R(B(K) ). Such space is
3.3 MDL in Contour-Based Segmentation 57

1
0.2
0.5
0
0 −0.2
9 9
8 8
7 7
6 6
5 5
4 4
3 3 0
0 40 20
2
40 20 2 80 60
80 60 120 100
1 140 120 100 1 140
180 160 180 160

0.8
0.8
0.7 K=N−1
K=19
0.7 K=9
0.6
0.6
0.5
0.5
yi

y(t)

0.4
0.4
0.3
0.3

0.2 0.2

0.1 0.1
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
xi x(t)

Fig. 3.4. Top: structure of B(K) (left) for N = 174 samples and K − 1 = 9 basis,
and structure of its orthogonal matrix spanning the range space. In both cases, each
row is represented as a strip. Bottom: contours inferred for different values of K(see
text).

the one expanded by an orthonormal basis with the same range of B(K) . The
structure of both B(K) and the corresponding orthonormal matrix is showed
in Fig. 3.4 (top-left and top-right), respectively.
Applying to y and θ x(K) the above rationale and fusing the resulting equa-
tions, we have that
x y
θ̂ (K) = (θ̂ (K) θ̂ (K) ) = (B†(K) x B†(K) y) = B†(K) Γ
Γ̂ = (x̂ ŷ) = (B⊥ ⊥ ⊥
(K) x B(K) y) = B(K) θ̂ (K) (3.49)

Given, for example, N = 174 samples of a 2D contour (Fig. 3.4, bottom


left), we may use the method described above for obtaining approximate con-
tours for different values of K (Fig. 3.4, bottom right). For K = N − 1, we
obtain a pretty good approximation. However, such approximation does not
differ too much from the one obtained with K = 19 control points. Finally,
the upper concavity of the contour is lost, and also the upper convexity is
58 3 Contour and Region-Based Image Segmentation

smoothed, with K = 9. Thus, two key questions remain: What is the optimal
value of K?, and What is the meaning of optimal in the latter question? As
we will see next, information theory will shed light on these questions.

3.3.2 MDL for B-Spline Parameterization

B-splines with K basis represent a model Mk (say of order K) for encoding a


representative contour of N hand-drawn 2D points (the crude data D). The
minimum description length (MDL) principle introduced by Rissannen [138–
140] states that the optimal instance of a model of any order M is the one
that minimizes2 :
L(D|M) + L(M) (3.50)
where L(.) denotes the length, in bits, of the description of the argument.
Formally, a description method C is a way of mapping the source symbols or
sequences x of one alphabet A with sequences of symbols C(x) (codewords) of
another alphabet, typically B = {0, 1}, in a way that not two different sources
may be assigned to the same codeword. A code is a description method in
which each source is associated to at most one codeword.Prefix or instanta-
neous codes are those wherein no codeword can be a prefix of another one.
Let LC (a), a ∈ A denote the length in bits of C(a). Then, given the alphabet
A = {1, 2, . . . , m} and a prefix code C whose codewords are defined over the
alphabet B of length D, the following (Kraft) inequality is satisfied [43]:

D−LC (a) ≤ 1 (3.51)
a∈A

and conversely, given a set of codewords satisfying this inequality, there exists
a prefix code with these lengths. Moreover, when there is a probability dis-
tribution associated to the alphabet P (a), the codewords’ lengths
 satisfying
the Kraft inequality and minimizing the expected length a∈A (a)LC (a)
P
are LC (a) = − logD P (a) = logD P (a)
1
, and the expected length is exactly
the entropy under logD if we drop the rounding up to achieve integer lengths.
Such code, the Shannon–Fano one (see the allocation of codewords in [43] –
pp. 101–103), allows to relate lengths and probabilities, which, in turn, is
key to understand MDL in probabilistic terms. Therefore, Eq. 3.50 can be
rewritten in the following terms:

− log P (D|M) + L(M) (3.52)

so MDL relies on maximizing P (D|M), which is the probability density of


D given M . In the case of contours, it is easy to see that, as D ≡ Γ , the

2
This is, in the Grünwald terminology [67], the crude two-part version of MDL.
3.3 MDL in Contour-Based Segmentation 59

probability of observing Γ , given an instance of a model M ≡ θ (K) , can be


described by a Gaussian with white noise with given covariance Σ. Factorizing
P (D|M), we have that

P (Γ |θ (K) , Σ) = P (x|θ x(K) , σx2 ) · P (y|θ y(K) , σy2 ) , (3.53)

where, considering z = {x, y} in the following, we have that

P (Γ |θ z(K) , σz2 ) = G(z − B(K) θ z(K) , σz2 )



2 −N
||z − B(K) θ z(K) ||2
≡ (2πσz ) 2 exp − (3.54)
2σz2

which is consistent with the maximum likelihood estimation through least


z
squares: θ̂ (K) = arg maxθ(K) {log P (Γ |θ z(K) , σz2 )} = B†(K) Γ . However, the
MDL estimation is posed in the following terms:

(θ̂ (K) , σ̂x2 , σ̂y2 ) = arg min 2 ,σ 2


{− log P (Γ |θ (K) , σx2 , σy2 ) + L(θ (K) , σx2 , σy2 )}
K,θ (K) ,σx y

(3.55)
although we may rewrite L(θ (K) , σx2 , σy2 ) = L(θ (K) ) if we assume that the
variances have constant description lengths. The MDL model order selection
problem consists in estimating
 ! "
K ∗ = arg min L(θ (K) ) + min
2
min
x
(− log P (x|θ x(K) , σx2 ))
K σx θ (K)
! "
+ min min (− log P (y|θ y(K) , σy2 )) (3.56)
2σy θy
(K)

Any of the two minimization problems between brackets may be solved as


follows:
! "
minσz2 min
z
(− log P (z|θ z(K) , σz2 ))
θ (K)
! z "
N 2
||z − B(K) θ̂ (K) ||2
= min log(2πσz ) +
σz2 2 2σz2
! ⊥
"
N ||z − B (K) z|| 2
= min log(2πσz2 ) +
σz2 2 2σz2
# $
N 2 ||z − ẑ(K) ||2 N
= min log(2πσz ) + = log(2πσ̂z2 (K)e) (3.57)
σz2 2 2σz2 2
  
f (σz2 )
60 3 Contour and Region-Based Image Segmentation

where σ̂z2 (K) = arg minσz2 f (σz2 ) ≡ ||z − ẑ(K) ||2 /N , which is consistent with
being the optimal variance dependant on the approximation error. Therefore,
defining σ̂x2 (K) and σ̂y2 (K) in the way described below, we have

N
K ∗ = arg min L(θ (K) ) + (log(2πσ̂x2 (K)e) + log(2πσ̂y2 (K)e))
K 2

N
= arg min L(θ (K) ) + (log(σ̂x2 (K)) + log(σ̂y2 (K)))
K 2
% & '(
= arg min L(θ (K) ) + N log σ̂x2 (K)σ̂y2 (K) (3.58)
K

As we have seen above, the Shannon–Fano coding is finally related to the


goodness of fitting the data with the model of order K, but what about the
definition of L(θ (K) )? The simplistic assumption is considering that each pa-
rameter has a fixed description length, say λ, and thus, L(θ (K) ) = λK, but
what is the more convenient value for λ? In this regard, another simplistic
assumption is to adopt its asymptotical value for large N : 12 log N . However,
this asymptotical setting is not valid for control points because they do not
depend too much on the number of hand-drawn points. However, considering
that the parameters (control points) can be encoded with a given precision,
let δ z be the difference between the parameters with two different precisions,
being the more precise θ (K) . Then, we define the error between the two ver-
sions of z: z = B(K) δ z . Thus, maxi (i ) ≤ ξ, where ξ is the maximal absolute
error for any of the coordinates. Considering finally that the control points
will be inside the image plane, of dimensions Wx × Wy in pixels, the following
criterion (simplified from [58]):

Wx Wy
λ = log + = log(Wx Wy ) and, thus L(θ (K) ) = K log(Wx Wy )
ξ ξ
(3.59)

with ξ = 1 for discrete curves, reflects the fact that an increment of image
dimensions is translated into a smaller fitting precision: the same data in a
larger image need less precision. Anyway, the latter definition of λ imposes a
logarithmically smoothed penalization to the increment of model order.

3.3.3 MDL Contour-based Segmentation

Given the latter B-spline adaptive (MDL-based) model for describing con-
tours, the next step is how to find their ideal placement within an image,
for instance, in the ill-defined border of a ultrasound image. Given an image
I of Wx × Wy pixels containing an unknown contour Γ = B(K) θ (K) , pixel
intensities are the observed data, and, thus, their likelihood, given a contour
(hypothesis or model), is defined as usual P (I|θ (K) , Φ), Φ being the (also un-
known) parameters characterizing the intensity distribution of the image. For
3.3 MDL in Contour-Based Segmentation 61

instance, assuming that the image consists of an object in the foreground,


which is contained in the background, Φ = (Φin , Φout ) where Φin and Φout
characterize, respectively, the homogeneous (in terms of intensity model) re-
gions corresponding to the foreground and the background. This setting is
adequate for medical image segmentations wherein we are looking for cavi-
ties defined by contours, and the intensity models (distributions) will depend
on the specific application. Anyway, assuming that the pixel distribution is
independent of the contour, the latter likelihood may be factorized as follows:
⎛ ⎞
 
P (I|θ (K) , Φ) = ⎝ P (Ip |θ (K) , Φin ) · P (Ip |θ (K) , Φout )⎠ (3.60)
p∈I(Γ ) p∈O(Γ )

where Ip is the intensity at pixel p = (i, j), and I(Γ ) and O(Γ ) denote,
respectively, the regions inside and outside the closed contour Γ . Therefore,
the segmentation problem can be posed in terms of finding
) *
(θ̂ K ∗ , Φ̂) = arg min − log P (I|θ (K) , Φ) + K log(Wx W y) (3.61)
K,θ (K) ,Φ

or equivalently in model-order complexity terms



) *
K ∗ = arg min K log(Wx Wy ) − max log P (I|θ (K) , Φ) (3.62)
K θ (K) ,Φ

Apparently, an adequate computational strategy for solving the latter opti-


mization problem consists of devising an algorithm for solving the inner max-
imization one for a fixed K and then running this algorithm within a range
of K values. However, given that Φ is also unknown, the maximization al-
gorithm must be partitioned into two intertwined algorithms: Contour-fitting
and Intensity-inference. In the first one, K and Φ are fixed and we obtain
θ̂ (K) . In the second algorithm, K is fixed and both the refinement of θ̂ (K) and
Φ are obtained.

Algorithm 2: GPContour-fitting
Input: I, K, Φ, a valid contour Γ̂ (0) ∈ R(B(K) ), and a stepsize 
Initialization Build B(K) , compute B⊥ (K) , and set t = 0.
while ¬ Convergence(Γ̂ (t) ) do
Compute the gradient: δΓ ← ∇ log P (I|θ (K) , Φ)|Γ =Γ̂ ( t)
Project the gradient onto R(B(K) ): (δΓ )⊥ ← B⊥ (K) δΓ
Update the contour (gradient ascent): Γ̂ (t+1) ← Γ̂ (t) + (δΓ )⊥
t←t+1
end
Output: Γ̂ ← Γ̂ (t)
62 3 Contour and Region-Based Image Segmentation

Algorithm 3: MLIntensity-inference
Input: I, K, and a valid contour Γ̂ (0) ∈ R(B(K) )
Initialization Set t = 0.
while ¬ Convergence(Φ̂(t) , Γ̂ (t) ) do
Compute the ML estimation
% Φ̂(t) given Γ̂ (t) : (
(t)
Φ̂in = arg maxΦin p∈I(Γ̂ (t) ) P (Ip |θ (K) , Φin )
% (
(t)
Φ̂out = arg maxΦout p∈O(Γ̂ (t) ) P (Ip |θ (K) , Φout )

Run the fitting algorithm: Γ̂ (t + 1) =GPContour-fitting(I, K, Φ̂(t) , Γ̂ (t) )


t←t+1
end
Output: Φ̂ ← Φ̂(t) and Γ̂ ← Γ̂ (t)

Contour-fitting Algorithm

Given K and Φ, we have to solve the following problem:


 
) * maxΓ {P (I|Γ, Φ)}
max log P (I|θ (K) , Φ) = (3.63)
θ (K) ,Φ s.t. Γ ∈ R(B(K) )

that is, we must maximize the likelihood, but the solutions must be con-
strained to those contours of the form Γ = B(K) θ (K) , and thus, belong to
the range space of B(K) . Such constrained optimization method can be solved
with a gradient projection method (GPM) [20]. GPMs consist basically in
projecting successively the partial solutions, obtained in the direction of the
gradient, onto the feasible region. In this case, we must compute in the t−th
iteration the gradient of the likelihood δΓ = ∇ log P (I|θ (K) , Φ)|Γ =Γ̂ ( t) . Such
gradient has a direction perpendicular to the contour at each point of it (this
is basically the search direction of each contour point, and the size of this 
window one-pixel wide defines the short-sightness of the contour). Depend-
ing on the contour initialization, and also on , some points in the contour
may return a zero gradient, whereas others, closer to the border between the
foreground and the background, say p pixels, may return a gradient of mag-
nitude p (remember that we are trying to maximize the likelihood along the
contour). However, as this is a local computation for each contour point, the
global result may not satisfy the constraints of the problem. This is why δΓ is
projected onto the range space through (δΓ )⊥ = B⊥ (K) δΓ , and then we apply
the rules of the usual gradient ascent. The resulting procedure is in Alg. 2.

Intensity-inference Algorithm

This second algorithm must estimate a contour Γ̂ in addition to the region pa-
rameters Φ, all for a fixed K. Therefore, it will be called Alg. 3. The estimation
of Φ depends on the image model assumed. For instance, if it is Gaussian, then
3.4 Model Order Selection in Region-Based Segmentation 63

2 2
we should infer Φin = (μin , σin ) and Φout = (μout , σout ); this is easy to do if
we compute these parameters from the samples. However, if the Rayleigh dis-
tribution is assumed, that is, P (Ip |θ (K) , Φ = σ 2 ) = (Ip /σ 2 ) · exp{−Ip /(2σ 2 )}
(typically for modelling speckle noise in ultrasound images), then we should
2 2
infer Φin = (σin ) and Φout = (σout ) (also from the samples). Given Alg. 3, we
are able to obtain both θ̂ (K) and Φ̂ for a fixed K. Then, we may obtain the
second term of Eq. 3.62:
) *
max log P (I|θ (K) , Φ) = log P (I|θ̂ (K) , Φ̂)
θ (K) ,Φ

|I(Γ̂ )| |O(Γ̂ )|
∝− 2
log(σ̂in (K)) − 2
log(σ̂out (K))
2 2
1 % (
=− 2
log(σ̂in 2
(K)σ̂out (K))|I|
2
1
= 2 2 (K))Wx Wy
(3.64)
log(σ̂in (K)σ̂out

independently of having a Gaussian or Rayleigh intensity model.

MDL Contour fitting and intensity inference

Once we have an algorithm for solving a fixed K, the MDL solution may be
arranged as running this algorithm for a given range of K and then selecting
the K ∗ . Thus, it is important to exploit the knowledge not only about the type
of images to be processed (intensity), but also about the approximate com-
plexity of their contours (in order to reduce the range of exploration for the
optimal K). In Fig. 3.5, we show some summarizing results of the technique
described above.

3.4 Model Order Selection in Region-Based


Segmentation
3.4.1 Jump-Diffusion for Optimal Segmentation

Optimal segmentation is an open problem. A main issue is the model or-


der selection. It consists of knowing the optimal number of classes. There is a
trade-off between the simplicity of the model and its precision. Complex mod-
els fit better the data; however, we often prefer the simplest model that can
describe data with an acceptable precision. George Box stated in Robustness
in Statistics (1979), “All models are wrong, but some are useful.” In com-
puter vision and pattern recognition, unnecessarily complex models are not
practical because they offer a poor generalization over new patterns.
The jump-diffusion strategy aims to solve both the problem of adjusting a
model to the data, and selecting the model order. “Jump” refers to the latter,
64 3 Contour and Region-Based Image Segmentation

Fig. 3.5. Results of the MDL segmentation process. Top: synthetic image with
same variances and different means between the foreground and background, and
the optimal K (Courtesy of Figueiredo). Bottom: experiments with real medical
images and their optimal K. In both cases, initial contours are showed in dashed
lines. Figure by M.A.T. Figueiredo, J.M.N. Leitao and A.K. Jain (2000
c IEEE).
3.4 Model Order Selection in Region-Based Segmentation 65

while “diffusion” refers to fitting the models. It is an optimization algorithm


that iteratively simulates two types of moves [64, 65, 141]:
• Reversible jumps: move between subspaces of different dimensions, driven
by a Metropolis–Hastings sampler. These jumps change both the type of
region model for each segment, as well as the total number of models.
• Stochastic diffusions: Stochastic steepest descent described by the
Langevin equations within each continuous subspace. This descent con-
sists of adjusting the parameters of each region model (selected in the last
jump) so that it fits the data better.

Bayesian formulation of the problem

In the following subsections, we explain the algorithm applied to a simple


toy-example (Fig. 3.6). The data we want to segment are 1D synthetic signals
with Gaussian noise with mean μ = 0 and variance σ 2 . Let us denote the data
as I(x), x ∈ [0, 1].
To formulate the problem, in the first place, we have to define how we want
to model the image. The model has to be designed according to the nature of
the data present in the image. For the 1D toy example, we could define some
simple region models such as straight lines, arcs, and some more complex pat-
terns like waves. Let us index the models with li ∈ {line, circle, wave, . . .},
where i is the number of region given by the segmentation. Each one of the
models is parameterized by some variables θ, which make possible its adjust-
ment to the data. For example, the line is parameterized by θ = (a, b), where
a is the slope and b is the intercept. A circular arc could be parameterized
by θ = (cx , cy , r), where the center of the arc is (cx , cy ) and r is the radius.
Finally, the wave could be described by its slope, intercept, wave length and
wave height. However, for simplicity, in the following examples we will use
only the line model for fitting the data (Fig. 3.7). We will still use the index
li ∈ {line} to maintain generalization in the notation and formulation of the
problem. For an example with two models (line and arc), see [159].
Each region model corresponds to some definite interval of the signal. Let
us call x0 < x1 < · · · < xk with x0 = 0, xk = 1 the change points, which limit
each region (li , θi ) to the description of the interval [xi − 1, xi ) of the signal.

0.5

0.4

0.3

0.2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Fig. 3.6. A noisy signal composed of several segments.


66 3 Contour and Region-Based Image Segmentation

0.5

0.4

0.3

0.2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.5

0.4

0.3

0.2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.5

0.4

0.3

0.2

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Fig. 3.7. Three different segmentations of the signal.

The global model I0 (x; li , θi ) of the signal I(x), or “world scene” is therefore
described by the vector of random variables:

W = (k, {x1 , . . . , xk−1 }, {(l1 , θ1 ), . . . , (lk , θk )}) (3.65)

Assuming that
• In this 1D example the individual likelihoods for each region I0 (x, li , θi ),
xi − 1 ≤ x < xi decay exponentially with the squared error between the
model and the actual signal:
&  xi '
− 2σ12 xi −1 (I(x)−I0 (x;li ,θi ))
2
dx
P (I|li , θi ) = e (3.66)

• The prior p(W) is given by penalizing


– the number k − 1 of regions: P (k) ∝ e−λ0 k
– and the number |θi | of parameters: P (θi |li ) ∝ e−λ|θi |
• All region models are equally likely a priori, that is, p(li ) is uniform.
Then, the standard Bayesian formulation of the posterior probability is
& k x '
− 2σ12 i (I(x)−I0 (x;li ,θi ))2 dx k
P (W |I) ∝ e i=1 xi−1
e−λ0 k e−λ i=1 |θi |
(3.67)
3.4 Model Order Selection in Region-Based Segmentation 67

The maximum a posteriori (MAP) solution comes from maximizing the pos-
terior probability (Eq. 3.67). In energy minimization terms, the exponent of
the posterior is used to define an energy function:
k 
1  xi k
E(W ) = (I(x) − I0 (x; l i , θ i ))2
dx + λ 0 k + λ |θi | (3.68)
2σ 2 i=1 xi−1 i=1

This energy function has to be minimized in a space with variable number of


dimensions because the number of regions k is unknown. One way of solving
the problem would be to use a greedy search, which starts with a high k and
fuses regions, or to start with a low k and part regions, according to some
metric, for example, the error. At each step, the models would have to be
optimized for a fixed k in order to perform the evaluation. The jump-diffusion
algorithm embodies the search of k in both directions and the optimization
of the region models in a single optimization algorithm.

Reversible jumps

The posterior probability P (W |I) (Eq. 3.67) is distributed over a countable


number of solutions subspaces Ωi of varying dimension. The union of these
subspaces Ω = ∪∞ n=1 Ωn forms the complete solution space.
To search over the solution space, there is the need to perform reversible
jumps from one subspace to another. In the 1D example, we define two types
of jumps: merge two adjacent regions, and split a region (see Fig. 3.8 for a
graphical example). If there is more than one region model, then a third type
of jump has to be defined, to change the model of a region, for example,
from li = line to li = arc. In the jump-diffusion formulation, the jumps are
realized by a Metropolis move over a Markov chain of states. The world scene
W = (n, ψ) denotes the state of the stochastic process at time t, where ψ ∈ Ωn
are the state variables that define the intervals and parameters of a current

Jump of ’split’ type, iteration 459 Jump of ’split’ type, iteration 460

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Fig. 3.8. A jump of “split” type. Left: plot of W and I. Right: plot of W  and I.
68 3 Contour and Region-Based Image Segmentation

set of region models in a subspace Ωn . Similarly, W  = (m, φ) with φ ∈ Ωm


and m = n is the destination state of the jump in the Markov chain. Then,
the probability to accept the jump W → W  is given by:

 p(W  |I)dφ g(W  → W )dψ
α(W |W ) = min 1, × (3.69)
p(W |I)dψ g(W → W  )dφ
where g(· ) denotes the forward and backward proposal probabilities, which
are given by the probability of choosing a destination space and the density
of the parameters in that destination space:
g(W  → W ) = q(W  → W )q(ψ|n)
(3.70)
g(W → W  ) = q(W → W  )q(φ|m)
In the standard jumping scheme, the densities q(ψ|n) and q(φ|m) are usually
set to uniform distributions. This makes the process computationally very
slow, which is an important drawback, and it is discussed in Section 3.4.2.
Let us see a numerical example with the jump W → W  plotted in
Fig. 3.8. If uniform densities are supposed, then the probability of a jump
will be given by the energies E(W ) and E(W  ) of the states before and
after the jump. The jump with the lowest energy will be the most probable
one. The energy is calculated using Eq. 3.68, which consists, basically, of
the error between the models and the actual signal and the penalizations to
complex models. The trade-off between error and complexity is regularized
by λ and λ0 . The variance σ 2 is approximately the variance of the signal,
and I(x) is the actual signal. The model is denoted as I0 (x; li , θi ), and it
depends on W . For the toy-example, we only have the linear region model
I0 (x, line, (a, b)) = ax + b, and in the state W , there are two regions k = 2.
The first region has parameters θ1 = (0.6150, 0.2394), and the second region
has parameters θ2 = (0.2500, 0.1260). The change point between the regions
is x2 = 0.5128, so the state is denoted as
W = (2, {0, 0.5128, 1},
(3.71)
{(line, (0.6150, 0.2394)), (line, (0.2500, 0.1260))})
and the error would be
k  xi
(I(x) − I0 (x; li , θi ))2 dx
x −1
i=1 i 0
= (I(x) − 0.6150x − 0.2394)2 dx (3.72)
0.5128
 0.5128
+ (I(x) − 0.2500x − 0.1260)2 dx
1

The error is calculated in a similar way for W  , which in the example is


W  = (3, {0, 0.5128, 0.6153, 1},
{(line, (0.6150, 0.2394)),
(3.73)
(line, (−1.3270, 1.1078)),
(line, (0.2493, 0.1204))})

Table 3.1. Energies E(W') and probabilities P(W'|I) of the destination states of the jumps from state W (Eq. 3.71 and Fig. 3.8, left). The jumps considered are splitting region 1, splitting region 2, merging regions 1 and 2, and remaining in the same state. The last column sums the four destinations.

              W'_actual   W'_split1   W'_split2   W'_merge12   Sum
E(W')         0.1514      0.1987      0.0808      13.6823      14.1132
P(W'|I)       0.2741      0.2089      0.5139      0.0030       1.0000

For the signal I(x) of the example, we obtain the energy values E(W) = 0.0151 and E(W') = 0.0082. The probabilities p(W|I) and p(W'|I) are given by the normalization of the inverse of the energy. There are several possible moves
from the state W : splitting region 1, splitting region 2, merging regions 1
and 2, or remaining in the same state. In Table 3.1, the energies and prob-
abilities of the considered destination states are shown. It can be seen that,
considering uniform densities q(ψ|n) and q(φ|m), the most probable jump is
to the state in which region 2 is split.
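To make the numbers in Table 3.1 concrete, the following Python sketch (illustrative only; the function and variable names are ours) turns the tabulated energies into jump probabilities by normalizing inverse energies, and shows a Metropolis-style acceptance test in the spirit of Eq. 3.69 under the simplifying assumptions of symmetric proposals and p ∝ exp(−E/T):

import numpy as np

# Energies E(W') of the candidate destination states (values from Table 3.1):
# remain in the current state, split region 1, split region 2, merge regions 1 and 2.
energies = {"actual": 0.1514, "split1": 0.1987, "split2": 0.0808, "merge12": 13.6823}

# With uniform proposal densities q(psi|n), q(phi|m), the probability of each
# destination is the normalized inverse of its energy (as done for Table 3.1).
inv = {k: 1.0 / e for k, e in energies.items()}
z = sum(inv.values())
probs = {k: v / z for k, v in inv.items()}
print(probs)  # 'split2' gets the largest probability (about 0.51)

def accept_probability(E_current, E_proposed, T=1.0):
    """Metropolis acceptance for a jump W -> W' with a symmetric proposal:
    alpha = min(1, p(W'|I)/p(W|I)) with p proportional to exp(-E/T)."""
    return min(1.0, np.exp((E_current - E_proposed) / T))

# Example: jumping from the current two-region state to the 'split region 2' state.
alpha = accept_probability(E_current=0.1514, E_proposed=0.0808)
rng = np.random.default_rng(0)
accepted = rng.random() < alpha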

Stochastic diffusions

In a continuous subspace with fixed region models, stochastic diffusions can be performed in order to adjust the parameters of the models so that they fit the data better. To do this, the energy function defined in Eq. 3.68 has to be minimized. The stochastic diffusion uses the continuous Langevin equations, which simulate Markov chains with stationary density p(ψ) ∝ exp(−E(ψ)/T), where ψ is a variable of the model and T is a temperature that follows an annealing scheme. The equations have a term that introduces normal noise, \sqrt{2T(t)}\,N(0,(dt)^2), into the motion. This term, called Brownian motion, also depends on the temperature T and is useful for overcoming local minima. Then, the motion equation for a variable at time t is defined as

d\psi(t) = -\frac{dE(W)}{d\psi}\, dt + \sqrt{2T(t)}\, N(0, (dt)^2)   (3.74)

Given this definition, let us obtain the motion equations for the variables of the toy-example, which are the change points x_i and the parameters θ_i = (a, b) of the linear region models. We have to obtain the expression of the derivative of the energy E(W) with respect to each one of these variables. Then, the motion equation for a change point x_i is calculated as

\frac{dx_i(t)}{dt} = -\frac{dE(W)}{dx_i} + \sqrt{2T(t)}\, N(0, 1)   (3.75)

Let us calculate the derivative dE(W )/dxi . In the definition of E(W ) in


Eq. 3.68, we can see that for a fixed k and fixed region models, the penalization
terms are constant (c). Therefore, in the derivative of E(W ), these become

null. On the other hand, the summation adds k terms and only two of them
contain xi , the rest are independent, so they are also null in the derivative.
For compactness, let us denote the error between model and signal as f and
its indefinite integral as F :
fi (x) = (I(x) − I0 (x; li , θi ))2
(3.76)
Fi (x) = (I(x) − I0 (x; li , θi ))2 dx
Then the derivative of the energy is calculated as

\frac{dE(W)}{dx_i} = \frac{d}{dx_i}\left[\frac{1}{2\sigma^2}\sum_{i=1}^{k}\int_{x_{i-1}}^{x_i} f_i(x)\,dx + c\right]
= \frac{1}{2\sigma^2}\frac{d}{dx_i}\left[\cdots + \int_{x_{i-1}}^{x_i} f_i(x)\,dx + \int_{x_i}^{x_{i+1}} f_{i+1}(x)\,dx + \cdots\right]
= \frac{1}{2\sigma^2}\frac{d}{dx_i}\left(F_i(x_i) - F_i(x_{i-1}) + F_{i+1}(x_{i+1}) - F_{i+1}(x_i)\right)
= \frac{1}{2\sigma^2}\frac{d}{dx_i}\left(F_i(x_i) - F_{i+1}(x_i)\right)
= \frac{1}{2\sigma^2}\left[(I(x_i) - I_0(x_i; l_i, \theta_i))^2 - (I(x_i) - I_0(x_i; l_{i+1}, \theta_{i+1}))^2\right]   (3.77)

Finally, the expression obtained for dE(W)/dx_i is substituted into the x_i motion equation, Eq. 3.75. The resulting equation is a 1D case of the region competition equation [184], which also moves the limits of a region according to the fitness of the adjacent region models to the data.
The motion equations for the θ_i parameters have an easy derivation in the linear model case. The motion equation for the slope parameter a_i results in

\frac{da_i(t)}{dt} = -\frac{dE(W)}{da_i} + \sqrt{2T(t)}\,N(0,1)
= -\frac{1}{2\sigma^2}\frac{d}{da_i}\sum_{i=1}^{k}\int_{x_{i-1}}^{x_i}(I(x) - a_i x - b_i)^2\,dx + \sqrt{2T(t)}\,N(0,1)
= -\frac{1}{2\sigma^2}\left[2\left(a_i x^2 + b_i x - xI(x)\right)\right]_{x=x_{i-1}}^{x_i} + \sqrt{2T(t)}\,N(0,1)   (3.78)

Similarly, the motion equation for the intercept parameter b_i is

\frac{db_i(t)}{dt} = -\frac{dE(W)}{db_i} + \sqrt{2T(t)}\,N(0,1)
= -\frac{1}{2\sigma^2}\frac{d}{db_i}\sum_{i=1}^{k}\int_{x_{i-1}}^{x_i}(I(x) - a_i x - b_i)^2\,dx + \sqrt{2T(t)}\,N(0,1)
= -\frac{1}{2\sigma^2}\left[2\left(a_i x + b_i - I(x)\right)\right]_{x=x_{i-1}}^{x_i} + \sqrt{2T(t)}\,N(0,1)   (3.79)

In Fig. 3.9, the result of applying the motion equations over time t (that is, over the number of iterations) is represented. It can be seen that the motion equations
Fig. 3.9. Diffusion of 'b', 'a', both parameters together, and the limits.

modify the parameters in the direction that brings the model closer to the data. The parameter increments are larger when the model is far from the data and smaller when the model is close to it.
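As an illustration of the diffusion step, the following minimal Python sketch discretizes the Langevin updates of Eqs. 3.78 and 3.79 for a single linear region; the sampled signal, the annealing schedule T(t) = 1/(1+t) and all names are assumptions introduced only for this example, and the gradient is written in explicit descent form following the sign convention of Eq. 3.74:

import numpy as np

def diffusion_step(a, b, x, I, T, dt, sigma2, rng):
    """One discretized Langevin update of the linear region model I0(x) = a*x + b
    on the samples x with observed signal values I (in the spirit of Eqs. 3.78-3.79).
    The energy gradient is approximated by a sum over the samples of the interval."""
    residual = a * x + b - I                    # I0(x) - I(x)
    dE_da = (1.0 / (2.0 * sigma2)) * np.sum(2.0 * residual * x)
    dE_db = (1.0 / (2.0 * sigma2)) * np.sum(2.0 * residual)
    noise = np.sqrt(2.0 * T) * rng.normal(0.0, dt, size=2)   # Brownian term
    a_new = a - dE_da * dt + noise[0]           # move downhill in energy, plus noise
    b_new = b - dE_db * dt + noise[1]
    return a_new, b_new

# Toy usage on a noisy line segment (all values are illustrative).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 0.5, 50)
I = 0.6 * x + 0.24 + 0.01 * rng.normal(size=x.size)
a, b = 0.0, 0.0
for t in range(200):
    T = 1.0 / (1.0 + t)                         # simple annealing schedule (assumption)
    a, b = diffusion_step(a, b, x, I, T, dt=0.01, sigma2=np.var(I), rng=rng)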
As we already explained, diffusions only work within a definite subspace Ω_n, and jumps have to be performed at some adequate moment, so that the process does not get stuck in that subspace. In the jump-diffusion algorithm, the jumps are performed periodically over time with some probability. This probability is related to the waiting time between two consecutive jumps: the number of jumps κ during a given interval of time follows a Poisson distribution with expected value λ:

f_{Poisson}(\kappa; \lambda) = \frac{\lambda^{\kappa} e^{-\lambda}}{\kappa!}   (3.80)
A simulation of the random jumps with Poisson probability distribution in
time is shown in Fig. 3.10.
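A minimal sketch of how such Poisson-distributed jump events can be simulated over the iterations follows; the jump rate and the Bernoulli-thinning approximation of the Poisson process are assumptions made for the example:

import numpy as np

rng = np.random.default_rng(2)
n_iterations = 10_000
expected_jumps = 20                      # illustrative value of the Poisson mean

# Each iteration triggers a jump independently with a small probability, so that
# the number of jumps over the whole run is approximately Poisson distributed.
p_jump = expected_jumps / n_iterations
jump_times = np.flatnonzero(rng.random(n_iterations) < p_jump)
print(len(jump_times), jump_times[:5])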

3.4.2 Speeding-up the Jump-Diffusion Process

The jump-diffusion process simulates a Markov chain, which searches over


the solution space Ω by sampling from the posterior probability p(W |I).

Fig. 3.10. Poisson events simulation over time (iterations).

Reversibility of the jumps is introduced as a tool for guaranteeing irreducibility of the Markov chain. A Markov chain is said to be irreducible if any state can be reached from any other state after some sequence of states χ_0, χ_1, ..., χ_s:

p(\chi_s = \Omega_m \,|\, \chi_0 = \Omega_n) > 0, \quad \forall m, n,\ m \neq n   (3.81)

In theory, reversibility is not strictly necessary for the jump-diffusion process to achieve the optimal solution: after a sufficient number of iterations, the process can reach the optimal solution with a probability close to one. However, the high number of iterations required could make the process computationally unfeasible.

The speed bottlenecks


To make the jump-diffusion process viable, the search through the solution
space has to be made more direct so that convergence to a good solution is
achieved in a smaller number of iterations. The search can be tuned by using
more suitable probability models for the proposal probabilities. In a general
case, the forward proposal probability defined in Eq. 3.70 can be divided into
three cases:

q(\phi|m) = \begin{cases} q(\theta_i \,|\, l_i, [x_{i-1}, x_i)) & \text{switch} \\ q(\theta \,|\, l, [x_{i-2}, x_i)) & \text{merge} \\ q(x \,|\, [x_{i-1}, x_i))\, q(\theta_a \,|\, l_a, [x_{i-1}, x))\, q(\theta_b \,|\, l_b, [x, x_i)) & \text{split} \end{cases}   (3.82)
where the first one is switching the interval [xi−1 , xi ) from its associated region
model to another (li , θi ) model; the second case refers to merging two adjacent
intervals into a single (l, θ) model; the third case is splitting the interval
[xi−1 , xi ) into two different region models, (la , θa ) for the interval [xi−1 , x),
and (lb , θb ) for the interval [x, xi ). In the toy-example, we only have one region
model, so the proposal probabilities do not need to consider switching from
one model to another, as the model parameters are already changed by the
diffusions.
If the proposal probabilities are considered to follow a uniform distribution, then the jumps work as a random selection of new models and, as a consequence, the proposals are rejected most of the time. This greatly increases the number of iterations, even though, in theory, it is highly probable that the process finally achieves the optimal solution. In practice, a good design of the proposal probabilities is necessary.

Data-driven techniques

In [159], the data-driven Markov chain Monte Carlo scheme is explained and
exploited. The idea is to estimate the parameters of the region models, as well
as their limits or changepoints, according to the data that these models have
to describe. These estimations consist of bottom-up strategies, for example,
a simple strategy for estimating the changepoints xi is to place them where
edges are detected. Actually, taking edges as those points with a gradient
above some threshold is not a good strategy. A better way is to take the
edgeness measure as probabilities, and sample this distribution for obtaining
the changepoints. For example, in Fig. 3.11, we show the edgeness of the toy-example data. If a region in the interval [0.2, 0.4) has to be split into two new region models for the intervals [0.2, x) and [x, 0.4), the new changepoint x would be placed with high probability near 0.3, because the edgeness corresponding to this interval defines such a probability distribution.
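A minimal sketch of this sampling strategy, assuming a discretized 1D signal and a synthetic edgeness profile peaked at x = 0.3 (all values and names are illustrative):

import numpy as np

def propose_changepoint(edgeness, x_grid, lo, hi, rng):
    """Sample a new changepoint for a split of the interval [lo, hi) by treating the
    nonnegative edgeness values inside the interval as an unnormalized probability
    distribution, instead of thresholding the edgeness."""
    mask = (x_grid >= lo) & (x_grid < hi)
    weights = edgeness[mask].astype(float)
    weights /= weights.sum()
    return rng.choice(x_grid[mask], p=weights)

# Illustrative edgeness with a peak near x = 0.3 inside the interval [0.2, 0.4).
rng = np.random.default_rng(3)
x_grid = np.linspace(0.0, 1.0, 101)
edgeness = 0.05 + np.exp(-((x_grid - 0.3) ** 2) / (2 * 0.01 ** 2))
x_new = propose_changepoint(edgeness, x_grid, 0.2, 0.4, rng)   # very likely close to 0.3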
The estimation of the parameters θ_i of the linear model of the toy-example can be performed by taking the slope and the intercept of the most voted line of the Hough transformation of the underlying data. However, a probabilistic approach is more suitable for the jump-diffusion process. It consists of computing the importance proposal probability with Parzen windows centered at the lines of the Hough transformation. When a new model is proposed in the interval [x_{i-1}, x_i), its importance proposal probability is

q(\theta_i \,|\, l_i, [x_{i-1}, x_i)) = \sum_{j=1}^{N} \omega_j\, G(\theta_i - \theta_j)   (3.83)

where G(·) is a Parzen window, N is the number of candidate lines of the Hough transformation, and ω_j are the accumulated weights of the Hough transformation votes.
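The following sketch evaluates such a Parzen-window proposal for a candidate line θ = (slope, intercept); the candidate lines, vote counts and bandwidth h are hypothetical values introduced only for illustration:

import numpy as np

def importance_proposal(theta, hough_candidates, hough_weights, h=0.05):
    """Importance proposal probability of Eq. 3.83: a Gaussian Parzen window centered
    at each candidate line (slope, intercept) returned by the Hough transformation,
    weighted by its accumulated votes."""
    theta = np.asarray(theta, dtype=float)
    cands = np.asarray(hough_candidates, dtype=float)      # shape (N, 2): slope, intercept
    w = np.asarray(hough_weights, dtype=float)
    w = w / w.sum()                                         # normalized vote weights
    sq_dist = np.sum((cands - theta) ** 2, axis=1)
    windows = np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2)
    return float(np.sum(w * windows))

# Hypothetical candidate lines and votes.
q = importance_proposal((0.60, 0.24),
                        hough_candidates=[(0.61, 0.24), (0.25, 0.13)],
                        hough_weights=[120, 80])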
The interesting concept that the design of data-driven proposal probabilities introduces is that a top-down process is aided by bottom-up processes.
This is a way of mixing generative and discriminative methods in a single
algorithm. In Fig. 3.12, we compare the evolution of the toy-example energy
E(W ) during the jump-diffusion process in four cases: the pure generative

Fig. 3.11. Edgeness measure of the signal.



Fig. 3.12. Jump-diffusion energy evolution (energy vs. step) for the pure generative scheme and the data-driven variants using edges, the Hough transform, and both. See Color Plates.

scheme; the data-driven approach using edges; the one using the Hough transformation; and the one using both. We can see that bottom-up methods help the jump-diffusion convergence and make it computationally more viable. In probabilistic terms, a good design of the proposal probabilities increases the ratio

\frac{p(m, \phi|I)}{p(n, \psi|I)} = e^{-\Delta E},   (3.84)

and the jump proposals are more frequently successful.

3.4.3 K-adventurers Algorithm


The data-driven jump-diffusion process has the disadvantage of not defining
a stop criterion. Although the energy of the model has a pronounced descent
during the first iterations, a convergence to some low energy value cannot be
detected. Low energies are obtained in several different subspaces Ωn . This
is due to the well-known fact that for most problems, segmentation has more
than one good solution. Actually, as explained in [159], computing different
solutions can be necessary for intrinsically ambiguous scenes, and for providing
robustness to the segmentation process, given that the designed probability
distributions are not perfect.

The solution space could contain a large amount of different solutions Wi ,


some of them very similar to each other. We are not interested in maintain-
ing all of them, but in selecting the most important ones. The K-adventurers
algorithm is designed for sampling from the probability distribution formed
by the solutions, so that the most representative ones are selected. For exam-
ple, if we have a trimodal distribution of solutions and we want a number of
K = 3 solutions to represent the solution space, then there is a high probabil-
ity that the K-adventurers algorithm yields the three solutions corresponding
to the modes of the distribution. It is not necessary for K to be related to
the modality of the distribution. We select the number of important solutions
depending on how well we want to describe the solution space. Nevertheless,
if the distribution of solutions is very complex, a larger number K of repre-
sentative solutions is needed for obtaining a good sampling. This strategy is
an implicit way of model order selection. It is probabilistic, and depending
on the kinds of solutions present in their distribution, different model orders
could be found.
The selection of important solutions is performed by defining the objective of minimizing a Kullback–Leibler divergence D(p||p̂) between the Bayesian posterior probability p(W|I) (already defined in Eq. 3.67) and the nonparametric probability p̂(W|I), which is represented by a fixed-size set S of K selected samples:

\hat{p}(W|I) = \sum_{i=1}^{K} \omega_i\, G(W - W_i, \sigma_i^2), \qquad \sum_{i=1}^{K} \omega_i = 1   (3.85)

where ω_i are weights proportional to p(W|I) and they sum to 1, and G is a Gaussian window in the solution space Ω with variance σ_i². The objective set S* of selected solutions (ω_i, W_i), i = 1, ..., K, is therefore defined as

S^* = \arg\min_{S:\,|S|=K} D(p||\hat{p})   (3.86)

where the Kullback–Leibler divergence (KL-divergence) consists of

D(p||\hat{p}) = \int p(W|I) \log \frac{p(W|I)}{\hat{p}(W|I)}\, dW   (3.87)
The K-adventurers algorithm iteratively calculates an approximation of S*. Each time the jump-diffusion algorithm performs a successful jump and its energy is minimized by the diffusion process, the solution W is taken by the K-adventurers algorithm, and the KL-divergence is recomputed in order to update the Ŝ* set of solutions if necessary. The algorithm is as follows: it starts with a fixed-size set of K solutions, which initially are copies of the same solution. This set is iteratively updated after each new solution yielded by the jump-diffusion process. The iterations are denoted by the while sentence in Alg. 4.
The stop criterion is not defined; however, the ergodicity of the Monte
Carlo Markov chain process guarantees that significant modes will be visited

Algorithm 4: K-adventurers
Input: I, successive solutions (ω_{K+1}, W_{K+1}) generated by a jump-diffusion process
Initialize Ŝ* with K copies of one initial solution (ω_1, W_1)
while ∃ (ω_{K+1}, W_{K+1}) ← jump-diffusion do
    S_+ ← Ŝ* ∪ {(ω_{K+1}, W_{K+1})}
    for i = 1, 2, ..., K+1 do
        S_{-i} ← S_+ \ {(ω_i, W_i)}
        p̂ ← S_{-i}
        d_i ← D̂(p||p̂)
    end
    i* ← arg min_i d_i
    Ŝ* ← S_{-i*}
end
Output: Ŝ*

over time and there will be a convergence to the p distribution. The num-
ber of iterations necessary for a good approximation of S ∗ depends on the
complexity of the search space. A good stop criterion is to observe whether
S ∗ undergoes important changes, or on the contrary, remains similar after a
significant number of iterations.
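The following Python sketch mirrors the structure of Alg. 4; the divergence estimate is passed in as a callback (in practice it would be the approximation D̂ of Eq. 3.92), and the toy surrogate used in the usage example is only meant to illustrate the interface:

import numpy as np

def k_adventurers(solution_stream, K, divergence):
    """Sketch of Alg. 4. `solution_stream` yields (weight, solution) pairs produced by
    jump-diffusion; `divergence` estimates D_hat(p || p_hat) for a candidate set of K
    solutions. Returns the K retained representative solutions."""
    stream = iter(solution_stream)
    first = next(stream)
    S_hat = [first] * K                     # initialize with K copies of the first solution
    for new_sol in stream:
        S_plus = S_hat + [new_sol]          # K + 1 candidates
        # Remove each candidate in turn and keep the K-subset with the smallest divergence.
        scores = [divergence([s for j, s in enumerate(S_plus) if j != i])
                  for i in range(K + 1)]
        best = int(np.argmin(scores))
        S_hat = [s for j, s in enumerate(S_plus) if j != best]
    return S_hat

# Toy usage: scalar "solutions" with weights proportional to exp(-|W|); the surrogate
# divergence below simply prefers well-spread representative sets and is not Eq. 3.92.
rng = np.random.default_rng(4)
stream = [(float(np.exp(-abs(w))), float(w)) for w in rng.normal(0.0, 3.0, size=100)]
reps = k_adventurers(stream, K=6,
                     divergence=lambda S: -np.var([sol for _, sol in S]))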
In each while iteration, a for sentence iterates the estimation of the KL-divergence D̂(p||p̂_{-i}) between p and each possible set of K solutions, considering the new one and subtracting one of the former ones. The interesting point is the estimation of the divergence, provided that p consists of a set of solutions whose size increases with each new iteration. The idea that Tu and Zhu propose in [159] is to represent p(W|I) by a mixture of N Gaussians, where N is the number of solutions returned by the jump-diffusion process, which is the same as the number of iterations of the while sentence. These N solutions are partitioned into K disjoint groups. Each one of these groups is represented by a dominating solution, which is the closest one to some solution from the S* set of selected solutions. The name of the algorithm is inspired by this basic idea: metaphorically, K adventurers want to occupy the K largest islands in an ocean, while keeping apart from each other's territories.
As shown in Eq. 3.85, the probability distributions are modelled with a sum of Gaussians centered at the collected solutions. These solutions are largely separated because of the high dimensionality of the solution space. This is the reason for forming groups with dominating solutions and ignoring the rest of the solutions. More formally, for selecting K << N solutions from the initial set of solutions S_0, a mapping function from the indexes of Ŝ* to the indexes of S_0 is defined, τ : {1, 2, ..., K} → {1, 2, ..., N}, so that

\hat{S}^* = \{(\omega_{\tau(i)}, W_{\tau(i)});\ i = 1, 2, \ldots, K\}   (3.88)



which, similarly to Eq. 3.85, encodes the nonparametric probability density model:

\hat{p}(W) = \frac{1}{\sum_{i=1}^{K} \omega_{\tau(i)}} \sum_{i=1}^{K} \omega_{\tau(i)}\, G(W - W_{\tau(i)}, \sigma_{\tau(i)}^2)   (3.89)

In the experiments performed in [159], the same variance is assumed for all Gaussians.
Given the former density model, the approximation of D(p||p̂) can be defined. From the definition of the KL-divergence, we have

D(p||\hat{p}) = \sum_{n=1}^{N} \int_{D_n} p(W) \log \frac{p(W)}{\hat{p}(W)}\, dW
= \sum_{n=1}^{N} \int_{D_n} \sum_{i=1}^{N} \omega_i\, G(W - W_i; \sigma^2) \times \log \frac{\sum_{i=1}^{N} \omega_i\, G(W - W_i; \sigma^2)}{\frac{1}{\sum_{j=1}^{K} \omega_{\tau(j)}} \sum_{j=1}^{K} \omega_{\tau(j)}\, G(W - W_{\tau(j)}; \sigma^2)}\, dW   (3.90)

where D_n, n = 1, 2, ..., N, are the disjoint domains, each one of them dominated by a single selected solution. A second mapping c : {1, 2, ..., N} → {1, 2, ..., K} is defined so that

\hat{p}(W) \approx \frac{\omega_{\tau(c(n))}}{\sum_{j=1}^{K} \omega_{\tau(j)}}\, G(W - W_{\tau(c(n))}; \sigma^2), \quad W \in D_n,\ n = 1, 2, \ldots, N   (3.91)

Provided that the energy of each mode p(W_i|I) is defined as E(W_i) = −log p(W_i), the approximation of D(p||p̂) is formulated as

\hat{D}(p||\hat{p}) = \sum_{n=1}^{N} \int_{D_n} \omega_n\, G(W - W_n; \sigma^2) \cdot \left[ \log \sum_{j=1}^{K} \omega_{\tau(j)} + \log \frac{\omega_n\, G(W - W_n; \sigma^2)}{\omega_{\tau(c(n))}\, G(W - W_{\tau(c(n))}; \sigma^2)} \right] dW
= \sum_{n=1}^{N} \omega_n \left[ \log \sum_{j=1}^{K} \omega_{\tau(j)} + \log \frac{\omega_n}{\omega_{\tau(c(n))}} + \frac{(W_n - W_{\tau(c(n))})^2}{2\sigma^2} \right]
= \log \sum_{j=1}^{K} \omega_{\tau(j)} + \sum_{n=1}^{N} \omega_n \left[ E(W_{\tau(c(n))}) - E(W_n) + \frac{(W_n - W_{\tau(c(n))})^2}{2\sigma^2} \right]   (3.92)

In [159], the goodness of the approximation D̂(p||p̂) is experimentally


demonstrated and it is compared to the actual KL-divergence D(p||p̂) and
to |p − p̂|. Finally, the definition of a distance measure between solutions W1
and W2 is a delicate question. For the 1D toy-example, a distance measure
between solutions could be the error between the generated model and the

actual data. When several region models are considered, the model type has
to be considered in the distance measure too. Also, the number of regions is
another important term to be considered in the measure.
In Fig. 3.13, we illustrate a solution space of the 1D toy-example. The
K = 6 selected solutions are marked with red rectangles. These solutions are
not necessarily the best ones, but they represent the probability distribution of
the solutions yielded by the jump-diffusion process. In Fig. 3.14, the energies
of these solutions are represented. Finally, in Fig. 3.15 the models of the

Solution space and 6 representative solutions

← solution 100
← solution 86
←solution 76
← solution 58
← solution 36
← solution 14

180
160
140
120
100
80
60
40
# solution 20
0

Fig. 3.13. The solution space in which the K-adventurers algorithm has selected
K = 6 representative solutions, which are marked with rectangles. See Color Plates.

Fig. 3.14. The energies of the different solutions. The representative solutions se-
lected by the K-adventurers algorithm are marked with arrows.
Fig. 3.15. Result of the K-adventurers algorithm for K = 6: the six most represen-
tative solutions of the solution space generated by jump-diffusion.

six solutions are shown. As already explained, an advantage of generative approaches is the possibility of generating data once there is a model with its parameters estimated. In conclusion, the jump-diffusion algorithm integrates both top-down and bottom-up processes. The K-adventurers algorithm, based on information theory, iteratively selects the most important solutions by estimating their probability distribution in the solution space.

3.5 Model-Based Segmentation Exploiting The Maximum Entropy Principle

3.5.1 Maximum Entropy and Markov Random Fields

In Section 3.2, we have introduced the maximum entropy (ME) principle and how it is used to find the approximated shape, through the least biased pdf, given the statistics of the sample. Here, we present how to use it for segmenting parts of a given image whose colors are compatible with a given ME model. Therefore, ME is the driving force of learning the model from the samples and their statistics. Such a model is later used in classification tasks, such as the labeling of skinness (skin-color) of pixels/regions in the image [84]. This is a task of high practical importance, for instance for blocking adult images in webpages (see for instance [180]). A keypoint in such learning is the trade-off between the complexity of the model to learn and its effectiveness in ROC terms. Let X = {(I_i, y_i) : i = 1, ..., M} be the training set, where I_i is a color image that is labeled as y_i = 1 if it contains skin, and y_i = 0 otherwise. In [84], the

Compaq Database [89] is used. In such database, skin is manually segmented and M = 18,696 RGB images³ are used. Therefore, if x_s is the color of the sth pixel, then x_s ∈ S = {0, ..., 255}³.
The training set allows us to approximate p(x, y) with an independent
model C0 called the baseline model:

C_0 : \forall s \in S,\ \forall x_s \in S,\ \forall y_s \in \{0,1\}:\ p(x_s, y_s) = q(x_s, y_s)   (3.93)

q(x_s, y_s) being the proportion of pixels in the training set with color x_s and skinness y_s (two tridimensional histograms, one for each skinness, typically quantized to 32 bins, or a four-dimensional histogram in the strict sense). Using a modified version of the usual expectation constraints in ME, we have that

p(x_s, y_s) = E_p[\delta_{x_s}(x_s)\,\delta_{y_s}(y_s)]   (3.94)

where δ_a(b) = 1 if a = b and 0 otherwise. Then, the shape of the ME distribution is

p(x, y) = e^{\lambda_0 + \sum_{s \in S} \lambda(s, x_s, y_s)}.   (3.95)
Thus, assuming q(x_s, y_s) > 0, we find the following values for the multipliers: λ_0 = 0 and λ(s, x_s, y_s) = log q(x_s, y_s). Consequently, the ME distribution is

p(x, y) = \prod_{s \in S} q(x_s, y_s)   (3.96)

and the probability of belonging to a class is given by the Bayes theorem:

p(y|x) = \prod_{s \in S} q(y_s|x_s) = \prod_{s \in S} \frac{q(x_s|y_s)\, q(y_s)}{q(x_s)} = \prod_{s \in S} \frac{q(x_s|y_s)\, q(y_s)}{\sum_{y_s'=0}^{1} q(x_s|y_s')\, q(y_s')}   (3.97)
Such probability is computed from two tridimensional histograms: q(xs |ys = 0)
and q(xs |ys = 1). Such probability is expressed in terms of gray levels in the
second column of Fig. 3.16.
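A minimal sketch of the baseline classifier of Eq. 3.97 follows, assuming the two quantized color histograms and the skin prior have already been estimated; the arrays below are placeholders, not the actual statistics of the database:

import numpy as np

def skin_posterior(pixels, hist_skin, hist_nonskin, prior_skin, bins=32):
    """Baseline model of Eq. 3.97: per-pixel posterior q(y_s = 1 | x_s) from two
    quantized RGB histograms q(x_s | y_s = 1) and q(x_s | y_s = 0) learnt on a
    labeled training set. `pixels` is an (N, 3) uint8 array."""
    step = 256 // bins
    idx = tuple((pixels // step).T)                  # histogram bin of each pixel
    lik_skin = hist_skin[idx]                        # q(x_s | y_s = 1)
    lik_non = hist_nonskin[idx]                      # q(x_s | y_s = 0)
    num = lik_skin * prior_skin
    den = num + lik_non * (1.0 - prior_skin) + 1e-12
    return num / den

# Hypothetical (uniform) histograms; in practice they come from the training proportions.
bins = 32
hist_skin = np.full((bins, bins, bins), 1.0 / bins**3)
hist_nonskin = np.full((bins, bins, bins), 1.0 / bins**3)
pixels = np.array([[220, 160, 140], [20, 200, 30]], dtype=np.uint8)
p_skin = skin_posterior(pixels, hist_skin, hist_nonskin, prior_skin=0.2, bins=bins)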
The model in Eq. 3.96 is consistent with an unrealistic, but efficient inde-
pendence assumption (the skinness of a pixel is independent of that of their
neighbors). A more realistic assumption is the Markovian one: the skinness
of a pixel s depends only on its neighbors Ns (for instance, it is usual to
use a 4-neighborhood). This is coherent with the fact that skin pixels belong
to larger regions. More formally, although there are excellent monographs
on the subject (see [175]), the pixels in the image are the nodes (random
variables) of an undirected graph and the edges are defined by the type
of neighborhood chosen (for instance: north, south, east and west). Given
two neighbors s and t, which are denoted by < s, t >, let q(ys , yt ) be the
expected proportion of observations of (ys = a, yt = b), that is, we have
³ Although the skin hue is invariant to the ethnic group, skinness depends on illumination conditions, which are usually unknown.

Fig. 3.16. Skin detection results. Comparison between the baseline model (top-right), the tree approximation of MRFs with BP (bottom-left), and the tree approximation of the first-order model with BP instead of Alg. 5. Figure by B. Jedynak, H. Zheng and M. Daoudi (© 2003 IEEE). See Color Plates.

the four quantities: q(y_s = 0, y_t = 0), q(y_s = 0, y_t = 1), q(y_s = 1, y_t = 0), and q(y_s = 1, y_t = 1). Here, the aggregation of horizontal and vertical quantities yields an implicit assumption of isotropy, which is not an excessively unrealistic simplification. Let D be the following model:

D : \forall <s,t> \in S \times S:\ p(y_s = 0, y_t = 0) = q(0,0),\quad p(y_s = 1, y_t = 1) = q(1,1)   (3.98)

The ME model for C_0 ∩ D is obtained by starting from the formulation of the expectation constraints:

\forall y_s \in \{0,1\},\ \forall y_t \in \{0,1\}:\quad p(y_s, y_t) = E_p[\delta_{y_s}(y_s)\,\delta_{y_t}(y_t)]   (3.99)

thus, the solution has the now familiar exponential shape, but depends on more multipliers enforcing the constraints that the labels of neighbors <s,t> be equal:

p(x, y) = e^{H(x,y,\Lambda)}, \qquad \Lambda = (\lambda_0(\cdot), \ldots, \lambda_3(\cdot))
H(x,y,\Lambda) = \lambda_0 + \sum_{s} \lambda_1(s, x_s, y_s) + \sum_{<s,t>} \lambda_2(s,t)(1 - y_s)(1 - y_t) + \sum_{<s,t>} \lambda_3(s,t)\, y_s y_t   (3.100)

Therefore, p(x_s, y_s), the marginal for x_s, may be posed as

p(x_s, y_s) = \sum_{x_t, t \neq s} \sum_{y_t, t \neq s} p(x, y) = e^{\lambda_0 + \lambda_1(s, x_s, y_s)}\, g(s, y_s)   (3.101)

where g(s, y_s) is a function independent of x_s. Thus, the marginal for y_s is

p(y_s) = \sum_{x_s} p(x_s, y_s) = e^{\lambda_0}\, g(s, y_s) \sum_{x_s} e^{\lambda_1(s, x_s, y_s)}   (3.102)

and, applying the Bayes theorem, we have

p(x_s|y_s) = \frac{p(x_s, y_s)}{p(y_s)} = \frac{e^{\lambda_0 + \lambda_1(s,x_s,y_s)}\, g(s,y_s)}{e^{\lambda_0}\, g(s,y_s) \sum_{x_s} e^{\lambda_1(s,x_s,y_s)}} = \frac{e^{\lambda_1(s,x_s,y_s)}}{\sum_{x_s} e^{\lambda_1(s,x_s,y_s)}} = \frac{q(x_s|y_s)}{\sum_{x_s} q(x_s|y_s)}   (3.103)

because p(x_s|y_s) = q(x_s|y_s) and λ_1(s, x_s, y_s) = log q(x_s|y_s) when positivity is assumed. Consequently, the resulting model is
 
p(x, y) \approx \prod_{s \in S} q(x_s|y_s)\; e^{\sum_{<s,t>} a_0 (1-y_s)(1-y_t) + a_1 y_s y_t}   (3.104)

where a_0 = λ_2(s,t) and a_1 = λ_3(s,t) are constants, which must be set to satisfy the constraints. Then,

p(y|x) \approx \prod_{s \in S} q(x_s|y_s)\; p(y)   (3.105)

p(y) = \frac{1}{Z(a_0, a_1)}\, e^{\sum_{<s,t>} a_0 (1-y_s)(1-y_t) + a_1 y_s y_t}   (3.106)

Z(a_0, a_1) being the normalization (partition) function

Z(a_0, a_1) = \sum_{y} e^{\sum_{<s,t>} a_0 (1-y_s)(1-y_t) + a_1 y_s y_t}   (3.107)

Thus, the prior model p(y) enforces that two neighboring pixels have the same
skinness, which discards isolated points and, thus, smooths the result of the
classification. Actually, such model is a version of the well-known Potts model.
An interesting property of the latter model is that for any < s, t >, we
have p(ys = 1, yt = 0) = p(ys = 0, yt = 1).
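As a small illustration of this prior, the following sketch evaluates the unnormalized log of p(y) in Eq. 3.106 for a binary label image; the label images and the values of a_0 and a_1 are arbitrary:

import numpy as np

def potts_log_prior(y, a0, a1):
    """Unnormalized log of the prior p(y) in Eq. 3.106 for a binary label image y
    with a 4-neighborhood: each horizontal and vertical pair <s, t> contributes
    a0 if both labels are 0 and a1 if both are 1 (the partition Z is omitted)."""
    y = np.asarray(y, dtype=int)
    total = 0.0
    for ya, yb in ((y[:, :-1], y[:, 1:]), (y[:-1, :], y[1:, :])):   # horizontal, vertical pairs
        total += a0 * np.sum((1 - ya) * (1 - yb)) + a1 * np.sum(ya * yb)
    return total

# A smooth labeling gets a higher unnormalized log-prior than a fragmented one.
smooth = np.zeros((8, 8), dtype=int); smooth[:, 4:] = 1
striped = (np.arange(64).reshape(8, 8) % 2)
print(potts_log_prior(smooth, a0=1.0, a1=1.0), potts_log_prior(striped, a0=1.0, a1=1.0))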

3.5.2 Efficient Learning with Belief Propagation

Following an increasing complexity (and more realism) in the proposed models for skin detection, the third one, C_1 ⊂ (C_0 ∩ D) ⊂ C_0, consists of imposing the following constraints:

C_1 : \forall <s,t> \in S \times S,\ \forall x_t, x_s \in C,\ \forall y_s, y_t \in \{0,1\}:\quad p(x_s, x_t, y_s, y_t) = q(x_s, x_t, y_s, y_t)   (3.108)

q(x_s, x_t, y_s, y_t) being the expected proportion of times in the training set that two 4-neighboring pixels have the realization (x_s, x_t, y_s, y_t), independently of their orientation. Therefore, the ME pdf must satisfy

p(x_s, x_t, y_s, y_t) = E_p[\delta_{x_s}(x_s)\,\delta_{x_t}(x_t)\,\delta_{y_s}(y_s)\,\delta_{y_t}(y_t)]   (3.109)

and, thus, the ME solution is

p(x_s, x_t, y_s, y_t) \approx e^{\sum_{<s,t>} \lambda(x_s, x_t, y_s, y_t)}   (3.110)

which implies estimating 256³ × 256³ × 2 × 2 Lagrange multipliers when assuming that a color pixel may take 256³ values. Obviously, it is impossible to evaluate the partition function. The complexity of inference in MRFs is highly influenced by the fact that the neighborhoods <s,t> define an undirected graph. However, if undirectedness is relaxed, a tree approximation can be chosen. As a tree is a connected graph without loops, any pairwise MRF over a tree can be written in the following terms [123]:

p(z) \approx \prod_{<s,t>} \frac{p(z_s, z_t)}{p(z_s)\, p(z_t)} \prod_{s \in S} p(z_s)   (3.111)
s∈S

Thus, setting z = (x, y), we have

p(x, y) \approx \prod_{<s,t>} \frac{q(x_s, x_t, y_s, y_t)}{q(x_s, y_s)\, q(x_t, y_t)} \prod_{s \in S} q(x_s, y_s)   (3.112)

However, the latter model demands the computation of one 10-dimensional


(sparse) histogram and, thus, it is quite prone to overfitting. This problem
may be circumvented if we exploit the color gradient xt − xs and apply it to
the following approximation:

q(xs , xt |ys , yt ) ≈ q(xs |ys )q(xt − xs |ys , yt ) (3.113)

whose evaluation requires six histograms of three dimensions. The above simplifications result in the following model:

C^* : \forall x_t, x_s \in C,\ \forall y_s, y_t \in \{0,1\}:\quad p(x_s, y_s) = q(x_s, y_s),\quad p(x_t, y_t) = q(x_t, y_t),\quad p(x_s - x_t, y_s, y_t) = q(x_s - x_t, y_s, y_t)   (3.114)

As the entropy of p(x_s, x_t, y_s, y_t) is given by

H(p) = -\sum_{x_s, x_t, y_s, y_t} p(x_s, x_t, y_s, y_t) \log p(x_s, x_t, y_s, y_t)   (3.115)

the ME solution is given by

p^*(x_s, x_t, y_s, y_t) = P_\lambda \cap C^*   (3.116)

that is, by the pdfs satisfying the above constraints in C^* and having the form

P_\lambda = \frac{1}{Z_\lambda}\, e^{\lambda(x_s, y_s) + \lambda(x_t, y_t) + \lambda(x_s - x_t, y_s, y_t)}   (3.117)

Z_\lambda = \sum_{x_s, x_t, y_s, y_t} e^{\lambda(x_s, y_s) + \lambda(x_t, y_t) + \lambda(x_s - x_t, y_s, y_t)} being the partition function.
In the latter formulation, we have significantly reduced the number of multipliers to estimate, but the estimation cannot be analytic. The typical solution is to adapt the iterative scaling method [47] to this context (we will see an advanced version in the last section of the book, where we explain how to build ME classifiers). The algorithm is summarized in Alg. 5.
Finally, there is an alternative mechanism based on belief propagation [177]
using Bethe trees. Such a tree is rooted at each pixel whose color we want
to infer, and we have trees Tk of different depths where k denotes the depth.
Here, we assume a 4-neighborhood, so that the root generates a child for each of its four neighbors. Then the children generate a node for each of their four

Algorithm 5: Iterative Scaling for Marginals
Input: Marginals: q(x_s, y_s), q(x_t, y_t), q(x_t − x_s, y_s, y_t)
Initialize, ∀x_t, x_s ∈ C, ∀y_s, y_t ∈ {0,1}:
    Lambdas: λ(x_s, y_s) = λ(x_t, y_t) = λ(x_t − x_s, y_s, y_t) = 0.0
repeat
    Marginals: p(x_s, y_s) = p(x_t, y_t) = p(x_t − x_s, y_s, y_t) = 0.0
    Partition function: Z_λ = 0.0
    foreach (x_s, x_t, y_s, y_t) do
        Calculate: g(x_s, x_t, y_s, y_t) = e^{λ(x_s, y_s) + λ(x_t, y_t) + λ(x_t − x_s, y_s, y_t)}
        Update marginals and partition function:
            p(x_s, y_s) ← p(x_s, y_s) + g(x_s, x_t, y_s, y_t)
            p(x_t, y_t) ← p(x_t, y_t) + g(x_s, x_t, y_s, y_t)
            p(x_t − x_s, y_s, y_t) ← p(x_t − x_s, y_s, y_t) + g(x_s, x_t, y_s, y_t)
            Z_λ ← Z_λ + g(x_s, x_t, y_s, y_t)
    end
    foreach Marginal do
        p(x_s, y_s) ← p(x_s, y_s)/Z_λ
        p(x_t, y_t) ← p(x_t, y_t)/Z_λ
        p(x_t − x_s, y_s, y_t) ← p(x_t − x_s, y_s, y_t)/Z_λ
    end
    foreach λ do
        Δλ(x_s, y_s) = ln [q(x_s, y_s)/p(x_s, y_s)]
        Δλ(x_t, y_t) = ln [q(x_t, y_t)/p(x_t, y_t)]
        Δλ(x_t − x_s, y_s, y_t) = ln [q(x_t − x_s, y_s, y_t)/p(x_t − x_s, y_s, y_t)]
        Update all λ ← λ + Δλ
    end
until convergence of all λ(·)
Output: The λ parameters of the marginal model solution p^* (Eq. 3.116).

neighbors that is not yet assigned to a node, and so on. We may have the following general description for a pairwise model:

p(y|x) = \prod_{<s,t>} \psi(x_s, x_t, y_s, y_t) \prod_{s \in S} \phi(x_s, y_s)   (3.118)

where ψ(x_s, x_t, y_s, y_t) are pairwise compatibilities and φ(x_s, y_s) are individual compatibilities between colors and their labels. The BP algorithm for trees relies on computing k times (k is the tree depth) the following variables, called messages:

m_{ts}(y_s) \leftarrow \sum_{y_t} \phi(x_t, y_t)\, \psi(x_s, x_t, y_s, y_t) \prod_{u \in N(t),\, u \neq s} m_{ut}(y_t)   (3.119)

that is, a message m_{ab} is sent from a to b, informing about the consistency of a given label. These messages are initialized with value 1. There is one message

per value of the label at a given node of the tree. Then, the probability of labeling a pixel as skin at the root of the tree is given by

p(y_s = 1 \,|\, x_s, s \in T_k) \approx \phi(x_s, y_s) \prod_{t \in N(s)} m_{ts}(y_s)   (3.120)

This method is considerably faster than the Gibbs sampler. In Fig. 3.16, we show different skin detection results for the methods explored in this section.
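A minimal sketch of this message-passing scheme on a small tree follows, assuming binary labels and illustrative compatibility values; the recursion sends messages from the leaves toward the root, as in Eqs. 3.119 and 3.120:

import numpy as np

def bp_root_belief(tree, phi, psi, root):
    """Belief propagation on a small tree for binary labels. `tree` maps each node to
    its children, `phi[s]` is the 2-vector of local compatibilities phi(x_s, y_s), and
    `psi` is a 2x2 pairwise compatibility table indexed as psi[y_parent, y_child]."""
    def message_to_parent(t):
        # Product of the messages arriving at t from its own children (1 for leaves).
        incoming = np.ones(2)
        for u in tree.get(t, []):
            incoming *= message_to_parent(u)
        # m_{t -> parent}(y_parent) = sum_{y_t} phi(x_t, y_t) psi(y_parent, y_t) incoming(y_t)
        return psi @ (phi[t] * incoming)

    belief = phi[root].copy()
    for t in tree.get(root, []):
        belief *= message_to_parent(t)
    return belief / belief.sum()            # component 1 approximates p(y_root = 1 | ...)

# Tiny example: a root with two children; the compatibilities are illustrative.
tree = {"r": ["c1", "c2"], "c1": [], "c2": []}
phi = {"r": np.array([0.4, 0.6]), "c1": np.array([0.2, 0.8]), "c2": np.array([0.3, 0.7])}
psi = np.array([[0.8, 0.2], [0.2, 0.8]])   # favors equal neighboring labels
print(bp_root_belief(tree, phi, psi, "r"))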

3.6 Integrating Segmentation, Detection and Recognition

3.6.1 Image Parsing

What is image parsing? It is more than segmentation and more than recognition [154]. It deals with their unification in order to parse or decompose the input image I into its constituent patterns, say texture, person, text, and more (Fig. 3.17). Parsing is performed by constructing a parsing graph W. The graph is hierarchical (tree-like) in the sense that the root node represents the complete scene and each sibling represents a pattern that can be, in turn, decomposed. There are also horizontal edges between nodes in the same level of the hierarchy. Such edges define spatial relationships between patterns. Hierarchical edges represent generative (top-down) processes. More precisely, a graph W is composed of the root node representing the entire scene and a set of K siblings (one per pattern). Each of these siblings i = 1, ..., K (intermediate nodes) is a triplet of attributes (L_i, ζ_i, Θ_i) consisting of the shape descriptor L_i determining the region R(L_i) = R_i (all regions corresponding to the patterns must be disjoint and their union must be the scene); the type (family) of visual pattern ζ_i (faces, text characters, and so on); and the model parameters Θ_i (see Fig. 3.21). Therefore, the tree is given by W = (K, {(L_i, ζ_i, Θ_i): i = 1, ..., K}), where K is, of course, unknown (model order selection). Thus, the posterior of a candidate generative solution W is quantified by

p(W |I) = p(I|W )p(W )




K 
K
= p(IR(Li ) |Li , ζi , Θi ) p(K) p(Li )p(ζi |Li )p(Θi |ζi ) , (3.121)
i=1 i=1

where it is reasonable to assume that p(K) and p(Θ_i|ζ_i) are uniform, and the term p(ζ_i|L_i) allows penalizing high model complexity and may be estimated from the training samples (learning the best parameters identifying samples of a given model type, as in the example of texture generation described in Chapter 5). In addition, the model p(L_i), being L_i = ∂R(L_i) the contour, is assumed to decay exponentially with its length and the enclosed area, when

Fig. 3.17. A complex image parsing graph with many levels and types of patterns (a football match scene decomposed into sports field, spectators, persons, faces, text, textures, curve groups, point processes, and color regions). (Figure by Tu et al., © 2005 Springer.) See Color Plates.

it is referred to generic visual patterns and faces. However, when considering


the shape of an alphabetic character or digit, such model relies on the tem-
plates and the allowed deformations. In this latter case, severe rotations and
distortions are penalized. In addition, assuming a B-spline representation of
the contour, high elastic deformation with respect to the reference template
is penalized also by an exponential decay.
On the other hand, several intensity models for computing the likelihood p(I_{R(L_i)}|L_i, ζ_i, Θ_i) are considered. The first one (p_1) is the constant intensity model with the two parameters of a Gaussian. The second one (p_2) is the nonparametric clutter/texture model given by the factorization of intensity frequencies inside a given region (n_j is the number of pixels with intensity value j ∈ {1, ..., G}, and h_j is the frequency of the jth histogram bin). The third is a shading model p_3, where each pixel intensity is characterized by a Gaussian defined over the difference between this intensity and a quadratic form. The same model is used for the 62 characters (including 10 digits and 26 × 2 = 52 letters) in both lower and uppercase. Thus, C = (5, ..., 66) and the quadratic form is J_{p=(x,y)} = ax² + bxy + cy² + dx + ey + f. Finally, the face model p_4 is given by a multidimensional Gaussian over the difference between the region and its de-projection onto a face eigenspace (principal components {u_i} and eigenvectors (φ_1, ..., φ_n), that is, u = Σ_{i=1}^{n} u_i φ_i) learnt from the samples.

p_1(I_{R(L)} \,|\, L, \zeta = 1, \Theta) = \prod_{p \in R(L)} G(I_p - \mu; \sigma^2), \qquad \Theta = (\mu, \sigma)
p_2(I_{R(L)} \,|\, L, \zeta = 2, \Theta) = \prod_{j=0}^{G} h_j^{n_j}, \qquad \Theta = (h_1, \ldots, h_G)
p_3(I_{R(L)} \,|\, L, \zeta \in \{3, C\}, \Theta) = \prod_{p \in R(L)} G(I_p - J_p; \sigma^2), \qquad \Theta = (a, \ldots, f, \sigma)
p_4(I_{R(L)} \,|\, L, \zeta = 4, \Theta) = G(I_{R(L)} - u; \Sigma), \qquad \Theta = (\phi_1, \ldots, \phi_n, \Sigma)   (3.122)
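As an illustration, the following sketch evaluates (in log form, for numerical stability) the first two likelihood models of Eq. 3.122 on a hypothetical gray-level region; the data and parameter values are arbitrary:

import numpy as np

def log_p1_constant(intensities, mu, sigma):
    """Log-likelihood of the constant-intensity model p1 in Eq. 3.122:
    an independent Gaussian G(I_p - mu; sigma^2) per pixel of the region."""
    z = (np.asarray(intensities, float) - mu) / sigma
    return float(np.sum(-0.5 * z**2 - np.log(sigma * np.sqrt(2.0 * np.pi))))

def log_p2_histogram(intensities, h, G=256):
    """Log-likelihood of the clutter/texture model p2 in Eq. 3.122: the product of
    the histogram frequencies h_j raised to the pixel counts n_j of the region."""
    counts = np.bincount(np.asarray(intensities, int), minlength=G)
    return float(np.sum(counts * np.log(h + 1e-12)))

# Illustrative region of 100 pixels centered around gray level 120.
rng = np.random.default_rng(5)
region = rng.normal(120, 5, size=100).clip(0, 255)
h = np.bincount(region.astype(int), minlength=256) / region.size
print(log_p1_constant(region, mu=120.0, sigma=5.0), log_p2_histogram(region.astype(int), h))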

Under a pure generative strategy, the inference of W ∈ Ω may be posed in terms of sampling the posterior W ∼ p(W|I) ∝ p(I|W)p(W), for instance, as W* = arg max_{W∈Ω} p(W|I). However, Ω is huge; consider the finite space of all parsing graphs, and imagine the possible types of transitions between graphs (create nodes, delete nodes, and change node attributes). Such hugeness recommends data-driven Markov chain Monte Carlo (DDMCMC) to reduce the temporal complexity of the sampling, as we have seen in Section 3.4. Therefore, generative processes should be complemented by discriminative (bottom-up) ones for finding subgraphs and proposing them to the top-down process, in both an intertwined and competitive processing for composing the parts of the image. Discriminative processes are fast, but they are prone to errors and lose the global context. Let W = (w_1, ..., w_K), being w_i the nodes, denote a state of Ω. The pure generative approach lets us synthesize the input image I through sampling the posterior p(W|I). However, discriminative methods, instead of providing global posteriors quantifying the probability that the image is generated, give conditional probabilities of how the parts are generated, q(w_j|T_j(I)). Part generation is driven by a set of bottom-up tests T_j, like the application of a boosted classifier (as we will see below and also, in more detail, in Chapter 7). Each test T_j relies on a set of local image features F_{j,n}(I), that is, T_j(I) = (F_{j,1}(I), ..., F_{j,n}(I)), j = 1, ..., K.
Among these features, one may consider edge cues, binarization cues, face
region cues, text region cues, shape affinity cues, region affinity cues, model
parameter cues, and pattern family cues among others. Probabilities for edge
cues may be learnt from data when using the statistical detectors presented
in Chapter 2. Binarization cues are used to propose boundaries of text char-
acters and rely on a binarization algorithm that runs with different parameter
settings; the discriminative probability is represented nonparametrically by
a weighted set of particles. Regarding both face and text region cues, they
are learnt with a probabilistic version of the Adaboost [61] algorithm. The
output of such probabilistic version is the probability for the presence of a
face/text in a region. In the case of text, edge and binarization cues are
integrated in order to propose the boundary of characters. More precisely,
given a set of training images, for instance with, say, text (the same reasoning
follows for faces), one selects windows containing text as positive examples
and nontext windows as negative ones; the purpose of Adaboost is to learn

a binary-valued strong classifier h(T (I)) for test T (I) = (h1 (I), . . . , hn (I))
composed of n binary-valued weak classifiers (their performance is slightly
better than chance). Regarding shape-affinity cues and region-affinity ones,
they propose matches between shape boundaries and templates, and estimate
the likelihood that two regions were generated by the same pattern family
and model parameters. Model parameter and pattern family cues are based
on clustering algorithms (mean-shift, for instance see Chapter 5), which de-
pend on the model types. For the case of boosting, the hard classifier h_f(T(I)) can be learnt as a linear combination of the weak ones h_i(I):

h_f(T(I)) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i\, h_i(I)\right) = \mathrm{sign}(\alpha \cdot T(I))   (3.123)

where α = (α_1, ..., α_n) is the vector of weights, and n is the number of selected features (the elements of T) from a pool or dictionary D of, say, m ≥ n features (unselected features are assumed to have a zero weight). Then, given a training set of labeled images (positive and negative examples, for instance faces and nonfaces), X = {(I_i, ℓ_i) : i = 1, ..., M, ℓ_i ∈ {+1, −1}}, the Adaboost algorithm (see Chapter 5) greedily optimizes the following cost function:

(\alpha^*, T^*) = \arg\min_{\alpha,\, T \subset D} \sum_{i=1}^{M} e^{-\ell_i (\alpha \cdot T(I_i))}   (3.124)

that is, an exponential loss is assumed when the proposed hard classifier works incorrectly, and the magnitude of the decay is the dot product. The connection between boosting and the learning of discriminative probabilities q(w_j|T_j(I)) is Friedman's theorem, which states that, with enough training samples M and selected features n, Adaboost selects the weights and tests satisfying

q(\ell = \zeta \,|\, I) = \frac{e^{\zeta(\alpha \cdot T(I))}}{e^{(\alpha \cdot T(I))} + e^{-(\alpha \cdot T(I))}}   (3.125)

and the strong classifier converges asymptotically to the ratio test

h_f(T(I)) = \mathrm{sign}(\alpha \cdot T(I)) = \mathrm{sign}\left(\log \frac{q(\ell = +1|I)}{q(\ell = -1|I)}\right)   (3.126)
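A minimal sketch of how the Adaboost output is turned into a discriminative probability according to Eq. 3.125 follows; the weak-classifier outputs and weights below are hypothetical:

import numpy as np

def boosted_posterior(weak_outputs, alphas):
    """Discriminative probability induced by boosting (Eq. 3.125): given binary
    weak-classifier outputs h_i(I) in {-1, +1} and their weights alpha_i,
    q(l = +1 | I) = exp(F) / (exp(F) + exp(-F)) with F = alpha . T(I)."""
    F = float(np.dot(alphas, weak_outputs))
    q_pos = np.exp(F) / (np.exp(F) + np.exp(-F))
    label = 1 if F >= 0 else -1               # sign(alpha . T(I)), as in Eq. 3.126
    return q_pos, label

# Three hypothetical weak classifiers voting on one window.
q_pos, label = boosted_posterior(weak_outputs=[+1, -1, +1], alphas=[0.9, 0.3, 0.5])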

Given this theorem, it is then reasonable to think that q(ℓ|T(I)) converges to an approximation of the marginal p(ℓ|I), because a limited number of tests (features) are used. Regarding the training process for Adaboost, it is effective for faces and text. For learning text, features of different computational complexities are used [38]. The simplest ones are means and variances of intensity and of vertical or horizontal gradients, or of gradient magnitudes. More complex features are histograms of intensity, gradient direction, and intensity gradient. The latter two types of features may be assimilated to the statistical edge detection framework (Chapter 2), so that it is straightforward to design a

Fig. 3.18. Results of face and text detection. False positives and negatives appear. (Figure by Tu et al., © 2005 Springer.)

weak classifier as whether the log-likelihood ratio between the text and nontext distributions is above a given threshold or not (here, it is important to consider Chernoff information and obtain peaked empirical distributions, as we remark in Prob. 3.12). More complex features correspond, for instance,
to edge detection and linkage. When a high number of features are considered,
weak classifiers rely either on individual log-likelihood ratios or on ratios over
pairwise histograms. Anyway, it is impossible to set the thresholds used in
the weak classifiers to eliminate all false positives and negatives at the same
time (see Fig. 3.18). In a DDMCMC approach, such errors must be corrected
by generative processes, as occurs in the detection of number 9, which will be
detected as a shading region and later recognized as a letter. Furthermore,
in order to discard rapidly many parts of the image that do not contain a
text/face, a cascaded classifier is built. A cascade is a degenerated tree with
a classifier at each level; if a candidate region succeeds at a given level, it is
passed to the following (deeper) one, and otherwise, is discarded. Considering
the number of levels unknown beforehand, such number may be found if one
sets the maximum acceptable false positive rate per layer, the minimum ac-
ceptance rate per layer, and the false overall positive rate [170]. Furthermore,
as the classifiers in each layer may have different computational complexities,
it is desirable to allocate the classifiers to the levels by following also this cri-
terion. Although the problem of finding the optimal cascade, given the false
positive rate, zero false negatives in the training set, and the average com-
plexity of each classifier, is NP-complete, a greedy (incremental) solution has
been proposed in [39]. The rationale of such approach is that, given a maxi-
mum time to classify, it is desirable to choose for a given layer the classifier
maximizing the expected remaining time normalized by the expected number
of regions remaining to be rejected. If the fixed time is not enough to succeed
for a given maximum rate of false positives, additional time is used. Anyway,
this strategy favors the positioning of simple classifiers at the first levels and speeds up the uniform-time cascade by a factor of 2.5 (see results in Fig. 3.19).
Once we have presented both the generative models and the discrimina-
tive methods, it is time to present the overall structure of the bidirectional

Fig. 3.19. Results of text detection in a supermarket: a good application for the visually impaired. (Figure by Chen et al., © 2005 IEEE.)

algorithm (see Fig. 3.20). Starting bottom-up, there are four types of computations of q(w|T(I)) (one for each type of discriminative task associated with a node w in the tree). The key insight of the DDMCMC is that these computations are exploited by the top-down processes, that is, these generative processes are not fully stochastic. More precisely, the top-down flow, that is, the state transitions W → W', is controlled by a Markov chain kernel K(W, W'), which is the core of the Metropolis–Hastings dynamics. In this case, such kernel is decomposed into four subkernels K_a, a = 1, ..., 4, each one activated with a given probability ρ(a, I). In turn, each of the subkernels that alters the structure of the parsing tree (all except the model switching kernel and the region-competition kernel moving the borders, which is not included in the figure) is subdivided into two moves, K_ar and K_al (the first one for node creation, and the second one for node deletion). The corresponding probabilities ρ_ar and ρ_al are also defined.

3.6.2 The Data-Driven Generative Model

The computational purpose of the main kernel K with respect to the parse tree is to generate moves W → W' of three types (node creation, node deletion,

Fig. 3.20. Bidirectional algorithm for image parsing: the main Markov kernel K(W, W') is split into text, face, generic-region, and model-switching subkernels (with birth/death and split/merge moves, activated with probabilities ρ(a, I)), whose generative inference is fed by the discriminative proposals q(w_i|Tst_i(I)) computed from text detection, face detection, edge partition, and parameter clustering on the input image I. (Figure by Tu et al., © 2005 Springer.)

and change of node attributes) and drive the search toward sampling the posterior p(W|I). The main kernel is defined in the following terms:

K(W'|W : I) = \sum_{a} \rho(a : I)\, K_a(W'|W : I), \quad \text{where} \quad \sum_{a} \rho(a : I) = 1   (3.127)

and ρ(a : I) > 0. In the latter definition, the key fact is that both the activation probabilities and the subkernels depend on the information in the image I. Furthermore, the subkernels must be reversible (see Fig. 3.21) so that the main kernel satisfies such property. Reversibility is important to ensure that the posterior p(W|I) is the stationary (equilibrium) distribution. Thus, keeping in mind that K_a(W'|W : I) is a transition matrix representing the probability of the transition W → W' (obviously Σ_{W'} K_a(W'|W : I) = 1, ∀W), kernels with creation/deletion moves must be grouped into reversible pairs (creation with deletion): K_a = ρ_{ar} K_{ar}(W'|W : I) + ρ_{al} K_{al}(W'|W : I), being ρ_{ar} + ρ_{al} = 1. With this pairing, it is ensured that K_a(W'|W : I) > 0 ⇔ K_a(W|W' : I) > 0, ∀ W, W' ∈ Ω, and after that pairing K_a is built in order to satisfy

p(W|I)\, K_a(W'|W : I) = p(W'|I)\, K_a(W|W' : I)   (3.128)

which is the so-called detailed balance equation [175] whose fulfillment ensures
reversibility. If all subkernels are reversible, the main one is reversible too.

Fig. 3.21. Example of parsing transitions. (Figure by Tu et al., © 2005 Springer.)

Another key property to fulfill is ergodicity (it is possible to go from any state to every other state, that is, it is possible to escape from local optima). In this case, ergodicity is ensured provided that enough moves are performed. Reversibility and ergodicity ensure that the posterior p(W|I) is the invariant probability of the Markov chain. Being μ_t(W) the Markov chain probability of state W, we have

\mu_{t+1}(W) = \sum_{W'} K_{a(t)}(W|W')\, \mu_t(W')   (3.129)

Thus, given an initial state W_0 with probability ν(W_0), it happens that μ_t(W) approaches the posterior monotonically as time t increases:

W \sim \mu_t(W) = \nu(W_0) \cdot [K_{a(1)} \circ \cdots \circ K_{a(t)}](W_0, W) \rightarrow p(W|I)   (3.130)

Monotonicity is important, but quantifying the rate of convergence for a given subkernel is key to measuring its effectiveness or usefulness with respect to other subkernels. In this regard, the Kullback–Leibler divergence between the posterior and the Markov chain state probability for a given kernel K_a decreases monotonically, and such a decrease is

\delta(K_a) = D(p(W|I)\,||\,\mu_t(W)) - D(p(W|I)\,||\,\mu_{t+1}(W)) \geq 0   (3.131)

and δ(K_a) > 0, being only zero when the Markov chain becomes stationary, that is, when p = μ. Denoting by μ_t(W_t) the state probability at time t, the one at time t+1 is given by

\mu_{t+1}(W_{t+1}) = \sum_{W_t} \mu_t(W_t)\, K_a(W_{t+1}|W_t)   (3.132)

and the joint probability μ(W_t, W_{t+1}) may be defined as follows:

\mu(W_t, W_{t+1}) = \mu_t(W_t)\, K_a(W_{t+1}|W_t) = \mu_{t+1}(W_{t+1})\, p_{MC}(W_t|W_{t+1})   (3.133)

being p_{MC}(W_t|W_{t+1}) the posterior of state W_t at time t conditioned on state W_{t+1} at time t+1. On the other hand, the joint probability at equilibrium (when p = μ) may be defined, after exploiting the detailed balance equation in the second equality, as

p(W_t, W_{t+1}) = p(W_t)\, K_a(W_{t+1}|W_t) = p(W_{t+1})\, K_a(W_t|W_{t+1})   (3.134)
Then, the divergence between the latter two joint probabilities can be expressed in terms of W_t:

D(p(W_t, W_{t+1})\,||\,\mu(W_t, W_{t+1})) = \sum_{W_{t+1}} \sum_{W_t} p(W_t, W_{t+1}) \log \frac{p(W_t, W_{t+1})}{\mu(W_t, W_{t+1})}
= \sum_{W_{t+1}} \sum_{W_t} p(W_t)\, K_a(W_{t+1}|W_t) \log \frac{p(W_t)\, K_a(W_{t+1}|W_t)}{\mu_t(W_t)\, K_a(W_{t+1}|W_t)}
= \sum_{W_t} p(W_t) \log \frac{p(W_t)}{\mu_t(W_t)} \sum_{W_{t+1}} K_a(W_{t+1}|W_t)
= D(p(W_t)\,||\,\mu_t(W_t))   (3.135)


But the divergence may also be posed in terms of W_{t+1}:

D(p(W_t, W_{t+1})\,||\,\mu(W_t, W_{t+1})) = \sum_{W_{t+1}} \sum_{W_t} p(W_{t+1})\, K_a(W_t|W_{t+1}) \log \frac{p(W_{t+1})\, K_a(W_t|W_{t+1})}{\mu_{t+1}(W_{t+1})\, p_{MC}(W_t|W_{t+1})}
= \sum_{W_{t+1}} \sum_{W_t} p(W_{t+1})\, K_a(W_t|W_{t+1}) \log \frac{p(W_{t+1})}{\mu_{t+1}(W_{t+1})} + \sum_{W_{t+1}} \sum_{W_t} p(W_{t+1})\, K_a(W_t|W_{t+1}) \log \frac{K_a(W_t|W_{t+1})}{p_{MC}(W_t|W_{t+1})}
= D(p(W_{t+1})\,||\,\mu_{t+1}(W_{t+1})) + \sum_{W_{t+1}} p(W_{t+1}) \sum_{W_t} K_a(W_t|W_{t+1}) \log \frac{K_a(W_t|W_{t+1})}{p_{MC}(W_t|W_{t+1})}
= D(p(W_{t+1})\,||\,\mu_{t+1}(W_{t+1})) + E_{p(W_{t+1})}\left[ D(K_a(W_t|W_{t+1})\,||\,p_{MC}(W_t|W_{t+1})) \right]   (3.136)
Therefore, we have that

\delta(K_a) \equiv D(p(W_t)\,||\,\mu_t(W_t)) - D(p(W_{t+1})\,||\,\mu_{t+1}(W_{t+1})) = E_{p(W_{t+1})}\left[ D(K_a(W_t|W_{t+1})\,||\,p_{MC}(W_t|W_{t+1})) \right] \geq 0   (3.137)
that is, δ(Ka ) measures the amount of decrease of divergence for the kernel Ka ,
that is, the convergence power of such kernel. It would be interesting to con-
sider this information to speed up the algorithm sketched in Fig. 3.20, where
3.6 Integrating Segmentation, Detection and Recognition 95

each kernel is activated with probability ρ(.). What is done for the moment
is to make the activation probability dependent on the bottom-up processes.
For texts and faces, ρ(a ∈ 1, 2 : I) = {ρ(a : I) + kg(N (I))/Z, being N (I)
the number of text/faces proposals above a threshold ta , g(x) = x, x ≤ Tb ,
g(x) = Tb , x ≥ Tb , and Z = 1 + 2k (normalization). For the rest of kernels,
there is a fixed value that is normalized accordingly with the evolution of ac-
tivation probabilities of the two first kernels ρ(a ∈ 3, 4 : I) = ρ(a ∈ 3, 4 : I)/Z.
Once we have specified the design requirements of the subkernels and characterized them in terms of convergence power, the next step is to design them according to the Metropolis–Hastings dynamics:

K_a(W'|W : I) = Q_a(W'|W : T_a(I)) \min\{1, \alpha(W'|W : I)\}, \quad W' \neq W   (3.138)

being Q_a(W'|W : T_a(I)) the proposal probability of the transition, and α(W'|W : I) the acceptance probability defined as

\alpha(W'|W : I) = \min\left\{1, \frac{p(W'|I)\, Q_a(W|W' : T_a(I))}{p(W|I)\, Q_a(W'|W : T_a(I))}\right\}   (3.139)

The key elements in the latter definition are the proposal probabilities Q_a, which consist of a factorization of several discriminative probabilities q(w_j|T_j(I)) for the elements w_j changed in the proposed transition W → W'. Thus, we are assuming implicitly that the Q_a are fast to compute, because many of them rely on discriminative processes. For the sake of additional global efficiency, it is desirable that Q_a proposes transitions where the posterior p(W'|I) is very likely to be high and, at the same time, that the moves are as large as possible. Therefore, let Ω_a(W) = {W' ∈ Ω : K_a(W'|W : I) > 0} be the scope, that is, the set of reachable states from W in one step using K_a. However, not only large scopes are desired, but also scopes containing states with high posteriors. Under this latter consideration, the proposals should be designed as follows:

Q_a(W'|W : T_a(I)) \sim \frac{p(W'|I)}{\sum_{W'' \in \Omega_a(W)} p(W''|I)} \quad \text{if } W' \in \Omega_a(W)   (3.140)

and should be zero otherwise. For that reason, the proposals for creating/deleting texts/faces can consist of a set of weighted particles (Parzen windows were used in DDMCMC for segmentation in Section 3.4). More precisely, Adaboost, assisted by a binarization process for detecting character boundaries, yields a list of candidate text shapes. Each particle z (shape) is weighted by ω. We have two sets, one for creating and the other for deleting text characters:

S_{ar}(W) = \{(z_{ar}^{(\mu)}, \omega_{ar}^{(\mu)}) : \mu = 1, \ldots, N_{ar}\}
S_{al}(W) = \{(z_{al}^{(\nu)}, \omega_{al}^{(\nu)}) : \nu = 1, \ldots, N_{al}\}   (3.141)

where a = 1 for text characters. The weights ω_{ar} for creating new characters are given by a similarity measure between the computed border and the deformable template. The weights ω_{al} for deleting characters are given by their posteriors. The idea behind the creation and deletion weights is to approximate the ratios p(W'|I)/p(W|I) and p(W|I)/p(W'|I), respectively. Anyway, the proposal probabilities given by weighted particles are defined by

Q_{ar}(W'|W : I) = \frac{\omega_{ar}(W')}{\sum_{\mu=1}^{N_{ar}} \omega_{ar}^{(\mu)}}, \qquad Q_{al}(W|W' : I) = \frac{\omega_{al}(W)}{\sum_{\nu=1}^{N_{al}} \omega_{al}^{(\nu)}}   (3.142)

The latter definition is valid for a = 1, 2, 3, 4. What changes in each case is the way that particles and weights are built. For faces (a = 2), face boundaries are obtained through edge detection, and proposals are obtained in a similar way as in the case of text characters. For a = 3 (region split and region merge), the best region for splitting is the one worst fitted to its model, and the best regions to merge are the ones with higher affinity in statistical terms (and also where a common border exists!). In terms of splitting, particles are obtained from Canny edge detectors at different scales (levels of detail). On the other hand, merging relies on the proposal of removing boundary fragments. Finally, for a = 4 (model switching or changing the region type), the proposal probabilities rely on approximating the ratio p(W'|I)/p(W|I) through particles. Roughly speaking, the weight is dominated by the ratio between the joint probability of the new label ζ' and new parameters Θ' and the joint probability of the current model labels and parameters, multiplied by the likelihood of each region given the current label, region label and region parameters. A similar ratio is key in the computation of the particles for split and merge regions, though in the region case, the ratio is dominated by an affinity measure between both regions (see Prob. 3.13).

3.6.3 The Power of Discriminative Processes

The key insight of the DDMCMC approach is exploiting bottom-up knowl-


edge coming from discriminative computations. In the current version of the
algorithm, the sequence of application of tests is not optimized at all. How-
ever, it is reasonable to think that q(wj |T (I)) → p(wj |I) (being T (I) the test
providing the optimal approximation of the true marginal of the posterior)
as new tests are added. In terms of information theory, we should sequence,
at least greedily, the tests, so that the maximum amount of information is
obtained as soon as possible. Choosing such sequence, by selecting at each
time the test with maximum information gain, is the driving mechanism for
learning decision trees, as we will see in Chapter 7. We then proceed to quantify the concept of information gain in the context of image parsing. Thus,
the information gained for a variable w by a new test T+ is the decrease of

Kullback–Leibler divergence between p(w|I) and its best discriminative esti-


mate, or the increase of mutual information between w and the tests:

EI [D(p(w|I)||q(w|T (I)))] − EI [D(p(w|I)||q(w|T (I), T+ (I)))]


= I(w; T (I), T+ (I)) − I(w; T (I))
= ET,T+ D(q(w|T (I), T+ (I))||q(w|T (I))) ≥ 0 (3.143)

where E_I is the expectation with respect to p(I) and E_{T,T_+} the one with respect to the probability of test responses (T, T_+) induced by p(I). The key insight behind the equalities above is twofold. Firstly, the divergence of q(·) with respect to the marginal p(·) decreases as new tests are added. Secondly, the decrease of the average divergence yields a useful measure for quantifying the effectiveness of a test with respect to other choices.
Regarding the proof of the equalities in Eq. 3.143, let us start by finding a
more compact expression for the difference of expectations EI . For the sake of
clarity, in the following we are going to drop the dependency of the tests on I,
while keeping in mind that such dependency actually exists, that is, T = T (I)
and T+ = T+ (I):

E_I[D(p(w|I)||q(w|T))] − E_I[D(p(w|I)||q(w|T, T_+))]

  = Σ_I p(I) Σ_w p(w|I) log [p(w|I)/q(w|T)] − Σ_I p(I) Σ_w p(w|I) log [p(w|I)/q(w|T, T_+)]

  = Σ_I p(I) Σ_w p(w|I) { log [p(w|I)/q(w|T)] − log [p(w|I)/q(w|T, T_+)] }

  = Σ_I p(I) Σ_w p(w|I) log [q(w|T, T_+)/q(w|T)]

  = Σ_I Σ_w p(w, I) log [q(w|T, T_+)/q(w|T)]        (3.144)

The next step consists of reducing the difference of mutual informations I(w; T, T_+) − I(w; T) to the final expression in Eq. 3.144. To that end, we exploit the following properties:

p(I, T ) = p(T |I)p(I) = p(I)


p(w, I) = p(w, I, T, T+ ) = q(w, T, T+ )p(x|T, T+ )
p(w, I) = p(w, I, T ) = q(w, T )p(x|T ) (3.145)

where x denotes the dimensions of w that are independent of T and T_+. The first property follows from the fact that a test is a deterministic function of the image, and the other two factorize the joint distribution of w and the

tests in terms of both the dependent and independent dimensions. Then, we


have the following derivation:

I(w; T, T_+) − I(w; T)

  = Σ_{T,T_+} Σ_w q(w, T, T_+) log [q(w, T, T_+)/(q(T, T_+)q(w))] − Σ_T Σ_w q(w, T) log [q(w, T)/(q(T)q(w))]

  = Σ_{T,T_+} Σ_w { q(w, T, T_+) log [q(w, T, T_+)/(q(T, T_+)q(w))] − q(w, T) log [q(w, T)/(q(T)q(w))] }

  = Σ_{T,T_+} Σ_w { q(w, T, T_+) log q(w|T, T_+) − q(w, T) log q(w|T) }

  = Σ_{T,T_+} Σ_w { [p(w, I)/q(x|T, T_+)] log q(w|T, T_+) − [p(w, I)/q(x|T)] log q(w|T) }

  = Σ_{T,T_+} Σ_w { [p(w, I)/q(x)] log q(w|T, T_+) − [p(w, I)/q(x)] log q(w|T) }

  = Σ_I [1/q(I_x)] Σ_w { p(w, I) log q(w|T, T_+) − p(w, I) log q(w|T) }

  = Σ_I Σ_w p(w, I) log [q(w|T, T_+)/q(w|T)]        (3.146)

where the change of variables from (T, T_+) to I is bidirectional because the distribution of the tests is induced by p(I). It is then possible to sum out q(x), where I_x denotes the value of x in each image. Given the last expression in Eq. 3.146, and the above considerations about the change of variables, we have
Σ_I Σ_w p(w, I) log [q(w|T, T_+)/q(w|T)]

  = Σ_I Σ_w q(w, T, T_+) p(I_x|T, T_+) log [q(w|T, T_+)/q(w|T)]

  = Σ_I p(I_x|T, T_+) Σ_w q(w, T, T_+) log [q(w|T, T_+)/q(w|T)]

  = Σ_I p(I_x|T, T_+) Σ_w q(w|T, T_+) q(T, T_+) log [q(w|T, T_+)/q(w|T)]

  = Σ_I [p(I_x, T, T_+)/p(T, T_+)] q(T, T_+) Σ_w q(w|T, T_+) log [q(w|T, T_+)/q(w|T)]

  = Σ_I [p(I_x) p(T, T_+)/p(T, T_+)] q(T, T_+) Σ_w q(w|T, T_+) log [q(w|T, T_+)/q(w|T)]

  = Σ_I p(I_x) q(T, T_+) Σ_w q(w|T, T_+) log [q(w|T, T_+)/q(w|T)]

  = Σ_{T,T_+} q(T, T_+) Σ_w q(w|T, T_+) log [q(w|T, T_+)/q(w|T)]

  = E_{T,T_+} D(q(w|T, T_+)||q(w|T))        (3.147)

As the Kullback–Leibler divergence is nonnegative, its expectation is also


nonnegative.

3.6.4 The Usefulness of Combining Generative and Discriminative

As we have seen, discriminative processes are fast but prone to error, whereas generative ones are optimal but too slow. Image parsing enables competitive and cooperative processes for patterns in an efficient manner. The algorithm takes 10–20 min to process images, with results similar to those presented in Fig. 3.22, where the advantage is having generative models for synthesizing possible solutions. However, additional improvements, like a better management of the segmentation graph, may reduce the overall computation time to under a minute.
Bottom-up/top-down integration is not new either in computer vision or in biological vision. For instance, in the classical work of Ullman [161], the counter-stream structure is a computational model applied to pictorial face recognition, where the role of integrating bottom-up/top-down processes is to compensate for image-to-model differences in both directions: from pictures to models (dealing with variations of position and scale) and from models to pictures (solving differences of viewing direction, expression and illumination). However, it is very difficult and computationally intensive to build a generative model for faces that takes into account all these sources of variability, unless several subcategories of the same face (with different expressions) are stored and queried as the current hypothesis. Other key aspects related to the combination of recognition paradigms (pure pictorial recognition vs. recognition from parts to objects) and the capability of generalizing from a reduced number of views are classical topics in both computer and biological vision (see for instance [152]).
Recently emerged schemes like bag of words (BoW) are beginning to incorporate information-theoretic elements (see for instance [181], where

Fig. 3.22. Results of image parsing (center column) showing the synthesized images. (Figure by Tu et al. © 2005 Springer.)

information-bottleneck, a clustering criterion to be explained in Chapter 5, is


exploited for categorization).

Problems
3.1 Implicit MDL and region competition
The region competition approach by Zhu and Yuille [184] is a good example
of implicit MDL. The criterion to minimize, when independent probability
models are assumed for each region, is the following:


E(Γ, {Θ_i}) = Σ_{i=1}^{K} { (μ/2) ∮_{∂R_i} ds − ∫∫_{R_i} log p(I(x, y)|Θ_i) dx dy + λ }

where Γ = ∪_{i=1}^{K} ∂R_i, the first term is the length of the curve defining the boundary ∂R_i, and μ is the code length per unit arc (it is divided by 2 because each edge fragment is shared by two regions). The second term is the cost of coding each pixel inside R_i according to the distribution specified by Θ_i. The MDL-driven minimization is done in two steps. In the first one, we estimate the optimal parameters Θ_i^* as those maximizing ∏_{(x,y)∈R_i} p(I(x, y)|Θ_i). In the

second phase, we re-estimate each contour following the motion equation for
the common contour Γij between two adjacent regions Ri and Rj :
 
dΓ_ij/dt = −μ κ_i n_i + log [ p(I(x, y)|Θ_i) / p(I(x, y)|Θ_j) ] n_i
where κi is the curvature, and ni = −nj is the normal (κi ni = κj nj ). Find an
analytical expression for the two steps assuming that the regions are character-
ized by Gaussian distributions. Then, think about the role of the log-likelihood
ratio in the second term of the motion equation. For the Gaussian case, give examples of how decisive the log-ratio is when the distributions have the same variance, but closer and closer averages. Hint: evaluate the ratio by computing the Chernoff information.
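A quick numerical companion to the hint (a sketch under the stated Gaussian assumptions, not a full region-competition implementation): for two Gaussians with equal variance, the Chernoff information reduces to (μ1 − μ2)²/(8σ²), so both quantities can be evaluated directly as the means get closer.

```python
import numpy as np

def log_likelihood_ratio(x, mu1, mu2, sigma):
    """Pixel-wise log p(x|mu1,sigma) - log p(x|mu2,sigma), equal variances."""
    return ((x - mu2) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)

def chernoff_equal_variance(mu1, mu2, sigma):
    """Chernoff information of N(mu1,sigma^2) vs N(mu2,sigma^2)."""
    return (mu1 - mu2) ** 2 / (8.0 * sigma ** 2)

rng = np.random.default_rng(0)
sigma = 10.0
for delta in [40.0, 20.0, 10.0, 5.0, 1.0]:
    mu1, mu2 = 100.0, 100.0 + delta
    x = rng.normal(mu1, sigma, size=1000)          # samples from region i
    print(delta,
          log_likelihood_ratio(x, mu1, mu2, sigma).mean(),   # average log-ratio
          chernoff_equal_variance(mu1, mu2, sigma))
```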
3.2 Green’s theorem and flow 
The derivation of energy functions of the form E(Γ) = ∫∫_R f(x, y) dx dy is usually done by exploiting Green's theorem, which states that for any vector field (P(x, y), Q(x, y)) we have

∫∫_R ( ∂Q/∂x − ∂P/∂y ) dx dy = ∮_{∂R} (P dx + Q dy) = ∫_0^l (P ẋ + Q ẏ) ds

We must choose P and Q so that ∂Q/∂x − ∂P/∂y = f(x, y). For instance, set

Q(x, y) = (1/2) ∫_0^x f(t, y) dt,     P(x, y) = −(1/2) ∫_0^y f(x, t) dt

Using L(x, ẋ, y, ẏ) = Q(x, y) ẏ + P(x, y) ẋ, show that we can write E(Γ) = ∫_0^l L(x, ẋ, y, ẏ) ds. Prove also that, using the Euler–Lagrange equations, we finally find that

E_Γ(Γ) = f(x, y) n

where n = (ẏ, −ẋ) is the normal and (P, Q) = (F_1, F_2).
3.3 The active square
Consider an image with a square foreground whose intensities follow a Gaussian distribution with parameters (μ1, σ1), over a background that is also Gaussian with (μ2, σ2). Initialize a square active polygon close to the convergence point and run some iterations to observe the behavior of the polygon. Test two different cases: (i) quite different Gaussians, and (ii) very similar Gaussians.
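A minimal sketch for generating the synthetic test image of this problem (all parameter values below are arbitrary choices):

```python
import numpy as np

def square_test_image(size=128, side=48, mu_fg=150.0, sigma_fg=10.0,
                      mu_bg=100.0, sigma_bg=10.0, seed=0):
    """Background of Gaussian intensities with a Gaussian square foreground."""
    rng = np.random.default_rng(seed)
    img = rng.normal(mu_bg, sigma_bg, size=(size, size))
    lo = (size - side) // 2
    img[lo:lo + side, lo:lo + side] = rng.normal(mu_fg, sigma_fg,
                                                 size=(side, side))
    return np.clip(img, 0, 255)

# Case (i): quite different Gaussians; case (ii): very similar ones.
img_easy = square_test_image(mu_fg=180.0, mu_bg=80.0)
img_hard = square_test_image(mu_fg=105.0, mu_bg=100.0)
```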
3.4 Active polygons and maximum entropy
What is the role of the maximum entropy principle in the active polygons configuration? Why is maximum entropy estimation key in this context? What is the main computational advantage of active polygons vs. active contours?
3.5 Jensen–Shannon divergence
Why is the Jensen–Shannon divergence used for discriminating the foreground
from the background?

3.6 Jensen–Shannon divergence and active polygons


What is the formal connection between the Jensen–Shannon divergence and the Chan and Vese functional? Think about connections with other kinds of functionals.

3.7 Jensen–Shannon divergence and active contours


Establish a formal link between the Jensen–Shannon divergence as a discrimi-
native tool and the observation model used in B-splines-based contour fitting.
Think about the impact of replacing the observation model based on parametric distributions (used with B-splines) by a more general model-free criterion. What would be the impact of this change on the MDL formulation?

3.8 Alternative contour representations and MDL


The quantification of complexity used in the MDL-based algorithm for seg-
menting with contours results in a quite simple expression (firstly depend-
ing on the variances, and later assuming that the cost of a parameter is the
logarithm of the image dimension). However, if the contour representation
changes, it is necessary to reformulate the criterion. Consider, for instance,
the alternative frequency-based representation of the contour in terms of a
Fourier transform (low frequencies retain global aspects and high frequen-
cies retain details). Think about reformulating the contour fitting problem
using this kind of representation. Hint: use Fourier descriptors for describing
the contour. Evaluate the relative flexibility of each representation (B-splines
vs Fourier descriptors) to adapt to complex contours that are not uniformly
smooth.

3.9 Jump-diffusion: energy function


The energy function guides the jump-diffusion process through Ω. Is there a
parameter which controls the complexity of the generated solutions? Look for
it in Eq. 3.68. What would happen if there was no parameter for limiting the
complexity of the models?

3.10 Jump-diffusion: models and stochastic diffusions


Define the equations of an arc model. Which parameters ψ are needed? Derive
the motion equations dψ(t) of the parameters (see Eq. 3.74).

3.11 Jump-diffusion: distance between solutions in K-adventurers


Propose a new distance between solutions, ||W1 − W2||. What properties are important in this distance measure? Is the distance measure dependent on the region models of the problem? Is it possible to define an independent one?

3.12 Using Chernoff information in discriminative learning


In the boosting learning process used for learning discriminative probabil-
ities for text, consider the following types of features: horizontal, vertical,
and diagonal edges, and their statistical edge models (including, for in-
stance, gradient strength and angle) based on being on and off an edge of

the latter types (Chapter 1). Therefore, the weak classifiers are of the form h_i(I) = [log(p_text/p_non-text) > t_i]. As we will see in Chapter 7, the α_i are computed sequentially (greedily) within Adaboost (choosing a predefined order). What is more important is that such values depend on the effectiveness of the classifier h_i (the higher the number of misclassified examples, the lower α_i becomes). This virtually means that a classifier with α_i = 0 is not selected (it has no impact on the strong classifier). In this regard, what types of features will probably be excluded for text detection?
Consider now the thresholds t_i: increasing them usually leads to a reduction of false positives, but also to an increase of false negatives. A more precise threshold setting would depend on the amount of overlap between the on and off distributions, that is, on the Chernoff information. How would you incorporate this insight into the current threshold setting?
3.13 Proposal probabilities for splitting and merging regions
Estimate the weights ω_{3r} and ω_{3l} for splitting and merging regions in the image parsing approach. Consider that these weights approximate (efficiently) the ratios p(W'|I)/p(W|I) and p(W|I)/p(W'|I), respectively. Consider that in a splitting move, a region R_k existing in state W is decomposed into R_i and R_j in W', whereas in a merging move two adjacent regions R_i and R_j existing in W are fused into R_k in W'. Remember that in the approximation, a key ratio is the one between the compatibility measure of R_i and R_j and the likelihood of the R_k region for the split case, and the likelihood of the two regions (assumed independent) for the fusion case.
3.14 Markov model and skin detection
What is the main motivation of using a Markov random field for modeling the
skin color in images? What is the estimated extra computational cost of using this kind of model?
3.15 Maximum Entropy for detection
The maximum entropy principle is used in the context of skin color detection in order to introduce pairwise dependencies independently of the orientation of two 4-neighbors. Explain how the original ME formulation evolves
to the consideration of color gradients and how the ME algorithm estimates
the Lagrange multipliers.

3.7 Key References


• M. Figueiredo, J. Leitão, and A.K. Jain. “Unsupervised Contour Rep-
resentation and Estimation Using B-splines and a Minimum Description
Length Criterion”. IEEE Transactions on Image Processing 9(6): 1075–
1087 (2000)
• G. Unal, A. Yezzi, and H. Krim. “Information-Theoretic Active Polygons
for Unsupervised Texture Segmentation”. International Journal of Com-
puter Vision 62(3):199–220 (2005)
104 3 Contour and Region-Based Image Segmentation

• G. Unal, H. Krim, and A. Yezzi. “Fast Incorporation of Optical Flow into


Active Polygons”. IEEE Transactions on Image Processing 14(6): 745–759
(2005)
• B. Jedynak, H. Zheng, and M. Daoudi. “Skin Detection Using Pairwise
Models”. Image and Vision Computing 23(13): 1122–1130 (2005)
• Z.W. Tu and S.C. Zhu. “Image Segmentation by Data-Driven Markov
Chain Monte Carlo”. IEEE Transactions on Pattern Analysis and Machine
Intelligence 24(5): 657–673 (2002)
• Z.W. Tu, X.R. Chen, A.L. Yuille, and S.C. Zhu. “Image parsing: Unify-
ing Segmentation, Detection and Recognition”. International Journal of
Computer Vision 63(2):113–140 (2005)
• S.C. Zhu and A.L. Yuille. “Region Competition: Unifying Snakes, Re-
gion Growing, and Bayes/MDL for Multiband Image Segmentation”.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(9):
884–900, (1996)
• S. Geman and D. Geman. “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images”. IEEE Transactions on Pattern Analysis and Machine Intelligence 6(6): 721–741 (1984)
• X. Chen and A.L. Yuille. “Time-Efficient Cascade for Real Time Object
Detection”. First International Workshop on Computer Vision Applica-
tions for the Visually Impaired. Proceedings of IEEE Conference on Com-
puter Vision and Pattern Recognition, San Diego, CA (2004)
• G. Winkler. Image Analysis, Random Fields and Markov Chain Monte
Carlo Methods: A Mathematical Introduction. Springer, New York (2003)
4
Registration, Matching, and Recognition

4.1 Introduction
This chapter mainly deals with the way in which images and patterns are
compared. Such a comparison can be posed in terms of registration (or alignment). Registration is defined as the task of finding the minimal amount of transformation needed to map one pattern onto another (as closely as possible). The computational solution to this problem must be adapted to the kind of patterns and to the kind of transformations allowed, depending on the domain. The same applies to the metric or similarity measure used for quantifying the amount of transformation. In both cases, information theory plays a fundamental role, as we will describe throughout the present chapter.
In the case of images, it is reasonable to exploit the statistics of inten-
sities. The concept of mutual information (MI), which quantifies statistical
dependencies between variables, is a cornerstone here because such variables
are instantiated to intensity distributions. Therefore, image registration can
be posed as finding the (constrained) transformation that holds the maximal
dependency between distributions. This rationale opens the door to the quest
for new measures rooted in mutual information.
In the case of contours, registration stands for finding minimal deforma-
tions between the input and the target contours. Here, the space of defor-
mations is less constrained, but these should be as smooth as possible. It is
possible to represent a shape as a mixture or a distribution of keypoints. Such
representation enables the use of other types of information theoretic mea-
sures, like the Jensen–Shannon (JS) divergence. Furthermore, it is possible to
exploit interesting information geometry concepts, like the Fisher–Rao metric
tensor.
Finally, in the case of structural patterns, like graphs, information theory is key for driving the unsupervised learning of structural prototypes. For
instance, binary shapes can be described by a tree-like variant of the skeleton,
known as shock-tree. Then, shape information is collapsed into the tree, and
shape registration can be formulated as tree registration using a proper tree

distance. What is more important is that the MDL principle can be applied
to find a prototypical tree for each class of shape.

4.2 Image Alignment and Mutual Information


4.2.1 Alignment and Image Statistics

In computer vision, alignment refers to finding the pose of an object in an


image. The object can be represented as an image, given some imaging model
and a pose of the object. The generated image has to be compared to the
actual image in order to decide whether the pose is the correct one. One
simple example is the alignment of two similar 2D images. Another example
of an alignment problem is the estimation of the pose of a 3D model, given its
2D representation. Similarly, in medical imaging, multispectral images have
to be aligned to 2D images, or to a 3D model.
The object’s imaging model can be denoted as u(x), and its image as v(y).
The model generates the image, given an imaging function and a noise.

v(T (x)) = F (u(x), q) + η (4.1)

The image corresponds to a transformation T of the object, given the imag-


ing function F (u(x), q) and given a noise η. The imaging function not only
depends on the model of the object, but also on some exogenous influences
that are represented by the vector q.
Modeling the imaging function F is usually unfeasible and q is unknown.
However, it is possible to find a consistent alignment of two images without
knowing F and q. We can assume that the most consistent alignment is the
one that produces less discontinuities in the relation between the two images.
A way to evaluate the consistency C(T ) of some particular transformation T
is to calculate the following sum, over points xa and xb , from the model:
C(T) = − Σ_{x_a ≠ x_b} G_ψ(u(x_b) − u(x_a)) (v(T(x_b)) − v(T(x_a)))^2        (4.2)

for a fixed standard deviation ψ of the Gaussian Gψ . In this measure, T is


considered consistent if points that have similar values in the model project to similar values in the image, i.e. if small |u(x_a) − u(x_b)| implies small |v(T(x_a)) − v(T(x_b))|. The
drawback of constancy maximization is that the most consistent transforma-
tion would be the one that matches the points of the model to a constant
region of the image. Also, the assumption that the image is a function of the
model may not always be useful.
If we do not assume the existence of any F , we can still assume that the
best alignment is the one in which the model best predicts the image. Entropy is a concept that is related to predictability: the more predictable a random variable is, the lower its entropy. In the alignment problem, the conditional

entropy of the transformed image, given the model, has to be minimized by searching for a T along the space of transformations, Ω:

arg min_{T∈Ω} H[ v(T(x)) | u(x) ]        (4.3)

However, conditional entropy also has the constancy problem: it would be low
for an image v(T (x)), which is predictable from the model u(x), and it would
also be low for an image that is predictable by itself, which is the case of
constant images.
Adding a penalization to simple images solves the constancy problem. The first term of the following minimization objective is the conditional entropy. The second term rewards images with high entropy over images with low entropy:

arg min_{T∈Ω} { H[ v(T(x)) | u(x) ] − H[ v(T(x)) ] }        (4.4)

Given that H(v(T (x))|u(x)) = H(v(T (x)), u(x))−H(u(x)), the latter formula
(Eq. 4.4) corresponds with mutual information maximization:
arg min_{T∈Ω} { H[ v(T(x)), u(x) ] − H[ u(x) ] − H[ v(T(x)) ] }        (4.5)
  = arg max_{T∈Ω} { H[ u(x) ] + H[ v(T(x)) ] − H[ v(T(x)), u(x) ] }        (4.6)
  = arg max_{T∈Ω} I( u(x), v(T(x)) )        (4.7)

The third term in Eq. 4.6, H(v(T(x)), u(x)), is the one that rewards transformations for which u better explains v. The term H(v(T(x))) contributes to selecting transformations that make the model u correspond with more complex parts of the image v. The term H(u(x)) remains constant for any T, and so it does not condition the alignment.
An alignment example illustrating the meaning of conditional entropy can
be seen in Fig. 4.1. In this simple example, the images were obtained from
the same sensor, from two slightly different positions. Therefore, due to the
differences of perspective, alignment cannot be perfect. Two cases are repre-
sented: a misalignment and a correct alignment. For the first case, the joint
histogram is much more homogeneous than for the correct alignment, where
the joint histogram is tightly concentrated along the diagonal. For more com-
plex cases, involving different sensors, it is worth noting that even though the
representation of the model and the actual image could be very different, they
will have a higher mutual information when the alignment is the correct one.
A major problem of the latter approach is the estimation of entropy.
Viola and Wells III [171] explain in detail the use of mutual information
for alignment. They also present a method for evaluating entropy and mutual
information called Empirical entropy manipulation and analysis (EMMA),
which optimizes entropy, based on a stochastic approximation. EMMA uses
the Parzen window method, which is presented in the next section.

Fig. 4.1. Alignment problem example: Top: images obtained from the same sensor,
from two slightly different positions. Center-left: a misalignment. Center-right: joint
histogram of the misalignment. Bottom-left: a correct alignment. Bottom-right: joint
histogram of the alignment.

4.2.2 Entropy Estimation with Parzen’s Windows

The Parzen’s windows approach [51, 122] is a nonparametric method for es-
timating probability distribution functions (pdfs) for a finite set of patterns.
The general form of these pdfs is
P*(Y, a) ≡ (1/N_a) Σ_{y_a ∈ a} K_ψ(y − y_a)        (4.8)

where a is a sample of the variable Y, N_a is the size of the sample, and K(·) is a kernel of width ψ centered at y_a. This kernel has to be a differentiable

function, so that the entropy H can be derived for performing a gradient de-
scent over it. A Gaussian kernel is appropriate for this purpose. Also, let us as-
sume that the covariance matrix ψ is diagonal, that is, ψ = Diag(σ_1^2, . . . , σ_d^2).
This matrix indicates the widths of the kernel, and by forcing ψ to be diag-
onal, we are assuming independence among the different dimensions of the
kernel, which simplifies the estimation of these widths. Then, the kernel is
expressed as the following product:
K_ψ(y − y_a) = [ 1 / ((2π)^{d/2} ∏_{i=1}^{d} σ_i) ] exp( −(1/2) Σ_{j=1}^{d} ((y^j − y_a^j)/σ_j)^2 )        (4.9)

where y j represents the jth component of y, and yaj represents the jth com-
ponent of kernel ya . The kernel widths ψ are parameters that have to be
estimated. In [171], a method is proposed for adjusting ψ using maximum
likelihood.
The entropy of a random variable Y is the expectation of the negative
logarithm of the pdf:

H(Y) ≡ −E_Y[log(p(Y))]
     ≈ −(1/N_b) Σ_{y_b ∈ b} log(P*(y_b)) = −(1/N_b) log(ℓ(b))        (4.10)

where b refers to the samples used for estimating entropy, and ℓ(b) is their likelihood.
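The estimates of Eqs. 4.8 and 4.10 can be prototyped in a few lines. The sketch below is our own simplification: it uses an isotropic Gaussian kernel with a fixed width instead of the maximum-likelihood widths of [171].

```python
import numpy as np

def parzen_density(y, a, sigma):
    """P*(y): average of Gaussian kernels centered at the samples in a
    (Eq. 4.8), with an isotropic kernel of width sigma."""
    a = np.atleast_2d(a)
    d = a.shape[1]
    diff = y - a                                         # (Na, d)
    norm = (2.0 * np.pi) ** (d / 2.0) * sigma ** d
    k = np.exp(-0.5 * np.sum(diff ** 2, axis=1) / sigma ** 2) / norm
    return k.mean()

def parzen_entropy(b, a, sigma):
    """H(Y) ~ -(1/Nb) sum_b log P*(y_b), as in Eq. 4.10."""
    return -np.mean([np.log(parzen_density(y, a, sigma))
                     for y in np.atleast_2d(b)])

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=(100, 1))
a, b = samples[:50], samples[50:]
# Rough comparison: the true entropy of N(0,1) is 0.5*log(2*pi*e) ~ 1.419
print(parzen_entropy(b, a, sigma=0.5))
```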
In the alignment problem, the random variable Y has to be expressed as
a function of a set of parameters T . The derivative of H with respect to T is
∂/∂T H*(Y(T))
  ≈ (1/N_b) Σ_{y_b ∈ b} Σ_{y_a ∈ a} [ K_ψ(y_b − y_a) / Σ_{y_a' ∈ a} K_ψ(y_b − y_a') ] (y_b − y_a)^T ψ^{-1} ∂/∂T (y_b − y_a)        (4.11)

The first factor of the double summation is a weighting factor that takes a value close to one if y_a is much closer to y_b than the other elements of a are. Conversely, if some other element of a is much closer to y_b, the weighting factor approaches zero. Let us denote the weighting factor as W_Y, which assumes a diagonal matrix ψ_Y of kernel widths for the random variable Y:

W_Y(y_b, y_a) ≡ K_{ψ_Y}(y_b − y_a) / Σ_{y_a' ∈ a} K_{ψ_Y}(y_b − y_a')        (4.12)

The weighting factor helps find a transformation T , which reduces the av-
erage squared distance between those elements that are close to each other,
forming clusters.

4.2.3 The EMMA Algorithm

The abbreviation EMMA means “Empirical entropy Manipulation and Analysis.” It refers to an alignment method proposed by Viola and Wells [171], and is based on mutual information maximization. In order to maximize the mutual information (Eq. 4.6), an estimate of its derivative with respect to T is needed.
∂/∂T I(u(x), v(T(x))) = ∂/∂T H(v(T(x))) − ∂/∂T H(u(x), v(T(x)))        (4.13)
  ≈ (1/N_b) Σ_{x_b ∈ b} Σ_{x_a ∈ a} (v_b − v_a)^T [ W_v(v_b, v_a) ψ_v^{-1} − W_uv(w_b, w_a) ψ_vv^{-1} ] ∂/∂T (v_b − v_a)        (4.14)
T
Here, vi denotes v(T (xi )) and wi denotes [u(xi ), v(T (xi ))] for the joint den-
sity. Also, the widths of the kernels for the joint density are block diagonal:
ψ_uv^{-1} = Diag(ψ_uu^{-1}, ψ_vv^{-1}). In the last factor, (v(T(x_b)) − v(T(x_a))) has to be differentiated with respect to T, so the expression of the derivative depends on the kind of transformation involved.
Given the derivatives, a gradient descent can be performed. The EMMA
algorithm performs a stochastic analog of the gradient descent. The stochas-
tic approximation avoids falling into local minima. The following algorithm
proved to find successful alignments for sample sizes Na , Nb of about 50
samples.

A ← N_a samples selected for Parzen
B ← N_b samples selected for estimating entropy
Repeat:
    T ← T + λ (∂/∂T) I*
    Estimate I_t for the current width ψ
Until: |I_t − I_{t−1}| < μ

In the above algorithm, λ is the learning rate and μ is a threshold which indicates that convergence has been achieved. (∂/∂T) I* is the approximation of the derivative of the mutual information (Eq. 4.14), and I_t and I_{t−1} are the mutual information estimates at iteration t and at the previous one.
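The structure of this loop is easy to prototype. The sketch below is a simplification of EMMA: it keeps the stochastic sampling and the update rule, but replaces the analytic gradient of Eq. 4.14 by a central finite-difference estimate of the mutual information; mutual_information and transform are placeholders to be supplied by the user.

```python
import numpy as np

def stochastic_mi_ascent(u, v, transform, mutual_information, t0,
                         lam=0.05, mu=1e-4, n_samples=50,
                         max_iter=500, eps=1e-2, seed=0):
    """Stochastic gradient ascent on MI: at each iteration a small random
    subset of pixel coordinates is drawn and the gradient with respect to
    the transformation parameters T is approximated by finite differences."""
    rng = np.random.default_rng(seed)
    T = np.asarray(t0, dtype=float)
    coords = np.argwhere(np.ones(u.shape, dtype=bool))   # all pixel coords
    I_prev = -np.inf
    for _ in range(max_iter):
        sample = coords[rng.choice(len(coords), size=n_samples, replace=False)]
        grad = np.zeros_like(T)
        for k in range(len(T)):
            dT = np.zeros_like(T)
            dT[k] = eps
            I_plus = mutual_information(u, transform(v, T + dT), sample)
            I_minus = mutual_information(u, transform(v, T - dT), sample)
            grad[k] = (I_plus - I_minus) / (2 * eps)
        T = T + lam * grad                                # T <- T + lam * dI/dT
        I_t = mutual_information(u, transform(v, T), sample)
        if abs(I_t - I_prev) < mu:                        # convergence test
            break
        I_prev = I_t
    return T
```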
In Fig. 4.2, we represent two search surfaces generated by translating two
images vertically and horizontally. The first surface is the resulting Normal-
ized Cross Correlation of the alignments, while the second surface is the one
resulting from mutual information estimation. The advantage of mutual information is that it produces smaller local peaks; therefore, the search algorithm will more easily avoid local maxima. Viola and Wells [171] present some illustrative alignment experiments. Apart from view-based recognition experiments with 2D images, they also present the alignment of a 3D object model to a
Fig. 4.2. Normalized Cross Correlation (left) and Mutual Information (right) of image alignments produced by translations along both axes. It can be observed that Mutual Information has smaller local maxima, so the maximum (in the center of the plot) is easier to find with a stochastic search. Figure by courtesy of J.M. Sáez.

2D image, as well as MRI (magnetic resonance imaging) alignment to data


from another sensor, which is useful for medical applications.

4.2.4 Solving the Histogram-Binning Problem

Mutual information is very useful in computer vision as a similarity mea-


sure, given that it is insensitive to illumination changes. Mutual information
(Eq. 4.15) can be calculated given a marginal entropy and the conditional
entropy, or given the marginal entropies (Eq. 4.16) and the joint entropy
(Eq. 4.17). The latter involves the estimation of joint probabilities.

I(X, Y) = H(X) + H(Y) − H(X, Y)        (4.15)
        = − Σ_x p_x(x) log p_x(x) − Σ_y p_y(y) log p_y(y)        (4.16)
          + Σ_x Σ_y p_xy(x, y) log p_xy(x, y)        (4.17)

The classical way to estimate the joint probability distributions is to cal-


culate the joint histogram. A joint histogram of two images is a 2D matrix of
B × B size, where each dimension refers to one of the images and B is the
discretization used along the entire scale of intensities. For example, five bins
along the scale, which goes from 0 to 255, would produce the intervals from 0
to 51, from 51 to 102, etc. Each histogram cell cij counts the times at which
two pixels with the same coordinates on both images have intensity i on the
first image and intensity j on the second image, as illustrated in Fig. 4.3.
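This construction translates almost directly into code; the following sketch (a straightforward implementation, not the one used to produce the figures) builds a B × B joint histogram of two equally sized grayscale images and derives the mutual information from it:

```python
import numpy as np

def joint_histogram(img1, img2, bins=50):
    """B x B joint histogram of two grayscale images of equal size."""
    h, _, _ = np.histogram2d(img1.ravel(), img2.ravel(),
                             bins=bins, range=[[0, 256], [0, 256]])
    return h

def mutual_information(img1, img2, bins=50):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from the joint histogram."""
    pxy = joint_histogram(img1, img2, bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1)
    py = pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log(p))

    return entropy(px) + entropy(py) - entropy(pxy.ravel())
```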
The joint histogram represents an estimation of a joint distribution. For
tuning the estimation, a crucial parameter is the number of bins B. Let us
see an example. In Fig. 4.4, we can see three different joint histograms of


Fig. 4.3. How a classical joint histogram is built. Each cell of the histogram counts the number of times a given combination of intensities occurs between two pixels of the two images, considering only pixels that have the same coordinates.

Fig. 4.4. Joint histograms of two different photographs which show the same view
of a room. The numbers of bins of the histograms are 10 (left), 50 (center ), and 255
(right).

Fig. 4.5. The joint histograms of the photographs of Fig. 4.4, with the addition of Gaussian noise to one of the images. The numbers of bins of the histograms are 10 (left), 50 (center), and 255 (right).

two images, which show the same scene from a very similar point of view.
The first histogram has B = 10, the second one B = 50, and the third one
B = 255 bins. We can observe that 10 bins do not provide enough information
to the histogram. However, 255 bins produce a very sparse histogram (the
smaller the image is, the sparser the histogram would be). On the other hand, a large number of bins makes the histogram very vulnerable to noise in the image. We can see the histograms of the same images with Gaussian noise in Fig. 4.5.
To deal with the histogram binning problem, Rajwade et al. have proposed
a new method [133,134], which considers the images to be continuous surfaces.
In order to estimate the density of the distribution, a number of samples have
to be taken from the surfaces. The amount of samples determines the precision
of the approximation. For estimating the joint distribution of the two images, each point of the image is formulated as the intersection of two level curves, one level curve per image. In a continuous surface (see Fig. 4.6), there are
infinite level curves. In order to make the method computable in a finite time,
a number Q of intensity levels has to be chosen, for example, Q = 256. A lower
Q would proportionally decrease the precision of the estimation because the
level curves of the surface would have a larger separation, and therefore, some
small details of the image would not be taken into consideration.
In order to consider the image as a continuous surface, sub-pixel inter-
polation has to be performed. For the center of each pixel, four neighbor
points at a distance of half pixel are considered, as shown in Fig. 4.7a. Their
intensities can be calculated by vertical and horizontal interpolation. These
four points form a square that is divided in two triangles, both of which
have to be evaluated in the same way. For a given image, for example, I1,
the triangle is formed by three points pi , i ∈ {1, 2, 3} with known positions
(xpi , ypi ) and known intensities zpi . Assuming that the intensities within a
triangle are represented as a planar patch, it would be given by the equation


Fig. 4.6. Top: the continuous surface formed by the intensity values of the labora-
tory picture. Bottom: some of the level curves of the surface.

z = A1 x + B1 y + C1 in I1 . The variables A1 , B1 , and C1 are calculated using


the system of equations given by the three points of the triangle:

z_{p1} = A_1 x_{p1} + B_1 y_{p1} + C_1
z_{p2} = A_1 x_{p2} + B_1 y_{p2} + C_1        ⟹  A_1 = ?, B_1 = ?, C_1 = ?        (4.18)
z_{p3} = A_1 x_{p3} + B_1 y_{p3} + C_1

For each triangle, once we have its values of A1 , B1 , and C1 for the image I1
and A2 , B2 , and C2 for the image I2 , we can decide whether to add a vote
for a pair of intensities (α1 , α2 ). Each one of them is represented as a straight


Fig. 4.7. (a) Subpixel interpolation: the intensities of four points around a pixel are
calculated. The square formed is divided in two triangles. (b),(c) The iso-intensity
lines of I1 at α1 (dashed line) and of I2 at α2 (continuous line) for a single triangle.
In the first case (b), they intersect inside the triangle, so we vote for p(α1 , α2 ). In
the cases (c) and (d), there is no vote because the lines do not intersect in the
triangle; (d) is the case of parallel gradients. In (e), the iso-surfaces are represented,
and their intersection area in the triangle represents the amount of vote for the pair
of intensity ranges.

line in the triangle, as shown in Fig. 4.7b and c. The equations of these lines
are given by A,B, and C:

A_1 x + B_1 y + C_1 = α_1
A_2 x + B_2 y + C_2 = α_2        ⟹  x = ?, y = ?        (4.19)
The former equations form a system, which can be solved to obtain the intersection point (x, y). If it lies inside the area of the triangle (p1, p2, p3), a vote is added for the pair of intensities (α1, α2). All the triangles in the image have to be processed in the same way, for each pair of intensities. Some computational optimizations can be used in the implementation in order to avoid repeating calculations.
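A sketch of the per-triangle test, in our own rendering of the procedure just described: fit the planar patches (Eq. 4.18), solve the 2 × 2 system of Eq. 4.19, and check whether the intersection falls inside the triangle using barycentric coordinates.

```python
import numpy as np

def plane_coeffs(pts, z):
    """Solve Eq. 4.18: fit z = A x + B y + C through three points (pts is 3x2)."""
    M = np.column_stack([pts[:, 0], pts[:, 1], np.ones(3)])
    return np.linalg.solve(M, z)               # returns (A, B, C)

def inside_triangle(p, tri):
    """Barycentric test of point p against triangle vertices tri (3x2)."""
    M = np.column_stack([tri[1] - tri[0], tri[2] - tri[0]])
    l1, l2 = np.linalg.solve(M, p - tri[0])
    return l1 >= 0 and l2 >= 0 and l1 + l2 <= 1

def vote(tri, z1, z2, alpha1, alpha2):
    """True if the level lines z1 = alpha1 and z2 = alpha2 intersect inside
    the triangle, i.e. one vote for the intensity pair (alpha1, alpha2)."""
    A1, B1, C1 = plane_coeffs(tri, z1)
    A2, B2, C2 = plane_coeffs(tri, z2)
    M = np.array([[A1, B1], [A2, B2]])
    if abs(np.linalg.det(M)) < 1e-12:          # parallel gradients: no vote
        return False
    p = np.linalg.solve(M, np.array([alpha1 - C1, alpha2 - C2]))
    return inside_triangle(p, tri)
```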
The results of this first approach can be observed in Figs. 4.8 and 4.9
(third row). This method, however, has the effect of assigning more votes to zones with higher gradient. When the gradients are parallel, there is no vote; see Fig. 4.7d. In [134], Rajwade et al. have presented an improved method that,
instead of counting intersections of iso-contours, sums intersection areas of iso-
surfaces. A similar approach to density estimation had been previously taken
by Kadir and Brady [90] in the field of image segmentation. An example of
intersecting iso-surfaces is shown in Fig. 4.7e. When the gradients are parallel,
their intersection with the triangle is still considered. In the case in which one
image has zero gradient, the intersection of the iso-surface area of the other
image is the considered area. This approach produces much more robust his-
tograms. Examples can be observed in Figs. 4.8 and 4.9 (fourth row). The
difference among the different methods is more emphasized in the images of
the laboratory because the images are poorly textured, but have many edges.

Fig. 4.8. Left: the disaligned case; right: the aligned case. First row: the alignment
of the images; second, third, and fourth row: classical, point-counting, and area-based
histograms, respectively.

Fig. 4.9. Left: the disaligned case; right: the aligned case. First row: the alignment
of the images; second, third, and fourth row: classical, point-counting, and area-based
histograms, respectively.

This causes the method based on iso-contours to yield a histogram, which


visibly differs from the classical histogram. The method based on iso-surfaces
generates much more robust histograms due to its independence of the gradi-
ent. See [134] for more detailed explanations, for a generalization to multiple
images, and for an extension to 3D images.
The joint histogram is not only necessary for calculating the joint entropy H(X, Y) (Eq. 4.17); the marginal entropies of each image, H(X) and H(Y) (Eq. 4.16), also have to be estimated in order to obtain the mutual information of two images. In the iso-contour method, the marginals are calculated
by the addition of the lengths of the iso-contours inside the triangles, which
is to say, by approximating the total length of each α-intensity level curve.
In the iso-surfaces method, a similar procedure is implemented by the addition
of iso-surface areas, instead of line lengths.
A comparison of both methods, applied to image alignment, can be seen
in Fig. 4.10. In this example, the search spaces are represented as surfaces

Fig. 4.10. Image alignment by means of Mutual Information – representation of the


search spaces generated by each one of the following measures: iso-contours (top),
iso-surfaces (bottom). Figure by courtesy of A. Rajwade and A. Rangarajan.

because only horizontal and vertical translations were used as parameters for
the alignment. It can be seen that the method based on iso-contours (point counting) yields a much more abrupt search surface than the iso-surfaces (area-based) method.

4.3 Alternative Metrics for Image Alignment

Mutual information is a widely used measure in image alignment (also referred


to as image registration). However, it is not a metric as it does not satisfy the
triangle inequality. A metric d(x, y) should satisfy the following conditions:

1. Nonnegativity

d(x, y) ≥ 0
2. Identity of indiscernibles

d(x, y) = 0 ⇔ x = y
3. Symmetry

d(x, y) = d(y, x)
4. Triangle inequality

d(x, z) ≤ d(x, y) + d(y, z)

A pseudometric is a measure which relaxes the second axiom to

d(x, y) = 0 ⇐ x = y

This implies that two different points can have a zero distance.

4.3.1 Normalizing Mutual Information

Different mutual information normalizations are used in pattern recognition,


and not all of them are metrics. Uncertainty coefficients are used to scale
mutual information I(X, Y ), dividing it by the entropy of one of the terms,
X or Y .

I(X; Y)/H(Y)     or     I(X; Y)/H(X)

However, the entropies H(X) and H(Y ) are not necessarily the same.
A solution to this problem would be to normalize by the sum of both
entropies. The resulting measure is symmetric:
R(X, Y) = I(X; Y) / (H(X) + H(Y))
        = (H(X) + H(Y) − H(X, Y)) / (H(X) + H(Y))
        = 1 − H(X, Y) / (H(X) + H(Y))
This measure actually measures redundancy and is zero when X and Y are
independent. It is proportional to the entropy correlation coefficient (ECC),
also known as symmetric uncertainty:
ECC(X, Y) = 2 − 2 / NI_1(X, Y)
          = 2 − 2 H(X, Y) / (H(X) + H(Y))
          = 2 R(X, Y)

In the previous equation, N I1 (X, Y ) refers to another variant of mutual


information normalization:
NI_1(X, Y) = (H(X) + H(Y)) / H(X, Y)        (4.20)
Another widely used mutual information normalization is defined as
NI_2(X, Y) = I(X; Y) / max(H(X), H(Y))        (4.21)

It is not only symmetrical, but also satisfies positive definiteness (nonnegativ-


ity and identity of indiscernibles) as well as the triangle inequality, therefore
it is a metric.
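Assuming that the marginal and joint entropies have already been estimated (for example from a joint histogram, as in Section 4.2.4), the normalizations above reduce to a few lines; a sketch:

```python
def mi(hx, hy, hxy):
    """I(X;Y) from marginal and joint entropies."""
    return hx + hy - hxy

def ecc(hx, hy, hxy):
    """Entropy correlation coefficient (symmetric uncertainty)."""
    return 2.0 - 2.0 * hxy / (hx + hy)

def ni1(hx, hy, hxy):
    """Normalization of Eq. 4.20."""
    return (hx + hy) / hxy

def ni2(hx, hy, hxy):
    """Normalization of Eq. 4.21."""
    return mi(hx, hy, hxy) / max(hx, hy)
```

All entropies must, of course, be computed with the same logarithm base.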

4.3.2 Conditional Entropies

In [179], Zhang and Rangarajan have proposed the use of a new informa-
tion metric (actually a pseudometric) for image alignment, which is based on
conditional entropies:

ρ(X, Y ) = H(X|Y ) + H(Y |X) (4.22)

It satisfies nonnegativity, symmetry, and triangle inequality. However, ρ(X, Y )


can be zero not only when X = Y , but also when X is a function of Y , and
therefore, it is not a metric, but a pseudometric.

Given the definition H(X|Y ) = H(X, Y ) − H(Y ), the relation of ρ(X, Y )


with mutual information is

ρ(X, Y ) = H(X|Y ) + H(Y |X)


= H(X, Y ) − H(Y ) + H(X, Y ) − H(X) (4.23)
= H(X, Y ) − (H(Y ) − H(X, Y ) + H(X))
= H(X, Y ) − I(X; Y )

In the alignment problem, I(X, Y) has to be maximized. Conversely, the pseudometric ρ(X, Y) has to be minimized.
A useful normalization of ρ(X, Y ), which is also a pseudometric, is given
by dividing it by the joint entropy of the variables:

τ(X, Y) = ρ(X, Y) / H(X, Y)
        = (H(X|Y) + H(Y|X)) / H(X, Y)

Its relation with mutual information, according to Eq. 4.23, is

τ(X, Y) = (H(X, Y) − I(X; Y)) / H(X, Y)
        = 1 − I(X; Y) / H(X, Y)

4.3.3 Extension to the Multimodal Case

Another important difference between the metric ρ(X, Y) and the mutual information measure is that the new information metric is easily extensible to the multimodal case (registration of more than two images). The mutual information (MI) of three variables is defined as

I(X; Y, Z) = H(X) + H(Y) + H(Z) − H(X, Y) − H(X, Z) − H(Y, Z) + H(X, Y, Z)        (4.24)

There is another definition that uses the same notation, I(X; Y ; Z), called
interaction information, also known as co-information [16]. It is commonly
defined in terms of conditional mutual information, I(X; Y |Z) − I(X; Y ),
which is equivalent to the negative MI as defined in Eq. 4.24. The conditional
MI is defined as

I(X; Y |Z) = H(X|Z) − H(X|Y, Z)
           = H(X, Z) + H(Y, Z) − H(Z) − H(X, Y, Z)

and it is never negative, I(X; Y |Z) ≥ 0. Provided that the MI between


X and Y is
I(X; Y ) = H(X) + H(Y ) − H(X, Y )
then the co-information of the three variables X, Y , and Z is

I(X; Y |Z) − I(X; Y ) = H(X, Z) + H(Y, Z) − H(Z) − H(X, Y, Z)


−(H(X) + H(Y ) − H(X, Y ))
= −H(X) − H(Y ) − H(Z)
+H(X, Z) + H(X, Y ) + H(Y, Z) − H(X, Y, Z)

which is equal to the negative MI of the three variables. The co-information


and the MI between three variables can be positive, zero, or negative. The
negativity of MI could be undesirable in some alignment approaches. To
overcome this problem, some alternative MI definitions have been proposed
in the literature. For example, even though it is not a natural extension of
MI, the definition

I(X; Y ; Z) = H(X) + H(Y ) + H(Z) − H(X, Y, Z)

avoids the negativity problem. In information theory, it is known as total


correlation.
The extension of ρ(X, Y ) to three or more random variables is straightfor-
ward:
ρ(X, Y, Z) = H(X|Y, Z) + H(Y|X, Z) + H(Z|X, Y)

ρ(X_1, X_2, . . . , X_n) = Σ_{i=1}^{n} H(X_i | X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n)

Similarly, the extension of τ (X, Y ) to three or more random variables is

τ(X, Y, Z) = ρ(X, Y, Z) / H(X, Y, Z)

τ(X_1, X_2, . . . , X_n) = ρ(X_1, X_2, . . . , X_n) / H(X_1, X_2, . . . , X_n)
For a better understanding of ρ(X, Y, Z), τ (X, Y, Z), and I(X; Y ; Z), see
Fig. 4.11, where information is represented as areas. In [179], the compu-
tational complexity of estimating the joint probability of three variables is
taken into consideration. The authors propose the use of an upper bound of
the metric.

4.3.4 Affine Alignment of Multiple Images

In the case of alignment of two images, I1 and I2 , a proper transformation


T ∗ (I2 ) that aligns I2 to I1 has to be found. This proper transformation would


Fig. 4.11. Mutual Information I(X, Y, Z), joint entropy H(X, Y, Z), conditional
entropies H(· |· , · ), and marginal entropies H(· ) for three random variables X, Y ,
and Z, represented in a Venn diagram.

be the one that maximizes or minimizes some criterion. In the case of the ρ
metric, we search for a T ∗ that minimizes it:

T* = argmin_T ρ(I_1, T(I_2))

In the case of alignment of more than two images, say n images, n − 1 transformations have to be found:

{T_1*, . . . , T_{n−1}*} = argmin_{T_1,...,T_{n−1}} ρ(I_1, T_1(I_2), . . . , T_{n−1}(I_n))

In the case of affine alignment, the transformations are limited to vertical


and horizontal translations, rotations, scaling and shearing. Each transfor-
mation function T establishes a correspondence between the coordinates of
an image I and the transformed image I  = T (I). In the affine case, the
transformation function can be expressed as a product of matrices between
a transformation matrix M and the homogeneous coordinates in the original
image, (xI , yI , 1). The resulting homogeneous coordinates, (xI  , yI  , 1), are the
coordinates of the point in the transformed image I  :
( x_{I'} )   ( a  b  e ) ( x_I )
( y_{I'} ) = ( c  d  f ) ( y_I )
(   1    )   ( 0  0  1 ) (  1  )

The transformation matrix M of size 3 × 3 defines the affine transformation.


Concretely, e and f contain the translation:
( x_{I'} )   ( 1  0  e ) ( x_I )   ( x_I + e )
( y_{I'} ) = ( 0  1  f ) ( y_I ) = ( y_I + f )
(   1    )   ( 0  0  1 ) (  1  )   (    1    )

The scale is given by a and c:


( x_{I'} )   ( a  0  0 ) ( x_I )   ( a x_I )
( y_{I'} ) = ( 0  c  0 ) ( y_I ) = ( c y_I )
(   1    )   ( 0  0  1 ) (  1  )   (   1   )

Shearing is given by b:
( x_{I'} )   ( 1  b  0 ) ( x_I )   ( x_I + b y_I )
( y_{I'} ) = ( 0  1  0 ) ( y_I ) = (     y_I     )
(   1    )   ( 0  0  1 ) (  1  )   (      1      )

Finally, a rotation along the z axis of θ radians from the origin is given by
a, b, c and d all together:
( x_{I'} )   ( cos θ  −sin θ  0 ) ( x_I )   ( x_I cos θ − y_I sin θ )
( y_{I'} ) = ( sin θ   cos θ  0 ) ( y_I ) = ( x_I sin θ + y_I cos θ )
(   1    )   (   0       0    1 ) (  1  )   (           1           )

When simultaneously aligning n images, n−1 different matrices M1 , . . . , Mn−1


have to be estimated.
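A small sketch that builds the elementary matrices above and composes them; the composition order used here (rotation, then shear, then scale, then translation) is an arbitrary choice for the example.

```python
import numpy as np

def translation(e, f):
    return np.array([[1, 0, e], [0, 1, f], [0, 0, 1]], dtype=float)

def scale(a, c):
    return np.array([[a, 0, 0], [0, c, 0], [0, 0, 1]], dtype=float)

def shear(b):
    return np.array([[1, b, 0], [0, 1, 0], [0, 0, 1]], dtype=float)

def rotation(theta):
    ct, st = np.cos(theta), np.sin(theta)
    return np.array([[ct, -st, 0], [st, ct, 0], [0, 0, 1]], dtype=float)

# Compose an affine transformation and apply it to homogeneous coordinates.
M = translation(10, -5) @ scale(1.2, 0.8) @ shear(0.1) @ rotation(np.pi / 6)
p = np.array([30.0, 40.0, 1.0])           # (x_I, y_I, 1)
p_prime = M @ p                           # (x_I', y_I', 1)
```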

4.3.5 The Rényi Entropy

Alfréd Rényi (20 March 1921–1 February 1970) was a Hungarian mathematician who mainly contributed to probability theory [137], combinatorics and graph theory. One of his contributions in probability theory is a generaliza-
tion of the Shannon entropy, known as alpha-entropy (or α-entropy), as well
as Rényi entropy.
A random variable X with n possible values x_1, . . . , x_n has some distribution P = (p_1, p_2, . . . , p_n) with Σ_{k=1}^{n} p_k = 1. The probability of X taking the value x_i is usually denoted as P(X = x_i), and sometimes abbreviated as P(x_i). The Shannon entropy can be defined for the random variable or for its probability distribution. In the following definition, E denotes the expected value function:

H(X) = −E_X(log_b P(X))        (4.25)
     = − Σ_{i=1}^{n} P(X = x_i) log_b P(X = x_i)        (4.26)
     = − Σ_{i=1}^{n} P(x_i) log_b P(x_i)        (4.27)
     = − Σ_{k=1}^{n} p_k log_b p_k        (4.28)

The base b of the logarithm logb will be omitted in many definitions in this
book because it is not significant for most pattern recognition problems.
Actually, the base of the logarithm logb determines the units of the entropy
measure:
Logarithm Units of H
log2 bit
loge nat
log10 dit/digit
Some pattern recognition approaches model the probability distributions
as a function, called probability density function (pdf). In that case, the
summation of elements of P becomes an integral. The definition of Shannon
entropy for a pdf f (z) is

H(f) = − ∫_z f(z) log f(z) dz        (4.29)

The Rényi entropy is a generalization of Shannon entropy. Its definition


for a distribution P is

H_α(P) = [1/(1 − α)] log Σ_{i=1}^{n} p_i^α        (4.30)

and its definition for a pdf f(z) is

H_α(f) = [1/(1 − α)] log ∫_z f^α(z) dz        (4.31)

The order α of the Rényi entropy is positive and it cannot be equal to 1, as there would be a division by zero in 1/(1 − α):

α ∈ (0, 1) ∪ (1, ∞)

In Fig. 4.12, it can be seen that for the interval α ∈ (0, 1), the entropy function H_α is concave, while for the interval α ∈ (1, ∞), it is neither concave nor convex. Also, for the first interval, it is greater than or equal to the Shannon entropy, H_α(P) ≥ H(P), ∀α ∈ (0, 1), given that H_α is a nonincreasing function of α.
The discontinuity at α = 1 is very significant. It can be shown, analytically
and experimentally, that as α approaches 1, Hα tends to the value of the
Shannon entropy (see the proof in Chapter 5). This relation between both
entropies has been exploited in some pattern recognition problems. As shown
in the following subsection, some entropy estimators provide an approximation
to the value of Rényi entropy. However, Shannon entropy could be necessary
in some problems. In Chapter 5, we present an efficient method to approximate
Shannon entropy from Rényi entropy by finding a suitable α value.
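The convergence of H_α to the Shannon entropy as α → 1 is easy to verify numerically for a discrete distribution; a sketch:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """H_alpha(P) = log(sum p_i^alpha) / (1 - alpha), alpha != 1 (Eq. 4.30)."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = [0.5, 0.25, 0.125, 0.125]
for alpha in [0.5, 0.9, 0.99, 0.999, 1.001, 1.1, 2.0]:
    print(alpha, renyi_entropy(p, alpha))   # approaches the Shannon value near 1
print("Shannon:", shannon_entropy(p))
```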

0.8

0.6
α=0
Hα(p)

α=0.2
0.4 α=0.5
Shannon
α=2
0.2
α=10
α→∞

0
0 0.2 0.4 0.6 0.8 1
p

Fig. 4.12. Rényi and Shannon entropies of a Bernoulli distribution P = (p, 1 − p).

4.3.6 Rényi’s Entropy and Entropic Spanning Graphs

Entropy estimation is critical in IT-based pattern recognition problems. En-


tropy estimators can be divided into two categories: “plug-in” and “nonplug-
in.” Plug-in methods first estimate the density function, for example, the
construction of a histogram, or the Parzen’s windows method. The nonplug-
in methods, on the contrary, estimate entropy directly from a set of samples.
The Rényi entropy of a set of samples can be estimated from the length
of their Minimal Spanning Tree (MST) in a quite straightforward way. This
method, based on Entropic Spanning Graphs [74], belongs to the “nonplug-in”
methods of entropy estimation.
The MST graphs have been used for testing the randomness of a set of
points (see Fig. 4.13). In [73], it was shown that in a d-dimensional feature
space, with d ≥ 2, the α-entropy estimator
H_α(X_n) = (d/γ) [ ln( L_γ(X_n) / n^α ) − ln β_{L_γ,d} ]        (4.32)

is asymptotically unbiased and consistent with the PDF of the samples. Here,
the function Lγ (Xn ) is the length of the MST, and γ depends on the order α
and on the dimensionality: α = (d − γ)/d. The bias correction βLγ ,d depends
on the graph minimization criterion, but it is independent of the PDF. There
are some approximations which bound the bias by (i) Monte Carlo simulation

Fig. 4.13. Minimal spanning trees of samples with a Gaussian distribution (top)
and samples with a random distribution (bottom). The length Lγ of the first MST
is visibly shorter than the length of the second MST.

of uniform random samples on unit cube [0, 1]d and (ii) approximation for
large d: (γ/2) ln(d/(2πe)) in [21].
The length L_γ(X_n) of the MST is defined as the minimum, over all acyclic spanning graphs, of the sum of the weights of its edges {e}, where in this case the weights are defined as |e|^γ:

L_γ^{MST}(X_n) = min_{M(X_n)} Σ_{e ∈ M(X_n)} |e|^γ        (4.33)

where γ∈ (0, d). Here, M (Xn ) denotes the possible sets of edges of a spanning
tree graph, where Xn = {x1 , ..., xn } is the set of vertices which are connected
by the edges {e}. The weight of each edge {e} is the distance between its
vertices, powered the γ parameter: | e |γ . There are several algorithms for
building a MST, for example, the Prim’s MST algorithm has a straightforward
implementation. There also are estimators that use other kinds of entropic
graphs, for example, the k-nearest neighbor graph [74].
Entropic spanning graphs are suitable for estimating α-entropy for 0 ≤
α < 1, so Shannon entropy cannot be directly estimated with this method.
In [185], relations between Shannon and Rényi entropies of integer order are

discussed. In [115], Mokkadem constructed a nonparametric estimate of the


Shannon entropy from a convergent sequence of α-entropy estimates (see
Chapter 5).
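A sketch of the estimator of Eq. 4.32 using SciPy; for simplicity the bias-correction term β is omitted, so the value returned is only the uncorrected estimate.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_renyi_entropy(X, alpha):
    """Uncorrected MST estimate of the alpha-entropy of samples X (n x d),
    following Eq. 4.32 with gamma = d * (1 - alpha) and beta omitted."""
    n, d = X.shape
    gamma = d * (1.0 - alpha)
    D = squareform(pdist(X))                  # pairwise Euclidean distances
    T = minimum_spanning_tree(D)              # sparse matrix with MST edges
    L = np.sum(T.data ** gamma)               # weighted MST length L_gamma
    return (d / gamma) * np.log(L / n ** alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
print(mst_renyi_entropy(X, alpha=0.5))
```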

4.3.7 The Jensen–Rényi Divergence and Its Applications

The Jensen–Rényi divergence is an information-theoretic divergence mea-


sure, which uses the generalized Rényi entropy for measuring the statistical
dependence between an arbitrary number of probability distributions. Each
distribution has a weight, and also the α order of the entropy can be ad-
justed for varying the measurement sensitivity of the joint histogram. The
Jensen–Rényi divergence is symmetrical and convex. For n probability distri-
butions p1 , p2 , . . . , pn , it is defined as
JR_α^ω(p_1, . . . , p_n) = R_α( Σ_{i=1}^{n} ω_i p_i ) − Σ_{i=1}^{n} ω_i R_α(p_i)        (4.34)

where Rα is the Rényi entropy and ω = (ω1 , ω2 , . . . , ωn ) is the weight vector,


and the sum of the weights (all of them positive) must be 1:


Σ_{i=1}^{n} ω_i = 1,     ω_i ≥ 0

In [70], Hamza and Krim define the divergence and study some interesting
properties. They also present its application to image registration and segmen-
tation. In the case of image registration with the presence of noise, the use of
the Jensen–Rényi divergence outperforms the use of mutual information, as
shown in Fig. 4.14.
For segmentation, one approach is to use a sliding window. The window is
divided in two parts W1 and W2, each one corresponding to a portion of data
(pixels, in the case of an image). The divergence between the distributions
of both subwindows is calculated. In the presence of an edge or two different
regions A and B, the divergence would be maximum when the window is
centered at the edge, as illustrated in Fig. 4.15. In the case of two distributions,
the divergence can be expressed as a function of the fraction λ of subwindow
W2, which is included in the region B.

JR_α(λ) = R_α(p) − [ R_α((1 − λ)p_a + λ p_b) + R_α(p_a) ] / 2

where α ∈ (0, 1) and

p = (1 − λ/2) p_a + (λ/2) p_b
Results of segmentation are shown in Fig. 4.16. Different values of the α parameter yield similar results in this case.
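A sketch of Eq. 4.34 for discrete distributions, together with the two-subwindow edge score just described; the window layout and the number of histogram bins are arbitrary choices.

```python
import numpy as np

def renyi(p, alpha):
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def jensen_renyi(dists, weights, alpha):
    """JR_alpha^w(p_1,...,p_n) = R_alpha(sum w_i p_i) - sum w_i R_alpha(p_i)."""
    dists = np.asarray(dists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mixture = weights @ dists
    return renyi(mixture, alpha) - np.sum(
        weights * np.array([renyi(p, alpha) for p in dists]))

def window_histogram(signal, bins=16):
    h, _ = np.histogram(signal, bins=bins, range=(0, 256))
    return h / h.sum()

def edge_score(signal, center, half_width, alpha=0.7):
    """JR divergence between the two halves of a sliding window on a 1D signal."""
    w1 = window_histogram(signal[center - half_width:center])
    w2 = window_histogram(signal[center:center + half_width])
    return jensen_renyi([w1, w2], [0.5, 0.5], alpha)
```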


Fig. 4.14. Registration in the presence of noise. For the Jensen–Rényi divergence, α = 1 and the weights are ω_i = 1/n. Figure by A. Ben Hamza and H. Krim (© 2003 Springer).

4.3.8 Other Measures Related to Rényi Entropy

In [137], Rényi introduces a measure of dissimilarity between densities, called


Rényi α-divergence, or Rényi α-relative entropy. This divergence and some
particular cases of it have been widely used for image registration (alignment),
as well as in other pattern recognition problems. Given two densities f (z) and


Fig. 4.15. Sliding window for edge detection. The divergence between the distribu-
tions of the subwindows W1 and W2 is higher when they correspond to two different
regions of the image.

Fig. 4.16. Edge detection results using Jensen–Rényi divergence for various values of α. Figure by A. Ben Hamza and H. Krim (© 2003 Springer).

g(z), for a d-dimensional random variable z, the Rényi α-divergence of the


density g from the density f is defined as
D_α(f||g) = [1/(α − 1)] log ∫ g(z) ( f(z)/g(z) )^α dz        (4.35)

It is not symmetric, so it is not a distance, just a divergence. Depending on


the order α, there are some special cases.
The most notable case is α = 1, because there is a division by zero in
1/(α − 1). However, in the limit of α → 1, the Kullback–Leibler (KL) diver-
gence is obtained:

lim_{α→1} D_α(f||g) = ∫ f(z) log [ f(z)/g(z) ] dz = KL(f||g)
The KL-divergence is widely used in pattern recognition. In some problems it
is referred to as “information gain,” for example, in decision trees.
Another case is α = 1/2:

$$D_{1/2}(f\|g) = -2\log\int\sqrt{f(z)g(z)}\,dz$$

which is related to the Bhattacharyya coefficient, defined as

$$BC(f,g) = \int\sqrt{f(z)g(z)}\,dz$$

and
DB (f, g) = − log BC(f, g)
is the Bhattacharyya distance, which is symmetric. The Hellinger dissimilarity
is also related to the Bhattacharyya coefficient:
$$H(f,g) = \frac{1}{2}\sqrt{2 - 2BC(f,g)}$$

Then, its relation to the α-relative divergence of order α = 1/2 is

$$H(f,g) = \frac{1}{2}\sqrt{2 - 2\exp\left(-\frac{1}{2}D_{1/2}(f\|g)\right)}$$
The divergences with α = −1 and α = 2 are also worth mentioning:

$$D_{-1}(f\|g) = \frac{1}{2}\int\frac{(f(z)-g(z))^2}{f(z)}\,dz$$

and

$$D_{2}(f\|g) = \frac{1}{2}\int\frac{(f(z)-g(z))^2}{g(z)}\,dz$$
From the definition of the α-divergence (Eq. 4.35), it is easy to define a
mutual information that depends on α. The mutual information is a measure
that depends on the marginals and the joint density f (x, y) of the variables
x and y. Defining g as the product of the marginals, g(x, y) = f (x)f (y), the
expression of the α-mutual information is
$$I_\alpha = D_\alpha(f\|g) = \frac{1}{\alpha-1}\log\int\!\!\int f(x)f(y)\left(\frac{f(x,y)}{f(x)f(y)}\right)^{\alpha} dx\,dy$$
which, in the limit α → 1, converges to the Shannon mutual information.
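As a small numerical illustration (ours, not from the referenced works), the following sketch evaluates the Rényi α-divergence of Eq. 4.35 for discrete pmfs and checks the two special cases just discussed: the KL limit as α → 1 and the relation with the Bhattacharyya coefficient at α = 1/2. The pmfs used are arbitrary examples.

```python
# Minimal sketch: Rényi alpha-divergence for discrete pmfs and its special cases.
import numpy as np

def renyi_divergence(f, g, alpha):
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = (f > 0) & (g > 0)
    return np.log(np.sum(g[mask] * (f[mask] / g[mask]) ** alpha)) / (alpha - 1.0)

f = np.array([0.2, 0.5, 0.3])
g = np.array([0.4, 0.4, 0.2])

kl = np.sum(f * np.log(f / g))
print(renyi_divergence(f, g, 0.999), kl)             # nearly equal (KL limit)

bc = np.sum(np.sqrt(f * g))                           # Bhattacharyya coefficient
print(renyi_divergence(f, g, 0.5), -2 * np.log(bc))   # equal up to rounding
```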

Fig. 4.17. Ultrasound images of breast tumor, separated and rotated. Figure by H. Neemuchwala, A. Hero, S. Zabuawala and P. Carson (©2007 Wiley).

4.3.9 Experimental Results

In [117], Neemuchwala et al. study the use of different measures for image registration. They also present experimental results, among which the one shown in Fig. 4.17 involves two images from a database of breast tumor ultrasound images; two different images are shown, one of them rotated.
In Fig. 4.18, several alignment criteria are compared. For each one of them,
different orders α are tested and represented.

4.4 Deformable Matching with Jensen Divergence and Fisher Information
4.4.1 The Distributional Shape Model

Let us focus on shape as a set (list) of points (like the rough discretization of
a contour, or even a surface) (Fig. 4.19). We have a collection of N shapes (data point sets) C = {X^c : c = 1, . . . , N}, being X^c = {x_i^c ∈ R^d : i = 1, . . . , n_c} a given set of n_c points in R^d, with d = 2 for the case of 2D shapes.
A distributional model [9, 174] of a given shape can be obtained by modelling
each point set as a Gaussian mixture. As we will see in more detail in the fol-
lowing chapter (clustering), Gaussian mixtures are fine models for encoding a
multimodal distribution. Each mode (Gaussian cluster) may be represented
as a multidimensional Gaussian, with a given average and covariance matrix, the shape pdf being the convex combination (coefficients add to one) of all the Gaussians.
Gaussian centers may be placed at each point if the number of points is not
too high. The problem of finding the optimal number of Gaussians and their
placement and covariances will be covered in the first part of the next chap-
ter. For the moment, let us assume that we have a collection of cluster-center
point sets V = {V^c : c = 1, . . . , N}, where each center point set consists of

Fig. 4.18. Normalized average profiles of image matching criteria for registration
of ultrasound breast tumor images. Plots are normalized with respect to the maxi-
mum variance in the sampled observations. Each column corresponds to a different
α order, and each row to a different matching criterion. (Row 1) k-NN graph-based
estimation of α-Jensen difference divergence, (row 2) MST graph-based estimation
of α-Jensen difference divergence, (row 3) Shannon Mutual Information, and (row 4)
α Mutual Information estimated using NN graphs. The features used in the exper-
iments are ICA features, except for row 3, where the histograms were directly calculated from the pixels. Figure by H. Neemuchwala, A. Hero, S. Zabuawala and P. Carson (©2007 Wiley).

Fig. 4.19. Probabilistic description of shapes. Left: samples (denoted by numbers 1–3 (first row), 4–6 (second row) and 7–9 (third row)). Right: alignment with the variance represented as a circle centered at each shape point. Figure by A. Peter and A. Rangarajan (©2006 IEEE).

Vc = {vca ∈ Rd : a = 1, . . . , K c }, K c being the number of clusters (Gaus-


sians) for the cth point set. For simplicity, let us assume that such numbers
are proportional to the sizes of their corresponding point sets. Then, the pdf
characterizing a given point set is
$$p_c(x) = \sum_{a=1}^{K^c}\alpha_a^c\, p(x|v_a^c) \quad\text{with}\quad \sum_{a=1}^{K^c}\alpha_a^c = 1,\ \alpha_a^c \geq 0\ \ \forall \alpha_a^c \qquad (4.36)$$

Under the Gaussian assumption of the mixtures, we have

$$p(x|v_a^c) = G(x - v_a^c, \Sigma_a) = \frac{1}{(2\pi)^{d/2}|\Sigma_a|^{1/2}}\exp\left(-\frac{1}{2}(x - v_a^c)^T\Sigma_a^{-1}(x - v_a^c)\right) \qquad (4.37)$$
denoting Σa the ath d × d covariance matrix as usual. Such matrix is assumed
to be diagonal for simplicity. Then, assuming independence between shape
points inside a given point set, we obtain the pdf for a given point set that is
of the form
$$p(X^c|V^c,\alpha^c) = \prod_{i=1}^{n_c} p_c(x_i) = \prod_{i=1}^{n_c}\sum_{a=1}^{K^c}\alpha_a^c\, p(x_i|v_a^c) \qquad (4.38)$$

being $\alpha^c = (\alpha_1^c,\ldots,\alpha_{K^c}^c)^T$.
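For illustration purposes, the following sketch (ours; isotropic Gaussians and uniform weights are assumed for brevity, and the toy shape is invented) evaluates the log of the point-set pdf of Eq. 4.38 for a 2D shape.

```python
# Minimal sketch: log-pdf of a point set under a Gaussian-mixture shape model.
import numpy as np

def shape_log_pdf(points, centers, weights, sigma):
    """log p(X|V, alpha) under the independence assumption of Eq. 4.38."""
    d = points.shape[1]
    diff = points[:, None, :] - centers[None, :, :]           # (n, K, d)
    sq = np.sum(diff ** 2, axis=2)
    gauss = np.exp(-0.5 * sq / sigma ** 2) / ((2 * np.pi * sigma ** 2) ** (d / 2))
    mix = gauss @ weights                                      # mixture value per point
    return np.sum(np.log(mix + 1e-300))

# Toy 2D shape: the mixture is peaked near the cluster centers.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
weights = np.full(3, 1.0 / 3.0)
pts_on = centers + 0.05
pts_off = centers + 2.0
print(shape_log_pdf(pts_on, centers, weights, sigma=0.2) >
      shape_log_pdf(pts_off, centers, weights, sigma=0.2))    # True
```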
Let then Z denote the so-called average atlas point set, that is, the point set derived from registering all the input point sets with respect to a common reference system. Registration means, in this context, the alignment of
all point sets Xc through general (deformable) transformations f c , each one
parameterized by μc . The parameters of all transformations must be recov-
ered in order to compute Z. To that end, the pdfs of each of the deformed
shapes may be expressed in the following terms:
$$p_c = p_c(x|V^c,\mu^c) = \sum_{a=1}^{K^c}\alpha_a^c\, p(x|f^c(v_a^c)) \qquad (4.39)$$

In the most general (deformable) case, functions f c are usually defined as


thin-plate splines (TPS) (see Bookstein classic papers [26,40]). The underlying
concept of TPS is that a deformable transformation can be decomposed into a
linear part and a nonlinear one. All we need is to have a set of n control points
x_1, . . . , x_n ∈ R^d (recall the B-spline definition of contours in the previous
chapter). Then, from the latter points, we may extract a nonrigid function
as a mapping f : Rd → Rd :

f (x) = (Ax + t) + WU(x) (4.40)

where Ad×d and td×1 denote the linear part; Wd×n encodes the nonrigid part;
and U(x)n×1 encodes n basis functions (as many as control points) Ui (x) =
U (||x − xi ||), each one centered at each xi , being U (.) a kernel function (for
instance U(r) = r² log(r²) when d = 2). Thus, what f is actually encoding is an estimation (interpolation) of the complete deformation field from the input points. The estimation process requires the proposed correspondences for each control point x_1, . . . , x_n. For instance, in d = 2 (2D shapes), we have for each x_i = (x_i, y_i)^T its corresponding point x_i' = (x_i', y_i')^T (known

beforehand or proposed by the matching algorithm). Then, let Kn×n be a


symmetric matrix where the entries are U (rij ) = U (||xi − xj ||) with i, j =
1, . . . , n, and Pn×3 a matrix where the ith row has the form (1 xi yi ), that is
⎛ ⎞ ⎛ ⎞
0 U (r12 ) . . . U (r1n ) 1 x1 y1
⎜ U (r21 ) 0 . . . U (r2n ) ⎟ ⎜ 1 x2 y2 ⎟
⎜ ⎟ ⎜ ⎟
K=⎜ . . . . ⎟ P=⎜. . . ⎟ (4.41)
⎝ .. .. .. .. ⎠ ⎝ .. .. .. ⎠
U (rn1 ) U (rn2 ) . . . 0 1 xn yn

The latter matrix is the building block of matrix $L_{(n+3)\times(n+3)}$, and the coordinates of the corresponding points $x_i'$ are placed in the columns of matrix $V_{2\times n}$:

$$L = \begin{pmatrix} K & P \\ P^T & 0_{3\times 3} \end{pmatrix} \qquad V = \begin{pmatrix} x_1' & x_2' & \ldots & x_n' \\ y_1' & y_2' & \ldots & y_n' \end{pmatrix} \qquad (4.42)$$

Then, the least squares estimation of $A_1 = (a_1^x, a_1^y)$, $t^x$, and $W^x = (w_1^x,\ldots,w_n^x)$, which are the transformation coefficients for the X dimension, is given by solving

$$L^{-1}(V_x|0,0,0)^T = (W^x|t^x, a_1^x, a_1^y)^T \qquad (4.43)$$

$V_x$ being the first row of V. A similar rationale is applied for obtaining the coefficients for the Y dimension, and then we obtain the interpolated position for any vector x = (x, y)^T:

$$f(x) = \begin{pmatrix} f^x(x) \\ f^y(x) \end{pmatrix} = \begin{pmatrix} a_1^x & a_1^y \\ a_2^x & a_2^y \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} t^x \\ t^y \end{pmatrix} + \begin{pmatrix} \sum_{i=1}^n w_i^x U(\|x - x_i\|) \\ \sum_{i=1}^n w_i^y U(\|x - x_i\|) \end{pmatrix} \qquad (4.44)$$

yielding Eq. 4.40. The computation of this interpolation may be speeded up


by approximation methods like the Nyström one (see [50]). An important
property of f (x) obtained as described above is that it yields the smoothest
interpolation among this type of functions. More precisely, f minimizes
$$I(f) = \int\!\!\int_{R^2}\left[\left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2\right]dx\,dy \qquad (4.45)$$

which is proportional to E(f) = trace(WKW^T), a quantity known as bending energy because it quantifies the amount of deformation induced by the

ing energy because it quantifies the amount of deformation induced by the
obtained mapping. The bending energy is only zero when the nonrigid coef-
ficients are zero. In this latter case, only the rigid (linear) part remains. Fur-
thermore, as a high value of E(f ) reflects the existence of outlying matchings
when a given correspondence field is recovered, this energy has been recently
used for comparing the performance of different matching algorithms [1].
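A minimal sketch of the TPS machinery of Eqs. 4.41–4.43 and of the bending energy trace(WKW^T) is the following (our own illustration, assuming 2D control points and the kernel U(r) = r² log r²); it also shows numerically that a purely affine deformation has zero bending energy.

```python
# Minimal sketch: thin-plate spline fitting and bending energy for 2D points.
import numpy as np

def tps_kernel(r2):
    # U(r) = r^2 log(r^2) for d = 2, with U(0) = 0
    out = np.zeros_like(r2)
    nz = r2 > 0
    out[nz] = r2[nz] * np.log(r2[nz])
    return out

def fit_tps(src, dst):
    """src, dst: (n, 2) control points and their correspondences."""
    n = src.shape[0]
    r2 = np.sum((src[:, None, :] - src[None, :, :]) ** 2, axis=2)
    K = tps_kernel(r2)
    P = np.hstack([np.ones((n, 1)), src])
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    coef = np.linalg.solve(L, rhs)          # rows 0..n-1: W, rows n..n+2: affine part
    W, affine = coef[:n], coef[n:]
    bending = np.trace(W.T @ K @ W)
    return W, affine, bending

src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
dst = src @ np.array([[1.1, 0.0], [0.0, 0.9]]) + 0.2    # purely affine target
_, _, E = fit_tps(src, dst)
print(abs(E) < 1e-8)    # True: an affine deformation has (numerically) zero bending
```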
Actually, E(f ) is one of the driving forces of the matching algorithm for

the distributional shape model because the registration of multiple shapes


(probabilistic atlas construction problem) may be formulated as finding
$$Z = \arg\min_{\mu^1\cdots\mu^N}\ \underbrace{H\!\left(\sum_{c=1}^N \pi_c p_c\right) - \sum_{c=1}^N \pi_c H(p_c)}_{JS_\pi(p_1,\ldots,p_N)} + \sum_{c=1}^N E(f^c) \qquad (4.46)$$

where $JS_\pi(p_1,\ldots,p_N)$ is the Jensen–Shannon divergence with respect to the pdfs $p_1,\ldots,p_N$ and weight vector $\pi = (\pi_1,\ldots,\pi_N)$, where $\sum_{c=1}^N \pi_c = 1$ and $\pi_c \geq 0\ \forall \pi_c$. Thus, a good alignment must satisfy that the aligned shapes have
similar pdfs, that is, similar mixtures. A first way of measuring (quantifying)
the similarities between pdfs is to use Jensen–Shannon (JS) divergence, as
we did in the previous chapter to drive Active Polygons. The first interesting
property of the JS measure in this context is that the pdfs can be weighted,
which takes into account the different number of points encoding each shape.
The second one is that JS allows the use of different number of cluster centers
in each point set.

4.4.2 Multiple Registration and Jensen–Shannon Divergence

Beyond the intuition of the use of JS divergence as a measure of compatibility


between multiple pdfs, it is interesting to explore, more formally, the connec-
tion between the maximization of the log-likelihood ratio and the minimization
of JS divergence. This is not surprising if one considers that the JS can be
defined as the Kullback–Leibler divergence between the convex combination
of pdfs and the pdfs themselves. Here, we start by taking $n_1$ i.i.d. samples from $p_1$, $n_2$ from $p_2$, and so on, so that $M = \sum_{c=1}^N n_c$. Then, if the weight of each pdf is defined as $\pi_c = \frac{n_c}{\sum_b n_b}$ (normalizing the number of samples), the likelihood ratio between the pooled distribution (the pdf resulting from the convex combination using the latter weights) and the product of all the pdfs is defined as

$$\Lambda = \frac{\prod_{k=1}^{M}\left[\sum_{c=1}^{N}\pi_c\, p_c(X_k)\right]}{\prod_{c=1}^{N}\prod_{k_c=1}^{n_c} p_c(X^c_{k_c})}, \qquad (4.47)$$

where $X_k$ is one of the i.i.d. samples from the pooled distribution, which is mapped to the set of pooled samples consisting of $\{X^c_{k_c}: k_c = 1,\ldots,n_c,\ c = 1,\ldots,N\}$. Then, taking the logarithm of the likelihood ratio, we have

$$\log\Lambda = \sum_{k=1}^{M}\log\left[\sum_{c=1}^{N}\pi_c\, p_c(X_k)\right] - \sum_{c=1}^{N}\sum_{k_c=1}^{n_c}\log p_c(X^c_{k_c}) \qquad (4.48)$$

In information theory, the version of the weak law of large numbers (convergence in probability of $\frac{1}{n}\sum_{i=1}^n X_i$ toward the average E(X) when the number

of i.i.d. samples n is large enough) is called the Asymptotic Equipartition Property (AEP):

$$-\frac{1}{n}\log p(X_1,\ldots,X_n) = -\frac{1}{n}\sum_{i=1}^{n}\log p(X_i) \to -E(\log p(X)) = H(X) \qquad (4.49)$$

where the arrow (→) indicates convergence in probability:


$$\lim_{n\to\infty} Pr\left\{\left|-\frac{1}{n}\log p(X_1,\ldots,X_n) - H(X)\right| > \epsilon\right\} = 0, \quad \forall\epsilon > 0 \qquad (4.50)$$

Therefore, if each $n_c$ is large enough, we have

$$\log\Lambda = -M\,H\!\left(\sum_{c=1}^{N}\pi_c p_c\right) + \sum_{c=1}^{N} n_c H(p_c) = -M\left[H\!\left(\sum_{c=1}^{N}\pi_c p_c\right) - \sum_{c=1}^{N}\frac{n_c}{M} H(p_c)\right] = -M\,JS_\pi(p_1,\ldots,p_N) \qquad (4.51)$$

Consequently, maximizing the log-likelihood is equivalent to minimizing the


JS divergence. In order to proceed to this minimization, we must express each
pc conveniently, adapting Eq. 4.39 we have
$$p_c = p_c(x|V^c,\mu^c) = \frac{1}{K^c}\sum_{a=1}^{K^c} p(x|f^c(v_a^c)) = \frac{1}{K^c}\sum_{a=1}^{K^c} G(x - f^c(v_a^c), \sigma^2 I) \qquad (4.52)$$

that is, we are assuming that $\alpha_a^c = \frac{1}{K^c}$, being $K^c$ the number of clusters of the cth distribution (uniform weighting) and, also for simplicity, a diagonal and identical covariance matrix $\Sigma_a = \sigma^2 I$. In addition, the deformed cluster centers are denoted by $u_a^c = f^c(v_a^c)$.
Then, given again one of the i.i.d. samples from the pooled distribution $X_k$, mapped to one element of the pooled set $\{X^c_{k_c}: k_c = 1,\ldots,n_c,\ c = 1,\ldots,N\}$ where $\{X^c_{k_c}\}$ is the set of $n_c$ samples for the cth shape, the entropies of both the individual distributions and the pooled one may be estimated by applying again the AEP. For the individual distributions, we have

$$H(p_c) = -\frac{1}{n_c}\sum_{k_c=1}^{n_c}\log p_c(X^c_{k_c}) = -\frac{1}{n_c}\sum_{k_c=1}^{n_c}\log\left[\frac{1}{K^c}\sum_{a=1}^{K^c} G(X^c_{k_c} - u_a^c, \sigma^2 I)\right] \qquad (4.53)$$
Regarding the convex combination and choosing $\pi_c = \frac{K^c}{K}$, being $K = \sum_c K^c$ the total number of cluster centers in the N point sets, we have

$$\sum_{c=1}^{N}\pi_c p_c = \sum_{c=1}^{N}\pi_c\,\frac{1}{K^c}\sum_{a=1}^{K^c} G(x - u_a^c, \sigma^2 I) = \frac{1}{K}\sum_{j=1}^{K} G(x - u_j, \sigma^2 I) \qquad (4.54)$$
being $X_k$ the pooled samples and $u_j$, $j = 1,\ldots,K$, the pooled (deformed) cluster centers. Consequently, the convex combination can be seen as a Gaussian mixture whose Gaussians are centered on the deformed cluster centers. This simplifies the
estimation of the Shannon entropy, from AEP, for the convex combination:

$$H\!\left(\sum_{c=1}^{N}\pi_c p_c\right) = H\!\left(\frac{1}{K}\sum_{j=1}^{K} G(x - u_j, \sigma^2 I)\right) = -\frac{1}{M}\sum_{j=1}^{M}\log\left[\frac{1}{K}\sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I)\right] \qquad (4.55)$$

being $M = \sum_c n_c$ the sum of all samples. Therefore, the JS divergence is estimated by

$$JS_\pi(p_1,\ldots,p_N) = H\!\left(\sum_{c=1}^{N}\pi_c p_c\right) - \sum_{c=1}^{N}\pi_c H(p_c) \approx -\frac{1}{M}\sum_{j=1}^{M}\log\left[\frac{1}{K}\sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I)\right] + \sum_{c=1}^{N}\frac{K^c}{n_c K}\sum_{k_c=1}^{n_c}\log\left[\frac{1}{K^c}\sum_{a=1}^{K^c} G(X^c_{k_c} - u_a^c, \sigma^2 I)\right] \qquad (4.56)$$
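The following sketch (ours, not the authors' code) implements the sample-based estimate of Eq. 4.56 for two 2D point sets, using the points themselves as cluster centers and isotropic Gaussians; the helper names and parameter values are illustrative assumptions.

```python
# Minimal sketch: plug-in estimate of the JS divergence of Eq. 4.56.
import numpy as np

def gauss(x, centers, sigma):
    d = x.shape[1]
    sq = np.sum((x[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    return np.exp(-0.5 * sq / sigma ** 2) / ((2 * np.pi * sigma ** 2) ** (d / 2))

def js_estimate(point_sets, centers_per_set, sigma=0.1):
    pooled_x = np.vstack(point_sets)
    pooled_u = np.vstack(centers_per_set)
    K = len(pooled_u)
    # entropy of the pooled mixture (first term of Eq. 4.56)
    h_pool = -np.mean(np.log(gauss(pooled_x, pooled_u, sigma).mean(axis=1) + 1e-300))
    # weighted entropies of the individual mixtures (second term)
    h_ind = 0.0
    for X, U in zip(point_sets, centers_per_set):
        pc = gauss(X, U, sigma).mean(axis=1)
        h_ind += (len(U) / K) * (-np.mean(np.log(pc + 1e-300)))
    return h_pool - h_ind

rng = np.random.default_rng(1)
A = rng.normal(0, 0.2, (60, 2))
B_aligned, B_shifted = A + rng.normal(0, 0.02, A.shape), A + 1.0
print(js_estimate([A, B_aligned], [A, B_aligned]) <
      js_estimate([A, B_shifted], [A, B_shifted]))   # True: aligned sets have lower JS
```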

Then, after the latter approximation, we may compute the partial derivatives
needed to perform a gradient descent of JS divergence with respect to the
deformation parameters:
$$\nabla JS = \left(\frac{\partial JS}{\partial\mu^1}, \frac{\partial JS}{\partial\mu^2}, \ldots, \frac{\partial JS}{\partial\mu^N}\right)^T \qquad (4.57)$$

and then, for the state vector (current parameters) $\Theta = (\mu^{1T},\ldots,\mu^{NT})^T$ and the weighting vector $\Gamma = (\gamma^{1T},\ldots,\gamma^{NT})^T$, we have, for instance

$$\Theta^{t+1} = \Theta^{t} - \Gamma \otimes \nabla JS(\Theta^{t}) \qquad (4.58)$$



where ⊗ is the Kronecker product. For the sake of modularity, a key element
is to compute the derivative of a Gaussian with respect to a μc :
$$\frac{\partial G(X^c_{k_c} - u_a^c, \sigma^2 I)}{\partial\mu^c} = \frac{1}{\sigma^2} G(X^c_{k_c} - u_a^c, \sigma^2 I)(X^c_{k_c} - u_a^c)\frac{\partial u_a^c}{\partial\mu^c} = \frac{1}{\sigma^2} G(X^c_{k_c} - u_a^c, \sigma^2 I)(X^c_{k_c} - u_a^c)\frac{\partial f^c(v_a^c,\mu^c)}{\partial\mu^c} \qquad (4.59)$$
Therefore,

$$\frac{\partial JS}{\partial\mu^c} = -\frac{1}{M}\sum_{j=1}^{M}\frac{1}{\sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I)}\sum_{a=1}^{K}\frac{\partial G(X_j - u_a, \sigma^2 I)}{\partial\mu^c} + \frac{K^c}{n_c K}\sum_{k_c=1}^{n_c}\frac{1}{\sum_{a=1}^{K^c} G(X^c_{k_c} - u_a^c, \sigma^2 I)}\sum_{a=1}^{K^c}\frac{\partial G(X^c_{k_c} - u_a^c, \sigma^2 I)}{\partial\mu^c}$$

$$= -\frac{1}{M}\sum_{j=1}^{M}\frac{1}{\sum_{a=1}^{K} G(X_j - u_a, \sigma^2 I)}\sum_{a=1}^{K^c}\frac{\partial G(X_j - u_a^c, \sigma^2 I)}{\partial\mu^c} + \frac{K^c}{n_c K}\sum_{k_c=1}^{n_c}\frac{1}{\sum_{a=1}^{K^c} G(X^c_{k_c} - u_a^c, \sigma^2 I)}\sum_{a=1}^{K^c}\frac{\partial G(X^c_{k_c} - u_a^c, \sigma^2 I)}{\partial\mu^c} \qquad (4.60)$$
Starting from estimating the cluster centers $V^c$ for each shape $X^c$ using a clustering algorithm (see next chapter), the next step consists of performing gradient descent (or its conjugate version), initializing $\Theta^0$ with zeros. Either numerical or analytical gradient descent can be used, though the analytical version
performs better with large deformations. Recall that the expressions for the
gradient do not include the regularization term, which only depends on the
nonrigid parameters. Anyway, a re-sampling must be done every couple of iter-
ations. After convergence, the deformed point-sets are close to each other, and
then it is possible to recover Z (mean shape) from the optimal deformation
parameters. For the sake of computational efficiency, two typical approaches
may be taken (even together). The first one is working with optimization
epochs, that is, find the affine parameters before finding the nonrigid ones
together with the re-estimation of the affine parameters. The second one is
hierarchical optimization, especially proper for large sets of shapes, that is,
large N . In this latter case, M is divided into m subsets. The algorithm is
applied to each subset separately and then the global atlas is found. It can be
proved (see Prob. 4.9) that minimizing the JS divergence with respect to all
M is equivalent to minimizing it with respect to the atlases obtained for each
of the m subsets. The mathematical meaning is that the JS divergence is unbiased
and, thus, the hierarchical approach is unbiased too. In general terms, the
algorithm works pretty well for atlas construction (see Fig. 4.20), the regular-
ization parameter λ being relatively stable in the range [0.0001, 0.0005]. With

Fig. 4.20. Registering deformations with respect to the atlas and the atlas itself.
Point Set1 to Point Set7 show the point-sets in their original state (‘o’) and after
registration with the atlas (‘+’). Final registration is shown at the bottom right. Figure by F. Wang, B.C. Vemuri, A. Rangarajan and S.J. Eisenschenk (©2008 IEEE).

respect to the robustness (presence of outliers) of the algorithm, it has been


proved to be more stable than previous approaches like TPS-RPM [40] or clas-
sical registration algorithms like ICP (Iterative Closest Point) [22]. Finally,
this method has been extended to 3D, which is a straightforward task.
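As a toy illustration of this kind of procedure (ours, and far simpler than the actual algorithm: the deformation is restricted to a pure translation and a plain numerical gradient is used), the following sketch aligns one point set to another by gradient descent on the JS estimate; it reuses the gauss and js_estimate helpers from the sketch after Eq. 4.56, which is an assumption of this illustration.

```python
# Minimal sketch: translation-only registration by descending the estimated JS.
# Assumes gauss() and js_estimate() from the sketch after Eq. 4.56 are in scope.
import numpy as np

def numeric_grad(fun, theta, eps=1e-3):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (fun(theta + e) - fun(theta - e)) / (2 * eps)
    return g

def register_translation(moving, fixed, steps=300, lr=0.1, sigma=0.2):
    def cost(t):
        M = moving + t
        return js_estimate([M, fixed], [M, fixed], sigma)
    t = np.zeros(2)
    for _ in range(steps):
        t -= lr * numeric_grad(cost, t)
    return t

rng = np.random.default_rng(2)
fixed = rng.normal(0, 0.3, (80, 2))
moving = fixed + np.array([0.4, -0.3])                 # known offset
# the recovered shift should approximately cancel the known offset [0.4, -0.3]
print(np.round(register_translation(moving, fixed), 1))
```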

4.4.3 Information Geometry and Fisher–Rao Information

The distributional shape model described above yields some degree of by-
passing the explicit computation of matching fields. Thus, when exploiting JS
divergence, the problem is posed in a parametric form (find the deformation
parameters). Furthermore, a higher degree of bypassing can be reached if
we exploit the achievements of information geometry. Information geometry
deals with mathematical objects in statistics and information theory within
the context of differential geometry [3, 5]. Differential geometry, in turn, stud-
ies structures like differentiable manifolds, that is, manifolds so smooth that
they can be differentiated. This kind of manifold is studied by Riemannian geometry and, thus, such manifolds are dubbed Riemannian manifolds. However,
it may be a long shot to introduce them formally without any previous in-
tuition. A closer concept to the reader is the one of statistical manifold, that
is a manifold of probability distributions, like one-dimensional Gaussian dis-
tributions. Such manifolds are induced by the parameters characterizing a
given family of distributions, and in the case of one-dimensional Gaussians,
the manifold is two dimensional θ = (μ, σ). Then, if a point in the manifold is

a probability distribution, what kind of metric is the manifold equipped with?


For multi-parameterized distributions θ = (θ1 , θ2 , . . . , θN ), the Fisher infor-
mation matrix is a metric in a Riemannian manifold. The (i, j) component of
such matrix is given by
$$g_{ij}(\theta) = \int p(x|\theta)\left[\frac{\partial}{\partial\theta_i}\log p(x|\theta)\right]\left[\frac{\partial}{\partial\theta_j}\log p(x|\theta)\right]dx = E\left[\frac{\partial}{\partial\theta_i}\log p(x|\theta)\,\frac{\partial}{\partial\theta_j}\log p(x|\theta)\right] \qquad (4.61)$$
For the trivial case of the one-dimensional Gaussian, we have

$$\ln p(x|\mu,\sigma) = -\ln\sqrt{2\pi} + \ln\frac{1}{\sigma} - \frac{(x-\mu)^2}{2\sigma^2}$$

$$\frac{\partial}{\partial\mu}\ln p(x|\theta) = \frac{(x-\mu)}{\sigma^2} \qquad (4.62)$$

$$\frac{\partial}{\partial\sigma}\ln p(x|\theta) = -\frac{1}{\sigma} + \frac{(x-\mu)^2}{\sigma^3} = \frac{(x-\mu)^2 - \sigma^2}{\sigma^3}$$

Therefore, we obtain

$$g(\mu,\sigma) = \begin{pmatrix} E\left[\left(\frac{x-\mu}{\sigma^2}\right)^2\right] & E\left[\frac{(x-\mu)^3 - (x-\mu)\sigma^2}{\sigma^5}\right] \\ E\left[\frac{(x-\mu)^3 - (x-\mu)\sigma^2}{\sigma^5}\right] & E\left[\left(\frac{(x-\mu)^2 - \sigma^2}{\sigma^3}\right)^2\right] \end{pmatrix} \qquad (4.63)$$
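The following sketch (ours) estimates this matrix by Monte Carlo, as the sample second moment of the score, and compares it with the closed form g(μ, σ) = diag(1/σ², 2/σ²) obtained by taking the expectations in Eq. 4.63.

```python
# Minimal sketch: Fisher information of a 1D Gaussian, Monte Carlo vs. closed form.
import numpy as np

mu, sigma = 1.0, 2.0
x = np.random.default_rng(0).normal(mu, sigma, 200000)

# score components from Eq. 4.62
score = np.stack([(x - mu) / sigma ** 2,
                  ((x - mu) ** 2 - sigma ** 2) / sigma ** 3], axis=1)
g_mc = score.T @ score / len(x)            # sample estimate of E[score score^T]

g_exact = np.diag([1.0 / sigma ** 2, 2.0 / sigma ** 2])
print(np.round(g_mc, 2))
print(g_exact)                              # the two agree up to sampling noise
```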

In the general case, g(θ), the so-called Fisher–Rao metric tensor, is symmetric
and positive definite. But, in what sense g(θ) is a metric? To answer the
question we have to look at the differential geometry of the manifold because
this manifold is not necessarily flat as in Euclidean geometry. More precisely,
we have to focus on the concept of tangent space. In differential geometry, the
tangent space is defined at any point of the differential manifold and contains
the directions (vectors) for traveling from that point. For instance, in the case
of a sphere, the tangent space at a given point is the plane perpendicular to
the sphere radius that touches this and only this point. Consider the sphere
with constant curvature (inverse to the radius) as a manifold (which typically
has a smooth curvature). As we move from θ to θ + δθ, the tangent spaces
change and also the directions. Then, the infinitesimal curve length δs between
the latter to points in the manifold is defined as by the equation

δs2 = gij (θ)dδθ i δθ j = (δθ T )g(θ)δθ (4.64)
ij

which is exactly the Euclidean distance if g(.) is the identity matrix. This
is why using the Euclidean distance between the pdf parameters does

not generally take into account the geometry of the subspaces where these
distributions are embedded. For the case of the Kullback–Leibler diver-
gence, let us consider a one-dimensional variable x depending also on one
parameter θ:

$$D(p(x|\theta+\delta\theta)\|p(x|\theta)) \equiv D(p(\theta+\epsilon)\|p(\theta)) \equiv D(\epsilon) \approx D(\epsilon)|_{\epsilon=0} + \epsilon\, D'(\epsilon)|_{\epsilon=0} + \tfrac{1}{2}\epsilon^2 D''(\epsilon)|_{\epsilon=0} \qquad (4.65)$$

which corresponds to the Maclaurin expansion. It is obvious that D(ε)|_{ε=0} = 0.


On the other hand
 9
D () = x ∂p(x|θ+)
∂θ log p(x|θ+)
p(x|θ)
% (: (4.66)
1 ∂p(x|θ+) 1 ∂p(x|θ)
+ p(x|θ + ) p(x|θ+) ∂θ + p(x|θ) ∂θ

and consequently obtaining D ()|=0 = 0. Then, for the second derivative,


we have
 # ∂ 2 p(x|θ + ) p(x|θ + )
D () = 2
log
x
∂θ p(x|θ)
$
∂p(x|θ + ) 1 ∂p(x|θ + ) 1 ∂p(x|θ)
+ +
∂θ p(x|θ + ) ∂θ p(x|θ) ∂θ

∂ 2 p(x|θ + ) ∂p(x|θ + ) 1 p(x|θ + ) ∂p(x|θ)
+ − −
∂θ2 ∂θ p(x|θ) p(x|θ)2 ∂θ

p(x|θ + ) ∂ 2 p(x|θ + )
− (4.67)
p(x|θ) ∂θ2

Therefore

$$D''(\epsilon)|_{\epsilon=0} = \int_x\frac{\partial p(x|\theta)}{\partial\theta}\left[\frac{1}{p(x|\theta)}\frac{\partial p(x|\theta)}{\partial\theta} + \frac{1}{p(x|\theta)}\frac{\partial p(x|\theta)}{\partial\theta}\right] = 2\int_x\left[\frac{\partial\log p(x|\theta)}{\partial\theta}\right]\left[\frac{\partial p(x|\theta)}{\partial\theta}\right] \qquad (4.68)$$

$$= 2\int_x p(x|\theta)\left[\frac{\partial\log p(x|\theta)}{\partial\theta}\right]^2 = 2\,\underbrace{E\left[\left(\frac{\partial\log p(x|\theta)}{\partial\theta}\right)^2\right]}_{g(\theta)}$$

where g(θ) is the information that a random variable x has about a given
parameter θ. More precisely, the quantity

$$V(x) = \frac{\partial\log p(x|\theta)}{\partial\theta} = \frac{1}{p(x|\theta)}\frac{\partial p(x|\theta)}{\partial\theta} \qquad (4.69)$$

is called the score of a random variable. It is trivial to prove that its expec-
tation EV is 0 and, thus, g = EV 2 = var(V ) is the variance of the score.
Formally, this is a way of quantifying the amount of information about θ in the data. A trivial example is the fact that the sample average $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
is an estimator T (x) of the mean θ of a Gaussian distribution. Moreover, x̄
is also an unbiased estimator because the expected value of the error of the
estimator Eθ (T (x) − θ) is 0. Then, the Cramér–Rao inequality says that g(θ)
determines a lower bound of the mean-squared error var(T ) in estimating θ
from the data:
$$var(T) \geq \frac{1}{g(\theta)} \qquad (4.70)$$
The proof is trivial if one uses the Cauchy–Schwarz inequality over the prod-
uct of variances of V and T (see [43], p. 329). Thus, it is quite interesting
that D''(ε)|_{ε=0} ∝ g(θ). Moreover, if we return to the Maclaurin expansion (Eq. 4.65), we have

$$D(p(\theta+\epsilon)\|p(\theta)) \approx \epsilon^2\, g(\theta) \qquad (4.71)$$
whereas in the multi-dimensional case, we would obtain

$$D(p(\theta+\delta\theta)\|p(\theta)) \approx \frac{1}{2}(\delta\theta^T)\,g(\theta)\,\delta\theta \qquad (4.72)$$
which is pretty consistent with the definition of the squared infinitesimal arc-
length (see Eq. 4.64). Finally, the Cramér–Rao inequality for the multidimen-
sional case is
Σ ≥ g −1 (θ) (4.73)
Σ being the covariance matrix of a set of unbiased estimators for the param-
eter vector θ.
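As a quick numerical check of Eq. 4.72 (our illustration, using a Bernoulli distribution, for which both sides are easy to evaluate), the KL divergence between nearby parameters is compared with the quadratic form based on the Fisher information g(θ) = 1/(θ(1 − θ)):

```python
# Minimal sketch: KL(p(theta+delta)||p(theta)) vs. (1/2) delta^2 g(theta), Bernoulli.
import numpy as np

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta, delta = 0.3, 0.02
g = 1.0 / (theta * (1 - theta))              # Fisher information of a Bernoulli
print(kl_bernoulli(theta + delta, theta))     # exact KL
print(0.5 * delta ** 2 * g)                   # quadratic approximation
# the two agree closely; the discrepancy is O(delta^3)
```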

4.4.4 Dynamics of the Fisher Information Metric

In the latter section, we have presented the Fisher–Rao metric tensor and its
connection with the infinitesimal arc-length in a Riemannian manifold. Now,
as we want to travel from a probability distribution to another one through
the Riemannian manifold, we focus on the dynamics of the Fisher–Rao metric
tensor [34]. The underlying motivation is to connect, as smoothly as possible,
two consecutive tangent spaces from the origin distribution toward the final
distribution. Such connection is the affine connection. In the Euclidean space,
as the minimal distance between two points is the straight line, there is no need
of considering change of tangents at each point in between because this change
of direction does not occur in the Euclidean space. However, and quoting Prof.
Amari in a recent videolecture [4]: "These two concepts (minimality of distance, and straightness) coincide in the Riemannian geodesic." Thus, the geodesic is
the curve sailing through the surface of the manifold, whose tangent vectors
remain parallel during such transportation, and minimizes
$$E = \int_0^1 \sum_{ij} g_{ij}(\theta)\,\dot\theta_i\,\dot\theta_j\,dt = \int_0^1 \dot\theta^T g(\theta)\,\dot\theta\,dt \qquad (4.74)$$

being $\dot\theta_i = \frac{d\theta_i}{dt}$, and being t ∈ [0, 1] the parameterization of the geodesic path θ(t). Of course, the geodesic also minimizes $s = \int_0^1\sqrt{\sum_{ij} g_{ij}(\theta)\,\dot\theta_i\,\dot\theta_j}\,dt$, the
path length. Geodesic computation is invariant to reparameterization and to
coordinate transformations. Then, the geodesic can be obtained by minimizing
E through the application of the Euler–Lagrange equations characterizing the
affine connection¹

$$\frac{\delta E}{\delta\theta_k} = -2\sum_i g_{ki}\,\ddot\theta_i + \sum_{ij}\left(\frac{\partial g_{ij}}{\partial\theta_k} - \frac{\partial g_{ik}}{\partial\theta_j} - \frac{\partial g_{kj}}{\partial\theta_i}\right)\dot\theta_i\,\dot\theta_j = 0 \qquad (4.75)$$

being θ_k the kth parameter. The above equations can be rewritten as a function of the Christoffel symbols of the affine connection

$$\Gamma_{k,ij} = \frac{1}{2}\left(\frac{\partial g_{ik}}{\partial\theta_j} + \frac{\partial g_{kj}}{\partial\theta_i} - \frac{\partial g_{ij}}{\partial\theta_k}\right) \qquad (4.76)$$

as the following system of second-order differential equations

$$\sum_i g_{ki}\,\ddot\theta_i + \sum_{ij}\Gamma_{k,ij}\,\dot\theta_i\,\dot\theta_j = 0 \qquad (4.77)$$

We are assuming a probabilistic shape model with K elements in the mixture,


and a common variance σ 2 (free parameter). The only parameters to estimate
are the K 2D vectors μi corresponding to the point averages. Thus, each
shape is represented by N = 2K parameters θ = (μ1 , μ2 , . . . , μN )T . This
implies that the Fisher tensor is a N × N matrix, and also that the system
of second-order differential equations has N equations. However, it cannot be
solved analytically. In these cases, the first tool is the gradient descent:

$$\theta_k^{\tau+1}(t) = \theta_k^{\tau}(t) + \alpha^{\tau}\frac{\delta E}{\delta\theta_k^{\tau}(t)} \qquad (4.78)$$
being τ an iteration index and α a step parameter. If we discretize the time from t = 0 to t = 1 in equally spaced intervals, we initialize θ(0) with the
parameters of one of the shapes and θ(1) with the parameters of the other.
The value of E after finding all parameters yields the distance between both
shapes (the length of the geodesic).
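The following sketch (ours) evaluates the discretized path energy of Eq. 4.74 on the manifold of 1D Gaussians θ = (μ, σ), whose Fisher metric was derived above; it illustrates why geodesics between distant means bend toward larger variances. The two candidate paths are arbitrary choices made for the example.

```python
# Minimal sketch: discretized Fisher path energy for 1D Gaussian parameters.
import numpy as np

def path_energy(path):
    """path: (T, 2) array of (mu, sigma) points from theta(0) to theta(1)."""
    E = 0.0
    for t in range(len(path) - 1):
        d = path[t + 1] - path[t]                  # finite-difference velocity
        sig = path[t, 1]
        g = np.diag([1.0 / sig ** 2, 2.0 / sig ** 2])
        E += d @ g @ d
    return E

# A straight line in parameter space versus a path that bulges through larger
# sigma: in this geometry the bulged path is cheaper, which is why geodesics
# between distant means bend toward high variance.
a, b = np.array([-3.0, 0.5]), np.array([3.0, 0.5])
T = 200
line = np.linspace(a, b, T)
bulge = line.copy()
bulge[:, 1] += 2.0 * np.sin(np.linspace(0, np.pi, T))   # raise sigma mid-path
print(path_energy(line), path_energy(bulge))             # the bulged path is lower
```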
Once the geodesics between two shapes are found, it is interesting to find a way of translating these geodesics to the shape space in such a way that the result is as smooth as possible, while preserving the
1
Due to the tight connection with relativity theory, the usual notation for these for-
mulas follow Einstein summation convention. We apply here an unusual notation
for the sake of clarity.

likelihood. This is called finding the extrinsic deformations induced by the intrinsic geodesics. Likelihood preservation is imposed by setting to zero the derivative of the log-likelihood with respect to time. The parameters of the mixture are a sequence of 2D vectors, each one representing a landmark point μ = (θ_1, θ_2). The derivative of the log-likelihood referred to that pair of parameters is

$$\frac{d\log L(x|\mu)}{dt} = (\nabla_{\theta_1}\log L)^T\dot\theta_1 + (\nabla_{\theta_2}\log L)^T\dot\theta_2 + \frac{\partial L}{\partial x_1(t)}\,u - \frac{\partial L}{\partial x_2(t)}\,v = 0, \qquad (4.79)$$
where $u = \frac{dx_1}{dt}$ and $v = \frac{dx_2}{dt}$ define the vector flow induced by the intrinsic deformation. Therefore, if we enforce smoothing via regularization (thin-plates, for instance), the functional to minimize is the following:

$$E = \int\left[\left(\frac{d\log L(x|\mu)}{dt}\right)^2 + \lambda\left((\nabla^2 u)^2 + (\nabla^2 v)^2\right)\right]dx$$

Example results are shown in Fig. 4.21. When the metric is based on other tensors like the α-order entropy metric tensor, for which closed-form solutions are available, the results are not as smooth as when the Fisher matrix is used. The α-order metric tensor, for α = 2, is defined as

$$g_{ij}^{\alpha}(\theta) = \int\left(\frac{\partial p(x)}{\partial\theta_i}\right)\left(\frac{\partial p(x)}{\partial\theta_j}\right)dx \qquad (4.80)$$


Fig. 4.21. Shape matching results with information geometry. Left: Fisher information (top) vs. α-entropy (bottom). Dashed lines are the initializations and solid ones are the final geodesics. Note that the Fisher geodesics are smoother. Right: table of distances comparing pairs of shapes 1–9 presented in Fig. 4.19. Figure by A. Peter and A. Rangarajan (©2006 Springer).

An alternative way of solving the above minimization problem is to


discretize time into T values and compute the gradients of the following function:

$$E = \sum_{t=1}^{T}\dot\theta(t)^T g(\theta(t))\,\dot\theta(t) \qquad (4.81)$$

However, in [114] it is noted that the derivatives in RN ×T are quite unstable,


unless, at each t, we identify the orthogonal directions that are most likely to contribute to the energy, and then use only the partial derivatives along these directions to compute the gradient. More precisely, as θ ∈ R^N, we diagonalize the N × N tensor g(θ), which is symmetric, to obtain the orthonormal basis given by the eigenvectors {φ_1, . . . , φ_N}, assuming that their corresponding eigenvalues are ordered λ_1 > λ_2 > · · · > λ_N. Let then θ(t) be the unknown point in the geodesic, {φ_j(t), j = 1, . . . , N} its eigenvectors and {λ_j(t), j = 1, . . . , N} its eigenvalues. The magnitude of each eigenvalue
denotes the importance of each of the directions represented by the corre-
sponding eigenvector. Consequently, it is reasonable to truncate the basis to
P ≤ N eigenvectors attending to retain a given percentage of variability,
and thus, of the energy E(t) (at time t). Consequently, if the percentage is
fixed, P may vary along time. Then we have the following orthonormal vec-
tors {φj (t) j = 1, . . . , P }. Let then Vj (t) ∈ RN ×T , j = 1, . . . , P be the vector
(0, . . . , 0, φj (t), 0, . . . , 0)T . Given Vj (t), we may define

$$\partial_{jt}E = \frac{E(\theta + \delta V_j(t)) - E(\theta)}{\delta} \qquad (4.82)$$
and then, define an approximation of the gradient:

$$\nabla E(\theta) = \frac{\delta E}{\delta\theta} \approx \sum_{j=1}^{P}\sum_{t=1}^{T}(\partial_{jt}E)\,V_j(t) \qquad (4.83)$$

The gradient search is initialized with the linear path between the shapes and
proceeds by deforming the path until convergence.

4.5 Structural Learning with MDL

4.5.1 The Usefulness of Shock Trees

Almost all the data sources used in this book, and particularly in the present
chapter, have a vectorial origin. Extending the matching and learning tech-
niques to the domains of graphs is a hard task, which is commencing to in-
tersect with information theory. Let us consider here the simpler case of trees
where we have a hierarchy which, in addition, is herein assumed to be sam-
pled correctly. Trees have been recognized as good representations for binary


Fig. 4.22. Examples of shock graphs extracted from similar shapes and the result of the matching algorithm. Figure by K. Siddiqi, A. Shokoufandeh, S.J. Dickinson and S.W. Zucker (©1999 Springer).

shapes, and tree-versions of the skeleton concept lead to the shock tree [145].
A shock tree is rooted in the latest characteristic point (shock) found while
computing the skeleton (see Fig. 4.22). Conceptually, the shock graph is built
by reversing the grassfire transform: as a point lies deeper and deeper inside the shape, its position in the tree hierarchy is higher and higher. Roughly speaking, the tree is built as follows: (i) get the shock point and its branch in the skele-
ton; (ii) the children of the root are the shocks closer in time of formation
and reachable through following a path along a branch; (iii) repeat assuming
that each child is a parent; (iv) the leaves of the tree are terminals without
a specific shock label (that is, each earlier shock generates a unique child, a
terminal node). Using a tree instead of a graph for representing a shape has
clear computational advantages because they are easier to match and also to
learn than graphs.

4.5.2 A Generative Tree Model Based on Mixtures


The unsupervised learning of trees (and consequently of the shapes they rep-
resent or encode) is the main topic of this section. Consider that we observe a
tree t which is coherent with a model H. We commence by assuming that the probability of observing the tree given the model is a mixture of k conditional probabilities of observing the tree given models T_1, . . . , T_k [156]:

$$P(t|\mathcal{M}) = \sum_{c=1}^{k}\alpha_c P(t|T_c) \qquad (4.84)$$

As we have noted above, for the sake of simplicity, sampling errors affect
the nodes, but hierarchical relations are preserved. Each tree model Tc is
defined by a set of nodes Nc , a tree order Oc , and the observation probabilities
assigned to the nodes of a tree coherent with model Tc are grouped into the set
Θc = {θc,i } : i ∈ Nc . Therefore, the probability of observing a node i in the set
of trees assigned to the model (class) c is denoted by θc,i . A simple assumption
consists of considering that the nodes of t are independent Bernoulli samples
from the model Tc . The independence assumption leads to the factorization
of P (t|Tc ). If t is an observation of Tc , we have
 
$$P(t|T_c) = \prod_{i\in N^t}\theta_{c,i}\prod_{j\in N_c\setminus N^t}(1-\theta_{c,j}) \qquad (4.85)$$

and otherwise P (t|Tc ) = 0. In order to avoid unconnected tree samples, it


is also assumed that the root is always observed, that is, it is observed with
probability 1. Consequently, structural noise can only eliminate a node (typ-
ically a leaf), provided that the remaining tree is connected. However, the
correspondences between nodes of different trees are not known beforehand
and must be inferred. Such correspondences must satisfy the hierarchical con-
straints (if two nodes belong to the same hierarchy level-order, then their
images through the correspondence should also belong to the same order).
Then, an interesting question is how to both estimate the models H, T1 , . . . , Tk
that best fit the observed trees D = {t1 , t2 , . . . , tN }, and the correspondences
C which map nodes of different trees. It is important to note here that the
resulting models depend on the correspondences found. The underlying idea
is that if a given substructure or structural relation is highly preserved in
the data (very frequent), it is reasonable to find it into the model. Regarding
information theory, the minimum description length criterion is quite useful
here in order to quantify the complexity of a tree model. As we have seen in
the previous chapter, the MDL is the sum of the negative of the log-likelihood
(cost of describing the data) and the cost of describing the model. The cost,
if describing a tree t given the tree model Tc and the correspondences C, is
defined by a finite number when the hierarchy constraints are satisfied:
 
$$-\log P(t|T_c,C) = -\sum_{i\in Im(C)}\log\theta_{c,i} - \sum_{j\in N_c\setminus Im(C)}\log(1-\theta_{c,j}) \qquad (4.86)$$

being Im(.) the image of the correspondence, and it is assumed to be ∞


otherwise. Then, the cost of describing a data set D given H, T1 , . . . , Tk and
C is defined as
 k 
  
L(D|H, C) = − log P (t|H, C) = − log αc P (t|Tc , C) (4.87)
t∈D t∈D c=1

Adding another simplifying assumption (a tree t can only be the observation


of a unique model), we have hidden binary variables that link trees and models:

zct = 1 if the tree t is observed from model Tc , and zct = 0 otherwise. This
simplifies the cost of describing the data:

$$L(D|z,H,C) = -\sum_{t\in D}\sum_{c=1}^{k} z_c^t\log P(t|T_c,C) = -\sum_{c=1}^{k}\sum_{t\in D_c}\log P(t|T_c,C) \qquad (4.88)$$

being Dc = {t ∈ D : zct = 1}. On the other hand, the cost of describing the full
model H is built on three components. The first one comes from quantifying
the cost of encoding the observation probabilities Θ̂c :

$$L(\hat\Theta_c|z,H,C) = \sum_{c=1}^{k}\frac{n_c}{2}\log(m_c) \qquad (4.89)$$

being $n_c$ the number of nodes in the model $T_c$ and $m_c = \sum_{t\in D} z_c^t$ the number of trees assigned to the model $T_c$. The second component comes from quantifying the cost of the mapping implemented by z:

$$L(z|H,C) = -\sum_{c=1}^{k} m_c\log\alpha_c \qquad (4.90)$$

The third component concerns the coding of the correspondence C. Assuming


equally likely correspondences, the main remaining question is to estimate
properly the number of such correspondences. If one takes the number of or-
dered trees (trees with a number in each node where the hierarchy imposes
a partial order) with n nodes, the cost is overestimated because in this con-
text only unordered trees are considered. For this wider class of trees, the
Wedderburn–Etherington numbers are commonly used in graph theory. Their
asymptotic behavior is exponential: 1, 1, 1, 2, 3, 6, 11, 23, 46, 98, 207, 451, 983,
2,179, 4,850, 10,905, 24,631, 56,011, 127,912, 293,547, . . .. Thus, the logarithm
may be given by ng + const., where g = (ζ + λ), ζ ≈ 1.31 and λ is a prior
term. Therefore, integrating the three components we have


k
k−1
L(H|C) = g nc + log(m) + const. (4.91)
c=1
2

where nc is the number of nodes in model Tc , and the second term with
m = |D| is the cost of describing the mixing parameters α. Dropping the
constant, the MDL in this context is given by

$$MDL(D,H|C) = \sum_{c=1}^{k}\left[-\sum_{t\in D_c}\log P(t|T_c,C) + n_c\left(\frac{\log(m_c)}{2} + g\right) - m_c\log\alpha_c\right] + \frac{k-1}{2}\log(m). \qquad (4.92)$$

Considering the log-likelihood, let $K_{c,i}^t = \{j \in N^t\,|\,C(j) = i\}$ be the set of nodes in $D_c$ for which there is a correspondence with a node $i \in N_c$. One-to-one matching constraints result in having singletons or empty sets for $K_{c,i}^t$ (its size is either one or zero, respectively). Then, if $l_{c,i} = \sum_{t\in D_c}|K_{c,i}^t|$ is the number of trees in the data mapping a node to i, and $m_c = |D_c|$ is the number of trees assigned to $T_c$, then the maximum likelihood estimation of the sampling probability under the Bernoulli model is $\theta_{c,i} = \frac{l_{c,i}}{m_c}$. Then, we have


$$\sum_{t\in D_c}\log P(t|T_c,C) = \sum_{i\in N_c} m_c\left[\frac{l_{c,i}}{m_c}\log\frac{l_{c,i}}{m_c} + \left(1-\frac{l_{c,i}}{m_c}\right)\log\left(1-\frac{l_{c,i}}{m_c}\right)\right] = \sum_{i\in N_c} m_c\left[\theta_{c,i}\log\theta_{c,i} + (1-\theta_{c,i})\log(1-\theta_{c,i})\right] = -\sum_{i\in N_c} m_c H(\theta_{c,i}) \qquad (4.93)$$

that is, the individual log-likelihood with respect to each model is given by the
weighted entropies of the individual Bernoulli ML estimations corresponding
to each node. Furthermore, we estimate the mixing proportions $\alpha_c$ as the observed frequency ratio, that is, $\alpha_c = \frac{m_c}{m}$. Then, defining $H(\alpha) = -\sum_{c=1}^{k}\alpha_c\log\alpha_c$, we have the following resulting MDL criterion:

$$MDL(D,H|C) = \sum_{c=1}^{k}\sum_{i\in N_c}\left[m_c H(\theta_{i,c}) + \frac{\log m_c}{2} + g\right] + mH(\alpha) + \frac{k-1}{2}\log(m). \qquad (4.94)$$
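As an illustration (ours, not the authors' code), the following sketch evaluates the terms of Eq. 4.94 from node observation counts. Two simplifying assumptions are made: the constant g is taken as ζ ≈ 1.31 (that is, the prior term λ is set to zero), and m_c is recovered from the largest node count, since the root is always observed.

```python
# Minimal sketch: MDL of a mixture of tree models from node observation counts.
import numpy as np

def bernoulli_entropy(theta):
    t = np.clip(theta, 1e-12, 1 - 1e-12)
    return -(t * np.log(t) + (1 - t) * np.log(1 - t))

def mdl(counts_per_model, g=1.31):
    """counts_per_model: list of arrays l_{c,i}; m_c is taken as the max count."""
    total, m, m_cs = 0.0, 0, []
    for l in counts_per_model:
        m_c = int(l.max())                                    # root count = #trees
        theta = l / m_c
        total += m_c * bernoulli_entropy(theta).sum()         # data cost (Eq. 4.93)
        total += len(l) * (0.5 * np.log(m_c) + g)             # per-node model cost
        m_cs.append(m_c)
        m += m_c
    alpha = np.array(m_cs) / m
    total += m * (-(alpha * np.log(alpha)).sum())             # mixing cost m H(alpha)
    total += 0.5 * (len(counts_per_model) - 1) * np.log(m)
    return total

# Crisp node probabilities describe the same data more cheaply than noisy ones.
crisp = [np.array([10, 10, 10]), np.array([8, 8, 8])]
noisy = [np.array([10, 5, 5]), np.array([8, 4, 4])]
print(mdl(crisp) < mdl(noisy))   # True
```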

4.5.3 Learning the Mixture

How to learn k tree models from the N input trees in D so that the MDL cost is minimized? A quite efficient method is to pose the problem in terms
of agglomerative clustering (see a good method for this in the next chapter).
The process is illustrated in Fig. 4.22. Start by having N clusters, that is, one
cluster/class per tree. This implies assigning a tree model per sample tree. In
each tree, each node has unit sample probability. Then we create one mixture
component per sample tree, and compute the MDL cost for the complete
model assuming an equiprobable mixing probability. Then, proceed by taking
all possible pairs of components and compute their tree unions. The
mixing proportions are equal to the sum of proportions of the individual
unions. Regarding the sampling probabilities of merged nodes, let mi = |Di |

and mj = |Dj | the number of tree samples assigned respectively to Ti and Tj .


Let also $l_u$ and $l_v$ be the number of times the nodes $u \in T_i$ and $v \in T_j$ are observed in $D_i$ and $D_j$, respectively. Then, if the nodes are not merged, their sampling probabilities are $\theta_u = \frac{l_u}{m_i+m_j}$ and $\theta_v = \frac{l_v}{m_i+m_j}$. However, if the nodes are merged, the sampling probability of the resulting node is $\theta_{uv} = \frac{l_u+l_v}{m_i+m_j}$.
For the sake of computational efficiency, in [156] the concept of minimum description length advantage (MDLA) for two nodes is defined in the following terms:

$$MDLA(u,v) = MDL_u + MDL_v - MDL_{uv} = (m_i + m_j)\left[H(\theta_u) + H(\theta_v) - H(\theta_{uv})\right] + \frac{1}{2}\log(m_i + m_j) + g \qquad (4.95)$$

and then, the MDLA for a set of merges M is



$$MDLA(\mathcal{M}) = \sum_{(u,v)\in\mathcal{M}} MDLA(u,v) \qquad (4.96)$$

Given all the possibilities of merging two trees, choose the one which re-
duces MDL by the greatest amount. If all the proposed merges increase the
MDL, then the algorithm stops and does not accept any merging. If one of
the mergings succeeds, then we have to compute the costs of merging it with
the remaining components and proceed again to choose a new merging. This
process continues until no new merging can drop the MDL cost.
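The MDL advantage of Eq. 4.95 is easy to evaluate for a candidate node merge. The following sketch (ours) compares the advantage of merging two frequently observed nodes with that of merging a frequent node with a rarely observed one; the counts are invented for illustration.

```python
# Minimal sketch: MDL advantage (Eq. 4.95) of merging node u from T_i with v from T_j.
import numpy as np

def h(t):
    t = min(max(t, 1e-12), 1 - 1e-12)
    return -(t * np.log(t) + (1 - t) * np.log(1 - t))

def mdl_advantage(l_u, l_v, m_i, m_j, g=1.31):
    m = m_i + m_j
    theta_u, theta_v = l_u / m, l_v / m            # unmerged sampling probabilities
    theta_uv = (l_u + l_v) / m                     # merged node probability
    return m * (h(theta_u) + h(theta_v) - h(theta_uv)) + 0.5 * np.log(m) + g

# Merging two frequently observed nodes gives a larger advantage than merging
# a frequent node with a rarely observed one.
print(mdl_advantage(l_u=9, l_v=8, m_i=10, m_j=10) >
      mdl_advantage(l_u=9, l_v=1, m_i=10, m_j=10))   # True
```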

4.5.4 Tree Edit-Distance and MDL

Estimating the edit distance, that is, the cost of the optimal sequence of edit operations over nodes and edges, is still an open problem in structural pattern matching. In [156], an interesting link between edit distance and MDL is established. Given the optimal correspondence C, the tree-edit distance between trees t and t' is given by

$$D_{edit}(t,t') = \sum_{u\notin dom(C)} r_u + \sum_{v\notin im(C)} r_v + \sum_{(u,v)\in C} m_{uv} \qquad (4.97)$$

where ru and rv are the costs of removing nodes u and v, respectively, whereas
muv is the cost of matching nodes u and v. In terms of sets of nodes of a tree
N t , the edit distance can be rewritten as
$$D_{edit}(t,t') = \sum_{u\in N^t} r_u + \sum_{v\in N^{t'}} r_v + \sum_{(u,v)\in C}(m_{uv} - r_u - r_v) \qquad (4.98)$$

but what are the costs of ru , rv and muv ? Attending to the MDL criterion,
these costs are the following ones:

$$r_z = (m_t + m_{t'})H(\theta_z) + \frac{1}{2}\log(m_t + m_{t'}) + g, \quad z\in\{u,v\}$$

$$m_{uv} = (m_t + m_{t'})H(\theta_{uv}) + \frac{1}{2}\log(m_t + m_{t'}) \qquad (4.99)$$
Therefore, edit costs are closely related to the information of the variables as-
sociated to the sampling probabilities. Shape clustering examples using these
edit costs and spectral clustering are shown in Fig. 4.23 (bottom). In these

Fig. 4.23. Tree learning. Top: dynamic formation of the tree union. The darker
the node, the higher its sampling probability. Bottom: results of shape clustering
from (a) mixture of attributed trees; (b) weighted edit distance; and (c) union of
attributed trees. Figure by A. Torsello and E.R. Hancock (©2006 IEEE).

experiments, the weighted/attribute versions of both the edit-distance and the


union trees come from generalizing the pure structural approach described
above to include weights in the trees. These weights are associated to nodes,
follow a particular distribution (Gaussian for instance) and influence the prob-
ability of sampling them. Finally, it is important to note that other IT princi-
ples like minimum message length (MML) and other criteria explained throughout the book can drive this approach [155].

Problems
4.1 Distributions in multidimensional spaces
Perform the following experiment using Octave, Matlab, or some other similar
tool. Firstly, generate 100 random points in a 1D space (i.e., 100 numbers).
Use integer values between 0 and 255. Calculate the histogram with 256 bins
and look at its values. Secondly, generate 100 random points in a 2D space
(i.e., 100 pairs of numbers). Calculate the 2D histogram with 256 bins and
look at the values. Repeat the experiment with 3D and then with some high
dimensions. High dimensions cannot be represented, but you can look at the
maximum and minimum values of the bins. What happens to the values of
the histogram? If entropy is calculated from the distributions you estimated,
what behavior would it present in a high-dimensional space? What would you
propose to deal with this problem?
4.2 Parzen window
Look at the formula of the Parzen window’s method (Eq. 4.8). Suppose we
use in it a Gaussian kernel with some definite covariance matrix ψ and we
estimate the distribution of the pixels of the image I (with 256 gray levels).
Would the resulting distribution be the same if we previously smooth the
image with a Gaussian filter? Would it be possible to resize (reduce) the
image in order to estimate a similar distribution? Discuss the differences. You
can perform experiments on natural images in order to notice the differences,
or you can try to develop analytical proofs.
4.3 Image alignment
Look at Fig. 4.2. It represents the values of two different measures for image
alignment. The x and z axes represent horizontal and vertical displacement
of the image and the vertical axis represents the value of the measure. It can
be seen that mutual information yields smoother results than the normalized
cross correlation. Suppose a simpler measure consists in the difference of pixels of the images I and I': $\sum_x\sum_y I_{x,y} - I'_{x,y}$. How would the plot of this measure
look like? Use Octave, Matlab or some other tool to perform the experiments
on a natural image. Try other simple measures and see which one would be
more appropriate for image alignment.
4.4 Joint histograms
In Fig. 4.3, we show the classical way to build a joint histogram. For two
similar or equal images (or sets of samples), the joint histogram has high

values in its diagonal. Now suppose we have two sets of samples generated by
two independent random variables. (If two events are independent, their joint
probability is the product of the prior probabilities of each event occurring by
itself, P (A ∩ B) = P (A)P (B).) How would their joint histogram (see Fig. 4.3)
look?
4.5 The histogram-binning problem
Rajwade et al. proposed some methods for dealing with the histogram-binning
problem. Their methods present a parameter Q, which quantifies the intensity
levels. This parameter must not be confused with the number of bins in clas-
sical histograms. Explain their difference. Why setting Q is much more robust
than setting the number of histogram bins? Think of cases when the clas-
sical histogram of two images would be similar to the area-based histogram.
Think of other cases when both kinds of histograms would present a significant
difference.
4.6 Alternative metrics and pseudometrics
Figure 4.11 represents the concept of mutual information in a Venn diagram.
Draw another one for the ρ(W, X, Y, Z) measure of four variables and shade
the areas corresponding to that measure. Think of all the possible values it
could take. Make another representation for the conditional mutual informa-
tion.
4.7 Entropic graphs
Think of a data set formed by two distributions. If we gradually increase the
distance of these distributions until they get completely separated, the en-
tropy of the data set would also increase. If we continue separating more and
more the distributions, what happens to the entropy value for each one of the
following estimators? (a) Classical histograms plugin estimation; (b) Parzen
window; (c) minimal spanning tree graph-based estimation; and (d) k-nearest
neighbors graph-based estimation, for some fixed k.
4.8 Jensen–Rényi divergence
For the following categorical data sets: S1 = {1, 2, 1},S2 = {1, 1, 3}, and
S3 = {3, 3, 3, 2, 3, 3}, calculate their distributions and then their Jensen–Rényi
divergence, giving S1 and S2 the same weight and S3 the double.
4.9 Unbiased JS divergence for multiple registration
Prove that the JS divergence is unbiased for multiple registration of M shapes,
that is, the JS divergence with respect to the complete set is equal to the JS
divergence with respect to the m << N subsets in which the complete set is
partitioned. A good clue for this task is to stress the proof of JSπ (p1 , . . . pN )−
JSβ (S1 , . . . SN ) = 0, where the Si are the subsets and β must be properly
defined.
4.10 Fisher matrix of multinomial distributions
The Fisher information matrix of multinomial distributions is the key to spec-
ify the Fisher information matrix of a mixture of Gaussians [101]. Actually,

the parameters of the multinomial in this case are the mixing components
(which must form a convex combination – sum 1). Derive analytically the
Fisher matrix for that kind of distributions.
4.11 The α-order metric tensor of Gaussians and others
An alternative tensor to the Fisher one is the α-order metric one. When
α = 2, we obtain the expression in Eq. 4.80. Find an analytical expression of
this tensor for the Gaussian distribution. Do the same with the multinomial
distribution. Compare the obtained tensors with the Fisher–Rao ones.
4.12 Sampling from a tree model
Let Tc be a tree model consisting of one root node with two children. The
individual probability of the root is 1, the probability of the left child is 0.75
and that of the right child is 0.25 (probability is normalized at the same level).
Quantify the probability of observing all the possible subgraphs, including the
complete tree, by following the independent Bernoulli model (Fig. 4.22)
4.13 Computing the MDL of a tree merging process
Given the definition of MDL for trees, compute the sampling probabilities
and MDLs for all steps of the application of the algorithm to the example
illustrated in Fig. 4.22. Compute the MDL advantage in each iteration of the
algorithm.
4.14 Computing the MDL edit distance
Compute the MDL edit distance for all the matchings (mergings) illustrated
in Fig. 4.22.
4.15 Extending MDL for trees with weights
As we have explained along the chapter, it is possible to extend the pure struc-
tural framework to an attributed one, where attributes are assigned to nodes.
Basically this consists of modeling a weight distribution function, where most
relevant nodes have associated larger weights. From the simple assumption
that weights are Gaussian distributed, the probability of a given weight can
be defined as
$$P_w(w|\mu_{c,i},\sigma_{c,i}) = \begin{cases}\dfrac{1}{\theta_{c,i}\,\sigma_{c,i}\sqrt{2\pi}}\,e^{-\frac{1}{2}\frac{(w-\mu_{c,i})^2}{\sigma_{c,i}^2}} & \text{if } w \geq 0\\ 0 & \text{otherwise}\end{cases}$$

and the sample probability $\theta_{c,i}$ is the integral of the distribution over positive weights:

$$\theta_{c,i} = \int_0^{\infty}\frac{1}{\sigma_{c,i}\sqrt{2\pi}}\,e^{-\frac{1}{2}\frac{(w-\mu_{c,i})^2}{\sigma_{c,i}^2}}\,dw$$
Given the latter definitions, modify the log-likelihood by multiplying the
Bernoulli probability by Pw (.|.) only in the positive case (when we consider
θc,i ). Obtain a expression for the MDL criterion under these new conditions.
Hints: consider both the change of the observation probability and the addi-
tion of two new variables per node (which specify the Gaussian for its weight).

4.6 Key References

• P. Viola and W. M. Wells-III. “Alignment by Maximization of Mutual


Information”. 5th International Conference on Computer Vision 24(2):
137–154 (1997)
• J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever. “Mutual-
information-based Registration of Medical Images: A Survey”. IEEE
Transactions on Medical Imaging 22(8): 986–1004 (2003)
• J. Zhang and A. Rangarajan. “Affine Image Registration Using a New
Information Metric”. IEEE Conference on Computer Vision and Pattern
Recognition 1: 848–855 (2004)
• A. Rajwade, A. Banerjee and A. Rangarajan. “Probability Density Es-
timation Using Isocontours and Isosurfaces: Application to Information
Theoretic Image Registration”. IEEE Transactions on Pattern Analysis
and Machine Intelligence 31(3): 475–491 (2009)
• A. Ben Hamza and H. Krim. “Image Registration and Segmentation by
Maximizing the Jensen-Rényi Divergence”. Energy Minimization Methods
in Computer Vision and Pattern Recognition – LNCS 2683: 147–163 (2001)
• H. Neemuchwala, A. Hero, and P. Carson. “Image Registration in High
Dimensional Space”, International Journal on Imaging Systems and Tech-
nology 16(5): 130–145 (2007)
• A. Peter and A. Rangarajan. “Information Geometry for Landmark
Shape Analysis: Unifying Shape Representation and Deformation”.
IEEE Transactions on Pattern Analysis and Machine Intelligence 31(2):
337–350 (2009)
• F. Wang, B. Vemuri, A. Rangarajan, I. M. Schmalfuss, and
S.J. Eisenschenck. “Simultaneous Registration of Multiple Point-Sets
and Atlas Construction”. IEEE Transactions on Pattern Analysis and
Machine Intelligence 30(11): 2011–2022 (2008)
• A. Torsello and E. R. Hancock. “Learning Shape-Classes Using a Mixture
of Tree-Unions”. IEEE Transactions on Pattern Analysis and Machine
Intelligence 28(6): 954–967 (2006)
5
Image and Pattern Clustering

5.1 Introduction
Clustering, or grouping samples which share similar features, is a recurrent
problem in computer vision and pattern recognition. The core element of a
clustering algorithm is the similarity measure. In this regard information the-
ory offers a wide range of measures (not always metrics) which inspire clus-
tering algorithms through their optimization. In addition, information theory
also provides both theoretical frameworks and principles to formulate the clus-
tering problem and provide effective algorithms. Clustering is closely related
to the segmentation problem, already presented in Chapter 3. In both prob-
lems, finding the optimal number of clusters or regions is a challenging task.
In the present chapter we cover this question in depth. To that end we explore
several criteria for model order selection.
All the latter concepts are developed through the description and discus-
sion of several information theoretic clustering algorithms: Gaussian mixtures,
Information Bottleneck, Robust Information Clustering (RIC) and IT-based
Mean Shift. At the end of the chapter we also discuss basic strategies to form
clustering ensembles.

5.2 Gaussian Mixtures and Model Selection

5.2.1 Gaussian Mixtures Methods

A probability mixture model is a probability distribution which is the result


of the combination of other probability distributions. Mixture models, in par-
ticular those formed by Gaussian distributions (or kernels), are widely used in
areas involving statistical modeling of data. In statistical pattern recognition,
mixture models allow a formal approach to clustering [82]. In traditional clus-
tering methods, different heuristics (e.g. k-means) or agglomerative methods


are used [81], while mixture models have a set of parameters which can be
adjusted in a formal way. The estimation of these parameters is a clustering,
provided that each sample belongs to some kernel of the set of kernels in the
mixture. In Bayesian supervised learning the mixture models are used for rep-
resentation of class-conditional probability distributions [72] and for Bayesian
parameter estimation [46].

5.2.2 Defining Gaussian Mixtures

In a mixture model the combination of distributions has to be convex, that is,


a linear combination with non-negative weights. Let us define a d-dimensional
random variable y which follows a finite mixture distribution. Its probability
density function (pdf) p(y|Θ) is described by a weighted sum of kernels. These
kernels are known distributions, which, in the case of Gaussian mixtures,
consist of Gaussian distributions.

$$p(y|\Theta) = \sum_{i=1}^{K}\pi_i\, p(y|\Theta_i) \qquad (5.1)$$

where $0 \leq \pi_i \leq 1$, $i = 1, \ldots, K$, and $\sum_{i=1}^{K}\pi_i = 1$,

where K is the number of kernels, π1 , ..., πk are the a priori probabilities of each
kernel, and Θi are the parameters describing the kernel. In Gaussian mixtures
these parameters are the means and the covariance matrices, Θ_i = {μ_i, Σ_i}. Fig. 5.1 shows an example of 1D data following a Gaussian mixture distribution formed by K = 3 kernels, each one with different parameters.
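For illustration (ours, with toy parameters loosely inspired by Fig. 5.1; all names and values are assumptions of this example), the following sketch samples from a 1D Gaussian mixture and evaluates the pdf of Eq. 5.1.

```python
# Minimal sketch: sampling from and evaluating a 1D Gaussian mixture (Eq. 5.1).
import numpy as np

def sample_gmm(n, pis, mus, sigmas, rng):
    ks = rng.choice(len(pis), size=n, p=pis)        # pick a kernel per sample
    return rng.normal(np.array(mus)[ks], np.array(sigmas)[ks])

def gmm_pdf(y, pis, mus, sigmas):
    y = np.atleast_1d(y)[:, None]
    comp = np.exp(-0.5 * ((y - np.array(mus)) / np.array(sigmas)) ** 2) \
           / (np.sqrt(2 * np.pi) * np.array(sigmas))
    return comp @ np.array(pis)

rng = np.random.default_rng(0)
pis, mus, sigmas = [0.3, 0.5, 0.2], [15.0, 25.0, 33.0], [1.5, 2.5, 1.0]
y = sample_gmm(2000, pis, mus, sigmas, rng)
print(y.mean())                                      # close to sum(pi_k mu_k) = 23.6
print(gmm_pdf([15.0, 25.0, 40.0], pis, mus, sigmas)) # high, high, near zero
```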
Thus, the whole set of parameters of a mixture is Θ≡{Θ1 , ..., Θk , π1 , ..., πk }.
Obtaining the optimal set of parameters Θ∗ is usually posed in terms of max-
imizing the log-likelihood of the pdf to be estimated:


$$\ell(Y|\Theta) = \log p(Y|\Theta) = \sum_{n=1}^{N}\log p(y_n|\Theta) = \sum_{n=1}^{N}\log\sum_{k=1}^{K}\pi_k\, p(y_n|\Theta_k). \qquad (5.2)$$

where Y = {y1 , ..., yN } is a set of N independent and identically distributed


(i.i.d.) samples of the variable Y . The Maximum-Likelihood (ML) estimation


$$\Theta_{ML} = \arg\max_{\Theta}\ \ell(\Theta). \qquad (5.3)$$


Fig. 5.1. Data of one dimension whose distribution can be approximated by a


Gaussian mixture with three kernels of different means and variances.

cannot be determined analytically. The same happens with the Bayesian Max-
imum a Posteriori (MAP) criterion

$$\Theta_{MAP} = \arg\max_{\Theta}\big(\ell(\Theta) + \log p(\Theta)\big). \qquad (5.4)$$

In this case other methods are needed, like expectation maximization (EM)
or Markov-chain Monte Carlo algorithms.

5.2.3 EM Algorithm and Its Drawbacks

The expectation maximization algorithm is widely used for fitting mixtures


to data [49, 112]. It makes it possible to find maximum-likelihood solutions to problems in which there are hidden variables. In the case of Gaussian mixtures, these variables are a set of N labels Z = {z^1, ..., z^N} associated to the samples. Each label is a binary vector $z^i = [z_1^{(i)}, ..., z_K^{(i)}]$, with $z_m^{(i)} = 1$ and $z_p^{(i)} = 0$ if $p \neq m$, indicating that $y^{(i)}$ has been generated by the kernel m. The log-likelihood of the complete set of data X = {Y, Z} is

$$\log p(Y,Z|\Theta) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_k^n\log[\pi_k\, p(y_n|\Theta_k)]. \qquad (5.5)$$

The EM algorithm is an iterative procedure. It generates a sequence of esti-


mations of the set of parameters {Θ∗ (t), t = 1, 2, ...} by alternating two steps:
the expectation step and the maximization step, until convergence is achieved.
A detailed description of the algorithm is given in [136] and an illustration of
its results is given in Fig. 5.2.
160 5 Image and Pattern Clustering

Fig. 5.2. The classical EM is prone to random initialization. The algorithm con-
verges to different local minima in different executions.

E Step

It consists of estimating the expected value of the hidden variables given


the visible data Y and current estimation of the parameters Θ∗ (t). Such an
expectation can be expressed in the following way:

πk∗ (t)p(y (n) |Θk∗ (t))


E[zk |y, Θ∗ (t)] = P [zk = 1|y, Θ∗ (t)]) =
(n) (n)
, (5.6)
Σj=1 πj∗ (t)p(y (n) |Θk∗ (t))
K

Thus, the probability of generating yn with the kernel k is given by

πk p(y(n) |k)
p(k|yn ) = (5.7)
Σj=1 πj p(y(n) |k)
K

M Step

Given the expected Z, the new parameters Θ∗ (t + 1) are given by

1 
N
πk = p(k|yn ), (5.8)
N n=1
N
n=1 p(k|yn )yn
μk = N , (5.9)
n=1 p(k|yn )
N
n=1 p(k|yn )(yn − μk )(yn − μk )T
Σk = N , (5.10)
n=1 p(k|yn )

The EM algorithm results depend very much on the initialization and


it usually converges to some local maximum of the log-likelihood function,
5.2 Gaussian Mixtures and Model Selection 161

which does not ensure that the pdfs of the data are properly estimated. In
addition, the algorithm requires that the number of elements (kernels) in the
mixture is known beforehand. A maximum-likelihood criterion with respect
to the number of kernels is not useful because it tends to fit one kernel for
each sample.

5.2.4 Model Order Selection

The model order selection problem consists of finding the most appropriate
number of clusters in a clustering problem or the number of kernels in the
mixture. The number of kernels K is unknown beforehand and cannot be
estimated through maximizing the log-likelihood because (Θ) grows with K.
In addition, a wrong K may drive the EM toward an erroneous estimation. In
Fig. 5.3 we show the EM result on the parameters of a mixture with K = 1
describing two Gaussian distributions.
The model order selection problem has been addressed in different ways.
There are algorithms which start with a few number of kernels and add new
kernels when necessary. Some measure has to be used in order to detect when
some kernel does not fit well the data, and a new kernel has to be added. For
example, in [172], a kernel is split or not, depending on the kurtosis measure,
used as a measure of non-Gaussianity. Other model order selection methods
start with a high number of kernels and fuse some of them. In [55, 56, 59]
the EM algorithm is initialized with many kernels randomly placed and then
the minimum description length (MDL) principle [140] is used to iteratively
remove some of the kernels until the optimal number of them is achieved.
Some other approaches are used both to split and fuse kernels. In [176] a
general statistical learning framework called de Bayesian Ying-Yang system is
proposed and suitable for using for model selection and unsupervised learning
of finite mixtures. Other approaches combine EM and genetic algorithms for
learning mixtures, using an MDL criterion for finding the best K.


− −
− −

Fig. 5.3. Two Gaussians with averages μ1 = [0, 0] y μ2 = [3, 2] (left) are erroneously
described by a unique kernel with μ = [1.5, 1] (right). (Figure by Peñalver et al. [116]
(2009
c IEEE)).
162 5 Image and Pattern Clustering

In the following sections we present the entropy-based expectation maxi-


mization (EBEM) algorithm. This algorithm starts with one kernel and itera-
tively splits it until achieving the optimal number of kernels. It avoids the ini-
tialization problems which EM has, and uses as a criterion the “Gaussianity”
of the data which are fit by the kernels of the mixture.

5.3 EBEM Algorithm: Exploiting Entropic Graphs


The EBEM (entropy-based EM) algorithm [116] starts with only one kernel,
and finds the maximum-likelihood solution. In each iteration it tests whether
the underlying pdf of each kernel is Gaussian. If not, it replaces that kernel
with two kernels adequately separated from each other. After the kernel lower
Gaussianity has been split into two, new EM steps are performed in order to
obtain a new maximum-likelihood solution for the new number K of kernels.
The algorithm is initialized with a unique kernel whose parameters of average
and covariance are given by the sample. Consequently, the algorithm is not
prone to initialization, overcoming the local convergence of the usual EM
algorithm.
The EBEM algorithm converges after a few iterations and is suitable for
density estimation, pattern recognition and unsupervised color image segmen-
tation. The only parameter which has to be set is the Gaussianity threshold.
It is more versatile that just fixing the number of kernels beforehand. We
may know how well we want our data to fit the Gaussian distributions of the
mixture, but we may not necessarily know how many clusters form the data.
For instance, in the color image segmentation context, one may assume that
in a image sequence of the same environment, the Gaussianity threshold may
be nearly constant whereas the number of kernels will be different for each
frame.

5.3.1 The Gaussianity Criterion and Entropy Estimation

Recall that for a discrete variable Y = {y1 , ..., yN } with N values the entropy
is defined as

N
H(Y ) = −Ey [log(P (Y ))] = − P (Y = yi ) log P (Y = yi ). (5.11)
i=1

A fundamental result of information theory is the “2nd Gibbs Theorem,”


which states that Gaussian variables have the maximum entropy among all
the variables with equal variance [88]. Then, the entropy of the underlying
distribution of a kernel should reach the maximum when the distribution is
Gaussian. This theoretical maximum entropy is
1
Hmax (Y ) = log[(2πe)d |Σ|]. (5.12)
2
5.3 EBEM Algorithm: Exploiting Entropic Graphs 163

The comparison of the Gaussian entropy with the entropy of the underly-
ing data is the criterion which EBEM uses for deciding whether a kernel is
Gaussian or it should be replaced by other kernels which fit better the data.
In order to evaluate the Gaussianity criterion, the Shannon entropy of
the data has to be estimated. Several approaches to Shannon entropy estima-
tion have been studied in the past. We can widely classify them as methods
which first estimate the density function, and methods which by-pass this and
directly estimate the entropy from the set of samples. Among the methods
which estimate the density function, also known as “plug-in” entropy estima-
tors [13], there is the well-known “Parzen windows” estimation. Most of the
current nonparametric entropy and divergence estimators belong to the “plug-
in” methods. They have several limitations. On the one hand the density es-
timator performance is poor without smoothness conditions. The estimations
have high variance and are very sensitive to outliers. On the other hand, the
estimation in high-dimensional spaces is difficult, due to the exponentially in-
creasing sparseness of the data. For this reason, entropy has traditionally been
evaluated in one (1D) or two dimensional (2D) data. For example, in image
analysis, traditionally gray scale images have been used. However, there are
datasets whose patterns are defined by thousands of dimensions.
The “nonplugin” estimation methods offer a state-of-the-art [103] alterna-
tive for entropy estimation, estimating entropy directly from the data set. This
approach allows us to estimate entropy from data sets with arbitrarily high
number of dimensions. In image analysis and pattern recognition the work of
Hero and Michel [74] is widely used for Rényi entropy estimation. Their meth-
ods are based on entropic spanning graphs, for example, minimal spanning
trees (MSTs) or k-nearest neighbor graphs. A drawback of these methods for
the EBEM algorithm is that the methods based on entropic spanning graphs
do not estimate Shannon entropy directly. In the work of Peñalver et al. [116]
they develop a method for approximating the value of Shannon entropy from
the estimation of Rényi’s α-entropy, as explained in the following subsection.

5.3.2 Shannon Entropy from Rényi Entropy Estimation

In Chapter 4 (in Section 4.3.6) we explained how Michel and Hero estimate
the Rényi’s α-entropy from a minimum spanning tree (MST). Equation 4.32
showed how to obtain the Rényi entropy of order α, that is, Hα (Xn ), directly
from the samples Xn , by means of the length of the MST. There is a disconti-
nuity at α = 1 which does not allow us to use this
 value of α in the expression
of the Rényi entropy: Hα (f ) = 1/(1 − α) log z f α (z) dz. It is obvious that
for α = 1 there is a division by zero in 1/(1 − α). The same happens in the
expression of the MST approximation (Eq. 4.32). However, when α → 1, the
α-entropy converges to the Shannon entropy:

lim Hα (p) = H(p) (5.13)


α→1
164 5 Image and Pattern Clustering

The limit can be calculated using L’Hôpital’s rule. Let f (z) be a pdf of z. Its
Rényi entropy, in the limit α → 1 is

log z f α (z) dz
lim Hα (f ) = lim
α→1 α→1 1−α
 1
In α = 1 we have that log z f (z) dz = log 1 = 0 (note that f (z) is a pdf,
then its integral over z is 1). This, divided by 1 − α = 0 is an indetermination
of the type 00 . By L’Hôpital’s rule we have that if

lim g(x) = lim h(x) = 0,


x→c x→c

then
g(x) g  (x)
lim = lim  .
x→c h(x) x→c h (x)

Substituting the expression of the limit of the Rényi entropy:


 ∂
 α
log z f α (z) dz ∂α (log z f (z) dz)
lim = lim
α→1 1−α α→1
∂α (1 − α)


The derivate of the divisor is ∂α (1 − α) = −1, and the derivate of the divi-
dend is
   
∂ 1 ∂ z f α (z) dz
α
log f (z) dz = 
∂α z f α (z) dz ∂α
z

1
=  α f α (z) log f (z) dz.
z
f (z) dz z

The first term of this expression goes to 1 in the limit because f (z) is a pdf:
 α
f (z) log f (z) dz
lim Hα (f ) = lim − z  α
α→1 α→1 f (z) dz
 1 z
f (z) log f (z) dz
=− z
 1
=− f (z) log f (z) dz ≡ H(f )
z

Then, in order to obtain a Shannon entropy approximation from α-entropy,


α must have a value close to 1. It will be convenient to have a value strictly
less than 1, as for α > 1 the Rényi entropy is no more concave, as shown in
Fig. 4.12. The problem is which value close to 1 is the optimal, given a set of
samples.
The experiments presented in [116] show that it is possible to model Hα
as a function of α, independently on the on the size and nature of the data.
The function is a monotonical decreasing one, as shown in Fig. 5.4 (left)
5.3 EBEM Algorithm: Exploiting Entropic Graphs 165
Hs 0,999

α*
Hα 5,0 0,997
4,5 0,995
4,0 0,993
3,5 0,991
3,0 0,989
2,5 0,987
2,0 0,985
0,983
1,5
0,981
1,0
0,979
0,5 0,977
0,0 0,975
0,0 0,2 0,4 0,6 0,8 1,0
α 0 50 100 150 200 250 300 350 400 450 500 550 600 650 700 750 800 850
N

Fig. 5.4. Left: Hα for Gaussian distributions with different covariance matrices.
Right: α∗ for dimensions between 2 and 5 and different number of samples.

Hs
Hα 5,0
4,5
4,0
3,5
3,0
2,5
2,0
1,5
1,0
0,5
0,0
0,0 0,2 0,4 0,6 0,8 1,0 α

Fig. 5.5. The line tangent to Hα in the point α∗ gives the Shannon entropy ap-
proximated value at α = 1.

(experimental results). For any point of this function, a tangent straight line
y = mx + b can be calculated. This tangent is a continuous function and can
give us a value at its intersection with α = 1, as shown in Fig. 5.5. Only one
of all the possible tangent lines is the one which gives us a correct estimation
of the Shannon entropy at α = 1; let us say that this line is tangent to the
function at some point α∗ . Then, if we know the correct α∗ , we can obtain
the Shannon entropy estimation. As Hα is a monotonous decreasing function,
the α∗ value can be estimated by means of a dichotomic search between two
well-separated values of α, for a constant number of samples and dimensions.
It has been experimentally verified that α∗ is almost constant for diago-
nal covariance matrices with variance greater than 0.5, as shown in Fig. 5.4
(right). Figure 5.6 shows the estimation of α∗ for pdfs with different covariance
matrices and 400 samples.
Then, the optimal α∗ depends on the number of dimensions D and samples
N . This function can be modeled as
a + b · ecD
α∗ = 1 − , (5.14)
N
and its values a, b and c have been calibrated for a set of 1,000 distributions
with random 2 ≤ d ≤ 5 and number of samples. The resulting function is
166 5 Image and Pattern Clustering

Fig. 5.6. α∗ 2D for different covariance values and 400 samples. Value remains
almost constant for variances greater than 0.5.

Eq. 5.14 with values a = 1.271, b = 1.3912 and c = −0.2488, as reported


in [116]. The function is represented in Fig. 5.4 (right).

5.3.3 Minimum Description Length for EBEM

It can be said that the EBEM algorithm performs a model order selection,
even though it has to be tuned with the Gaussianity deficiency threshold.
This threshold is a parameter which does not fix the order of the model. It
determines the degree of fitness of the model to the data, and with a fixed
threshold different model orders can result, depending on the data. However,
it might be argued that there is still a parameter which has to be set, and
the model order selection is not completely solved. If there is the need not
to set any parameter manually, the minimum description length principle can
be used.
The minimum description length principle (see Chapter 3) chooses from
a set of models the representation which can be expressed with the shortest
possible message. The optimal code-length for each parameter is 1/2 log n,
asymptotically for large n, as shown in [140]. Then, the model order selection
criterion is defined as:

N (k)
CM DL (Θ(k) , k) = −L(Θ(k) , y) + log n, (5.15)
2
where the first term is the log-likelihood and the second term penalizes an ex-
cessive number of components, N (k) being the number of parameters required
to define a mixture of k kernels.
5.3 EBEM Algorithm: Exploiting Entropic Graphs 167

5.3.4 Kernel-Splitting Equations

When the kernel K∗, with lowest Gaussianity, has to be split into the K1 and
K2 kernels (components of the mixture), their parameters Θk1 = (μk1 , Σk1 )
and Θk2 = (μk2 , Σk2 ) have to be set. The new covariance matrices have two
restrictions: they must be definite positive and the overall dispersion must
remain almost constant:
π∗ = π1 + π2
π∗ μ∗ = π1 μ1 + π2 μ2 (5.16)
π∗ (Σ∗ + μ∗ μT∗ ) = π1 (Σ1 + μ1 μT1 ) + π2 (Σ2 + μ2 μT2 )

These constraints have more unknown variables than equations. In [48] they
perform a spectral decomposition of the actual covariance matrix and they
estimatethe new eigenvalues and eigenvectors of new covariance matrices.
T
 Let ∗ = V∗ Λ∗ V∗ be 1the spectral d
decomposition of the covariance matrix
∗ , with Λ
∗ = diag(λj∗ , ..., λj∗ ) a diagonal matrix containing the eigen-
values of ∗ with increasing order, ∗ the component with the lowest entropy
ratio, π∗ , π1 , π2 
the 
priors
 of both original and new components, μ∗ , μ1 , μ2
the means and ∗ , 1 , 2 the covariance matrices. Let also be D a d × d
rotation matrix with columns orthonormal unit vectors. D is constructed by
generating its lower triangular matrix independently from d(d − 1)/2 different
uniform U (0, 1) densities. The proposed split operation is given by

π1 = u1 π∗
π2 = (1 − u1 )π∗
d  
μ1 = μ∗ − ( i=1 ui2 λi∗ V∗i ) ππ21
d  
μ2 = μ∗ − ( i=1 ui2 λi∗ V∗i ) ππ12 (5.17)
Λ1 = diag(u3 )diag(ι − u2 )diag(ι + u2 )Λ∗ ππ∗1
Λ2 = diag(ι − u3 )diag(ι − u2 )diag(ι + u2 )Λ∗ ππ∗2
V1 = DV∗
V2 = D T V∗

where, ι is a d × 1 vector of ones, u1 , u2 = (u12 , u22 , ..., ud2 )T and u3 =


(u13 , u23 , ..., ud3 )T are 2d + 1 random variables needed to construct priors, means
and eigenvalues for the new component in the mixture and are calculated as

u1 ∼ be(2, 2), u12 ∼ β(1, 2d),


(5.18)
uj2 ∼ U (−1, 1), u13 ∼ β(1, d), uj3 ∼ U (0, 1)

with j = 2, ..., d. and β() a Beta distribution.


The splitting process of a kernel is graphically described in a 2D exam-
ple, in Fig. 5.7, where it can be seen that the directions and magnitudes of
168 5 Image and Pattern Clustering

V2* V1*

u22 l2*
u12 l*
1

Fig. 5.7. Two-dimensional example of splitting one kernel into two new kernels.

Algorithm 6: Entropy based EM algorithm


ebem algorithm
Initialization: Start with a unique kernel.
K ← 1.
Θ1 ← {μ1 , Σ1 } given by the sample.
repeat:
repeat
E Step
M Step
Estimate log-likelihood in iteration i: i
until: |i − i−1 | < convergence th
Evaluate H(Y ) and Hmax (Y ) globally
if (H(Y )/Hmax < entropy th)
Select kernel K∗ with the lowest ratio and
decompose into K1 and K2
Initialize parameters Θ1 and Θ2 (Eq. 5.17)
Initialize new averages: μ1 , μ2
Initialize new eigenvalues matrices: Λ1 , Λ2
Initialize new eigenvector matrices: V1 and V2
Set new priors: π1 and π2
else
Final ← True
until: Final = True

variability are defined by eigenvectors and eigenvalues of the covariance ma-


trix. A description of the whole EBEM algorithm can be seen in Alg. 6.

5.3.5 Experiments

One of the advantages of the EBEM algorithm is that it starts from K = 1


components. It keeps on dividing kernels until necessary, and needs no back-
ward steps to make any correction, which means that it does not have the
5.3 EBEM Algorithm: Exploiting Entropic Graphs 169
K=1 K=2 K=3
7 7 7

4 4 4

2.5 2.5 2.5

0 0 0

−1 −1 −1
−1 0 3.0 5 7 −1 0 3.0 5 7 −1 0 3.0 5 7
K=4 K=5
7 7

11000

4 4
10000

9000
2.5 2.5
8000

7000
0 0
6000

−1 −1 5000
−1 0 3.0 5 7 −1 0 3.0 5 7 1 2 3 4 5 6 7 8 9 10

Fig. 5.8. Evolution of the EBEM algorithm from one to five final kernels. After
iteration the algorithm finds the correct number of kernels. Bottom-right: Evolution
of the cost function with MDL criterion.

need to join kernels again. Thus, convergence is achieved in very few itera-
tions. In Fig. 5.8 we can see the evolution of the algorithm on synthetic data.
The Gaussianity threshold was set to 0.95 and the convergence threshold of
the EM algorithm was set to 0.001.
Another example is shown in Fig. 5.9. It is a difficult case because there
are some overlapping distributions. This experiment is a comparison with the
MML-based method described in [56]. In [56] the algorithm starts with 20
kernels randomly initialized and finds the correct number of components after
200 iterations. EBEM starts with only one kernel and also finds the correct
number of components, with fewer iterations. Figure 5.9 shows the evolution
of EBEM in this example. Another advantage of the EBEM algorithm is that
it does not need a random initialization.
Finally we also show some color image segmentation results. Color segmen-
tation can be formulated as a clustering problem. In the following experiment
the RGB color space information is taken. Each pixel is regarded to as a sam-
ple defined by three dimensions, and no position or vicinity information is
used. After convergence of the EBEM we obtain the estimated number K of
color classes, as well as the membership of the individual pixels to the classes.
The data come from natural images with sizes of 189×189 pixels. Some results
are represented in Fig. 5.10.
170 5 Image and Pattern Clustering

Fig. 5.9. Fitting a Gaussian mixture with overlapping kernels. The algorithm starts
with one component and selects the order of the model correctly.

5.4 Information Bottleneck and Rate Distortion Theory


In this section the Information Bottleneck method is discussed. The method
may be considered as an improvement over a previous clustering method based
on the Rate Distortion Theory, that will be also explained here. The main idea
behind Information Bottleneck is that data can be compressed (or assigned to
clusters) while preserving important or relevant information, and it has been
succesfully applied to computer vision, in the context of object categorization,
as long as to other pattern classification problems.

5.4.1 Rate Distortion Theory Based Clustering

When partitioning the data in clusters, a representation for each of these


clusters must be chosen; it may be the centroid of the cluster or a random
element contained in it. In order to measure the goodness of the representation
of a cluster with respect to the data in it, a distortion measure must be
defined. Ideal clustering would be achieved in the case of minimal distortion
between the data and their corresponding representations. However, as can
be seen in Fig. 5.11, distortion is inversely proportional to the complexity
of the obtained data model, and as a consequence, data compression (and
thus generalization) is lower as distortion decreases. Rate Distortion Theory
is a branch of information theory addressed to solve this kind of problems:
5.4 Information Bottleneck and Rate Distortion Theory 171

Fig. 5.10. Color image segmentation results. Original images (first column) and
color image segmentation with different Gaussianity deficiency levels (second and
third columns). See Color Plates. (Courtesy of A. Peñalver.)

given a minimal expected distortion, which is the maximum possible data


compression that we can achieve? Or, in other words, how many clusters are
needed to represent these data?
172 5 Image and Pattern Clustering

I(T;X)
1
0.95
0.9
0.85
0.8 D constraint

0.75
0.7
0.65
0.6
0.55
0.5
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Ed(X,T)

Fig. 5.11. Left: the effect of data compression on distortion. If T = X (there is


a cluster for each sample), there is zero distortion, but data representation is not
compact. In this case, I(T ; X) = H(X). By the other hand, if |T | = 1 (there is
an only cluster), we obtain a very compact representation of the data, yielding
I(T ; X) = 0; however, distortion is also very high. Right: plot representing the
relation between model complexity I(T ; X) and distortion Ed (X, T ), just introduced,
and an upper bound of D (the vertical slashed line).

Given the partition p(t|x) of a dataset x ∈ X in t ∈ T clusters, its quality


may be measured by

• The complexity of the model I(T ; X) or rate; that is, the information that
a representative gives of thedata in a cluster
• The distortion Ed (X, T ) = ijp(xi )p(tj |xi )d(xi , tj ); that is, the distance
from the data to its representation

As Fig. 5.11 shows, if an upper bound constraint D on the expected func-


tion is given, lower D values could be used, relaxing the distortion constraint
and providing stronger compression. Therefore, Rate Distortion Theory can
be expressed as a the problem of searching the most compact model p(t|x)
given the distortion constraint D:

R(D) = min I(T ; X) (5.19)


{p(t|x):Ed (X,T )≤D}

and this problem can be solved minimizing the following function, where β is
a Lagrange multiplier applied for weighting the distortion term:

F[p(t|x)] = I(T ; X) + βEd (X, T ) (5.20)

Algorithm 7 shows the Blahut–Arimoto algorithm to minimize F, based


on alternating steps to refine p(t|x) and p(t), since both are unknown and each
depends on the other. The normalization factor Z(x, β) ensures that the con-
straint p(t|x) = 1, ∀x ∈ X is satisfied. Divergence is measured by means of
t
5.4 Information Bottleneck and Rate Distortion Theory 173

Algorithm 7: Blahut–Arimoto algorithm


Input: p(x), T , β
Initialize random init p(t)
while no convergence do
p(t)
p(t|x) = Z(x,β) exp−βd(x,t)

p(t) = p(x)p(t|x)
x
end
Output: A clustering p(t|x) that minimizes F for a minimal expected
distortion given by β

a distance function d(x, t) between a data sample and its representative. The
algorithm runs until convergence is achieved. Several examples are given in
Fig. 5.12. This algorithm guarantees that the global minimum is reached and
obtains a compact clustering with respect to the minimal expected distortion,
defined by parameter β. The main problem is that it requires two input pa-
rameters: β and the number of clusters, that must be fixed. However, as can
be seen in Fig. 5.12, depending on distortion the algorithm may yield p(t) = 0
for one or more clusters; that is, this input parameter actually indicates the
maximum number of clusters. Slow convergence for low β values is another
drawback of this algorithm. Finally, result will be conditioned by the initial
random initialization of p(t).

5.4.2 The Information Bottleneck Principle

The main drawback of the Blahut–Arimoto clustering algorithm, and the Rate
Distortion Theory, is the need of defining first a distortion function in order
to compare data samples and representatives. In our previous example, this
distortion was computed as the euclidean distance between the sample and its
corresponding representative. However, in complex problems, the important
features of the input signal that define the distortion function may not be
explicitly known. However, we may be provided with an additional random
variable that helps to identify what is relevant information in the data samples
with respect to this new variable in order to avoid loosing too much of this
information when clustering data. One example is the text categorization
problem. Given a set of words x ∈ X and a set of text categories y ∈ Y , the
objective is to classify a text as belonging to a category yi depending on the
subset of words of X present in the document and their frequency. A common
approach to this problem begins with the splitting of the set X into t ∈ T
clusters, and continues with a learning step that builds document category
models from these clusters; then, any new document is categorized from the
174 5 Image and Pattern Clustering

1000
900
800
700
600
500
400
300
200
100
0
0 100 200 300 400 500 600 700 800 900 1000

1000
900
800
700
600
500
400
300
200
100
0
0 100 200 300 400 500 600 700 800 900 1000

1000
900
800
700
600
500
400
300
200
100
0
0 100 200 300 400 500 600 700 800 900 1000

Fig. 5.12. Example of application of the Blahut–Arimoto algorithm to a set X of


1,000 data samples using β = 0.009, β = 0.01 and β = 1, obtained from a uniform
distribution, being |T | = 20 (in the plots, cluster centroids are represented by cir-
cles). In all cases, p(x) = 1/1,000, ∀x ∈ X. Lower β values decrease the weight of the
distortion term in the rate distortion function F ; as a consequence, the influence of
the distortion term during energy minimization decreases, and the data compression
increases, due to the fact that it is partitioned into less data clusters.

frequency histogram that represents the amount of each cluster that is present
in this document. In this case X and Y are not independent variables, meaning
that I(X; Y ) > 0.
5.4 Information Bottleneck and Rate Distortion Theory 175

Fig. 5.13. The Information Bottleneck principle applied to text categorization. A


text is categorized into a given category y ∈ Y depending on the words x ∈ X
that are present in it and their frequency. In order to increase efficiency, the words
that may be considered during the task are splitted into different clusters t ∈ T .
The objective of Information Bottleneck is to achieve a maximal data compression
(given by I(T ; X)) while maximizing the amount of relevant information of X about
Y (given by I(T ; Y )) that the data compression preserves. This I(T ; Y ) should be
as similar to I(X; Y ) as possible, I(X; Y ) being the amount of information that the
original set X gives about Y .

As stated before, a measure of the quality of this clusterization is given


by I(X; T ). However, another constraint is necessary, due to the fact that if
the data compression is indefinitely increased, then the information that X
provides about Y is lost. Unlike in Rate Distortion Theory, this additional
restriction is not given by distortion for a fixed |T |. Information Bottleneck
is focused on searching an optimal representation T of X so that predicitions
of Y from T are the most similar possible to direct predictions of Y from X,
while minimizing |T |. That is, the objective is to find an equilibrium between
compression, given by I(X; T ), and information of clusters about Y , given by
I(T ; Y ). Figure 5.13 illustrates this concept applied to the text categorization
problem. The equilibrium is reached by minimizing the following function:

L[p(t|x)] = I(T ; X) − βI(T ; Y ) (5.21)

Algorithm 8 is a generalization of the Blahut–Arimoto algorithm. The


output is a clustering p(t|x) that fulfills the previously indicated constraints.
However, due to being based on Blahut–Arimoto, some of the drawbacks of
this algorithm are still present: it needs two parameters (the number of clusters
and β) and its convergence to the solution is also slow. Furthermore, the
algorithm may tend to a local minimum, resulting in a suboptimal solution.
It must also be noted that in this algorithm the divergence measure is replaced
by the Kullback–Leibler divergence.
In Fig. 5.14 we can see a graph representing Ix = I(X; T ) vs. Iy = I(T ; Y )
for different values of |T |. These results were obtained from the input data
176 5 Image and Pattern Clustering

Algorithm 8: Information Bottleneck based on a generalization of the


Blahut–Arimoto algorithm
Input: p(x, y), T , β
Initialize random init p(t|x)
while no convergence do
p(t)
p(t|x) = Z(x,β exp−βDKL [p(y|x)||p(y|t)]

1
p(y|t) = p(t) p(y|x)p(t|x)
 x
p(t) = p(x)p(t|x)
x
end
Output: A clustering p(t|x) that minimizes L for a given β

0.25
⎟ T⎟=3
⎟ T⎟=13
0.2
⎟ T⎟=23

0.15
I(T;Y)

0.1

0.05

0
0 0.2 0.4 0.6 0.8 1
I(T;X)

Fig. 5.14. Plot of I(T ; X) vs. I(T ; Y ) for three different values of the cardinality
of T : |T | = 3, |T | = 13 and |T | = 23 and for input data in Fig. 5.15. A part of
this plot is zoomed in order to show that when plots corresponding to different T
cardinalities arrive to a specific value, they diverge. By Altering the β parameter
we are displacing through any of these convex curves, depending on the growth rate
given in the text. This fact suggests that a deterministic annealing approach could
be applied.

in Fig. 5.15. The plot shows that the growth rate of the curve is different
depending on the β parameter. Specifically, this ratio is given by

δI(T ; Y )
= β −1 > 0 (5.22)
δI(X; T )

There is a different convex curve for each different cardinality of the set
T . Varying the value of β, we may move through any of these convex curves
on the plane Ix Iy . This fact suggests that a deterministic annealing (DA)
approach could be applied in order to find an optimal clustering.
5.5 Agglomerative IB Clustering 177

0.5

0 0
300 50
100
200 150
200
100
250
300

Fig. 5.15. Input data used to compute the plots in Figs. 5.14 and 5.16. Given two
classes Y = {y1 , y2 }, the shown distribution represents p(X, Y = y1 ) (from which
p(X, Y ) can be estimated).

5.5 Agglomerative IB Clustering

The agglomerative Information Bottleneck Clustering (AIB) was designed as


an alternative to other methods such as the generalization of the Blahut–
Arimoto algorithm explained above. The main feature of this greedy algorithm
is that rather than being a soft clustering algorithm, it reduces data by means
of a bottom-up hard clustering; that is, each sample is assigned to an only
cluster. It is based on relating I(T ; Y ) with Bayesian error. The algorithm is
deterministic, in the sense that it yields hard clustering for any desired num-
ber of clusters. Furthermore, it is not parametric. And it is not limited to hard
clustering: due to the fact that the results of this method may be considered
as the limit of the Information Bottleneck based on deterministic annealing
(when zero temperature is reached), using the Information Bottleneck equa-
tions shown in previous sections these results must be transformed to soft
clustering. However, the main drawbacks of the method are two. First, it is a
greedy algorithm, thus its results are optimal in each step but it is globally
suboptimal. And second, its bottom up behavior, starting with a cluster for
each sample and joining them until only one cluster is achieved, makes this
algorithm computationally expensive.

5.5.1 Jensen–Shannon Divergence and Bayesian Classification


Error

The Information Bottleneck principle may also be interpreted as a Bayesian


classification problem: given a set of samples X, these samples must be clas-
sified as members of a set of classes Y = {y1 , y2 , . . . , yn } with a priori proba-
bilities {p(yi )}. In this case, Bayesian error is given by

PBayes (e) = p(x)(1 − maxi p(yi |x)) (5.23)
x∈X
178 5 Image and Pattern Clustering

The following equation shows how this Bayesian error is bounded by


Jensen–Shannon divergence:
1 1
(H(Y )−JSp(yi ) [p(x|yi )])2 ≤ PBayes (e) ≤ (H(Y )−JSp(yi ) [p(x|yi )])
4(M − 1) 2
(5.24)
where the Jensen–Shannon divergence of M distributions pi (x), each one hav-
ing a prior Πi , 1 ≤ i ≤ M , is
!M "
 
M
JSΠ [p1 , p2 , . . . , pM ] = H Πi pi (x) − Πi H[pi (x)] (5.25)
i=1 i=1
The following derivation also relates Jensen–Shannon to Mutual Informa-
tion:
!M " M
 
JSp(y1 ,...,p(yM ) [p(x|yi ), . . . , p(x|yM )] = H p(yi )p(x|yi ) − p(yi )H[p(x|yi )]
i=1 i=1
= H(X) − H(X|Y ) = I(X; Y )
We will see later how these properties are applied to AIB.

5.5.2 The AIB Algorithm


Figure 5.12 shows examples of Blahut–Arimoto results. As can be seen, given
a fix value for parameter |T |, the final number of clusters increases for higher
β values. This effect is also produced in the Blahut–Arimoto generalization
for Information Bottleneck. In general, given any finite cardinality |T | ≡ m,
as β → ∞ we reach a data partition in which each x ∈ X is part of an only
cluster t ∈ T for which DKL [p(y|x)||p(y|t)] is minimum and p(t|x) only takes
values in {0, 1}. The AIB algorithm starts from an initial state with |T | = |X|,
and proceeds in each step joining two clusters, selected by means of a greedy
criterion, until all samples are compressed in an only cluster. A hard clustering
constraint is applied during this process.
Given an optimal partition of X into a set of clusters Tm = {t1 , t2 , . . . , tm },
a union of k of these clusters to create a new union cluster tk , and the cor-

responding new partition Tm , always yields a loss of information, due to the

fact that I(Tm ; Y ) ≥ I(TM ; Y ). Therefore, each greedy step of the algorithm
selects the union cluster tk that minimizes this loss of information, that is
estimated from the following distributions, in order to yield a new partition
in which m = m − k + 1:


⎪ k

⎪ p(t 
) = p(ti )

⎪ k


⎨ i=1
 k
p(y|t 
) = 1
p(ti , y), ∀y ∈ y (5.26)

⎪ k 
p(tk )




i=1

⎪  1 if x ∈ ti for any 1 ≤ i ≤ k
⎩ p(tk |x) = ∀x ∈ X
0 otherwise
5.5 Agglomerative IB Clustering 179

Before dealing with the AIB algorithm, some concepts must be introduced:
• The merge prior distribution of tk is given by Πk ≡ (Π1 , Π2 , . . . , Πk ),
p(ti )
where Πk is the a priori probability of ti in tk , Πi = p(t ).
k
• The decrease of information in Iy = I(T ; Y ) due to a merge is

δIy (t1 , . . . , tk ) = I(TM ; Y ) − I(TM ;Y ) (5.27)

• The decrease of information in Ix = I(T ; X) due to a merge is



δIx (t1 , . . . , tk ) = I(TM ; X) − I(TM ; X) (5.28)

From these definitions, it can be demonstrated that:

δIx (t1 , . . . , tk ) = p(tk )H[Πk ] ≥0 (5.29)



δIy (t1 , . . . , tk ) = p(tk )JSΠk [p(Y |t1 ), . . . , p(Y |tk )] ≥0 (5.30)

The algorithm is shown in Alg. 9. The cardinality of the initial partition


is m = |X|, containing each cluster an only sample. In each step, a greedy

Algorithm 9: Agglomerative Information Bottleneck


Input: p(x, y), N = |X|, M = |Y |
Construct T ≡ X:
for i = 1 . . . n do
ti = {xi }
p(ti ) = p(xi )
p(y|ti ) = p(y|xi ) for every y ∈ Y
p(t|xj ) = 1 if j = i and 0 otherwise
end
T = {t1 , . . . , tj }

Information loss precalculation:


for every i, j = 1 . . . N, i < j do
di,j = (p(ti ) + p(tj ))JSΠ2 [p(y|ti ), p(y, tj )]
end

Main loop:
for t = 1 . . . (N − 1) do
Find {α, β} = argmini,j {di,j }.
Merge {zα , zβ } ⇒ t .
p(t ) = p(zα ) + p(zβ )
p(y|t ) = p(t1 ) (p(tα , y) + p(tβ , y)) for every y ∈ Y
p(t |x) = 1 if x ∈ zα ∪ zβ and 0 otherwise, for every x ∈ X
Update T = {T − {zα , zβ }} ∪ {t }.
Update di,j costs w.r.t. t , only for couples that contain zα or zβ .
end
Output: Tm : m-partition of X into m clusters, for every 1 ≤ m ≤ N
180 5 Image and Pattern Clustering

decision selects the set of clusters to join that minimizes δIy (t1 , . . . , tk ), al-
ways taking k = 2 (pairs of clusters). Clusters are joined in pairs due to a
property of the information decrease: δIy (t1 , . . . , tk ) ≤ δIy (t1 , . . . , tk+1 ) and
δIx (t1 , . . . , tk ) ≤ δIx (t1 , . . . , tk+1 ) ∀k ≥ 2; the meaning of this property is that
any cluster union (t1 , . . . , tk ) ⇒ tk can be built as (k − 1) consecutive unions
of cluster pairs; for 1 ≤ m ≤ |X|, the optimal partition may be found from
(|X|−m) consecutive union of cluster pairs. Therefore, the loss of information
δIy must be computed for each cluster pair in Tm in order to select cluster
pairs to join. The algorithm finishes when there is only one cluster left. The
result may be expressed as a tree from where cluster sets Tm may be inferred
for any m = |X|, |X| − 1, . . . , 1 (see Fig. 5.16).
In each loop iteration the best pair union must be found; thus, complex-
ity is O(m|Y |) for each cluster pair union. However, this complexity may be
decreased to O(|Y |) if the mutual information loss due to a pair of clusters
merging is estimated directly by means of one of the properties enumerated
above: δIy (t1 , . . . , tk ) = p(tk )JSΠk [p(Y |t1 ), . . . , p(Y |tk )]. Another improve-
ment is the precalculation of the cost of merging any pair of clusters. Then,
when two clusters ti and tj are joined, this cost must only be updated only
for pairs that include one of these two clusters.
A plot of I(T ; Y )/I(X; Y ) vs. I(T ; X)/H(X) is shown in Fig. 5.16. As can
be seen, the decrease of mutual information δ(m) when decreasing m, given
by the next equation, only can increase

0.8
I(Z;X)/H(X)

0.6

0.4

0.2

0
0 0.2 0.4 0.6 0.8 1
I(Z;Y)/I(X;Y)

Fig. 5.16. Results of the AIB algorithm applied to a subset of the data in Fig. 5.15.
This subset is built from 90 randomly selected elements of X. Left: tree showing
the pair of clusters joined in each iteration, starting from the bottom (one cluster
assigned to each data sample) and finishing at the top (an only cluster containing all
the data samples). Right: plot of I(T ; Y )/I(X; Y ) vs. I(T ; X)/H(X) (information
plane). As the number of clusters decreases, data compression is higher, and as
a consequence I(T ; X) tends to zero. However, less number of clusters also means
that the information that these clusters T give about Y decreases; thus, I(T ; Y ) also
tends to zero. The algorithm yields all the intermediate cluster sets; so a trade-off
between data compression and classification efficiency may be searched.
5.5 Agglomerative IB Clustering 181

I(Tm ; Y ) − I(Tm−1 ; Y )
δ(m) = (5.31)
I(X; Y )

This measure provides a method to perform a basic order selection. When


δ(m) reaches a high value, a meaningful partition of clusters has been achieved;
further merges will produce a substantial loss of information, and the algo-
rithm can stop at this point. However, this point is reached when m is low,
and in this case the complexity of AIB is negligible; thus, usually it is worthy
to finish the algorithm and obtain the complete cluster tree, as the one shown
in Fig. 5.16.
It must be noted that usually hard clustering from AIB yields worser
results than any other Information Bottleneck algorithm based on soft clus-
tering. The cause is that soft clustering is optimal, as it maximizes I(T ; Y )
while constraining I(T ; X). However, any I(T ; X) vs. I(T ; Y ) curve in the
information plane, obtained by annealing, intersects with the curve obtained
from AIB. Thus, AIB may be considered as a good starting point, from which
reverse annealing will guide to an ultimate solution.

5.5.3 Unsupervised Clustering of Images


This section is aimed to show an application of Information Bottleneck to com-
puter vision in the image retrieval field. The objective of an image retrieval sys-
tem is: given an input query image, answer with a ranked set of images, from
most to less similar, extracted from a database. If the images of the database
are clustered, then an exhaustive search in the database is not needed, and
the process is less computational expensive. Clustering of the database im-
ages by means of Information Bottleneck principle may help to maximize the
mutual information between clusters and image content, not only improving
efficiency, but also performance.
In the work by Goldberger et al. [63], the images are modeled by means of
Gaussian Mixtures, and clustered using AIB. However, due to the fact that
Gaussian Mixture Models are not discrete, the AIB algorithm must be adapted
to continuous distributions. An example of how the images are modeled by
means of Gaussian Mixtures is shown in Fig. 5.17. Each pixel is described
by a feature vector built from its color values in the Lab color space and its
localization in the image (although only these features are used, the model is
general and could be extended using other features like texture, for instance).
From this representation, a Gaussian Mixture Model f (x) is extracted for
each image X, grouping pixels in homogeneous regions defined by Gaussian
distributions.
An usual similarity measure in image applications is the Kullback–Leibler
divergence, and in this work this is also the measure applied to compare im-
ages or clusters. This measure is easily estimated from histograms or discrete
distributions, as we have seen during previous chapters. In the case of two dis-
tributions f and g representing Gaussian Mixture Models, Kullback–Leibler
may be approximated from Monte Carlo simulations:
182 5 Image and Pattern Clustering

b 0.5
1
a
0.4

Information Loss (bits)


0.3
2

0.2 3
4

5
0.1 876
10 9
18171413
19 11
12
1615
0
1400 1410 1420 1430 1440 1450 1460
Algorithm Steps
1

c 2

3
4
5 6
8 7
9
10
11
12
13
14
16 15
17
18

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

d
6
AIB - Color+XY GMM
AIB - Color GMM
AIB - Color histogram
5 AHI - Color histogram

4 e
Image Representation I (X,Y)
Color Histogram 2.08
I(C;Y)

3 Color GMM 3.93


Color + XY GMM 5.35
2

0
8 7 6 5 4 3 2 1 0
I(C;X)

Fig. 5.17. From left to right and from top to bottom. (a) Image representation. Each
ellipsoid represents a Gaussian in the Gaussian Mixture Model of the image, with its
support region, mean color and spatial layout in the image plane. (b) Loss of mutual
information during the IAB clustering. The last steps are labeled with the number of
clusters in each step. (c) Part of the cluster tree formed during AIB, starting from 19
clusters. Each cluster is represented with a representative image. The labeled nodes
indicate the order of cluster merging, following the plot in (b). (d) I(T ; X) vs.
I(T ; Y ) plot for four different clustering methods. (e) Mutual Information between
images and image representations. (Figure by Goldberger et al. (2006
c IEEE)). See
Color Plates.
5.5 Agglomerative IB Clustering 183

1
n
f f (xt )
D(f ||g) = f log ≈ log (5.32)
g n t=1 g(xt )

where x1 , x2 , . . . , xn are sampled from f (x).


As stated before, image clustering is performed by means of AIB, in which
X represents the image set to classify, p(x) the priors, that are considered uni-
form, and Y the random variable associated with the feature vector extracted
from a unique pixel. The distribution f (y|x), from which the input parameter
p(x, y) may be estimated, is given by


k(x)
f (y|x) = αx,j N (μx,j , Σx,j ) (5.33)
j=1

where k(x) is the number of Gaussian components for image x. There are
several equations of Alg. 9 that must be adapted to Gaussian Mixture Model.
For instance, given a cluster t, f (y|t) is the mean of all image models that are
part of t:
1  1 
k(x)
f (y|t) = f (y|x) = αx,j N (μx,j , Σx,j ) (5.34)
|t| x∈t |t| j=1

As can be seen, f (y|t) is also modeled as a Gaussian Mixture. When two


clusters t1 and t2 are joined, the updating of this joint distribution is given by

1   |ti |
f (y|t1 ∪ t2 ) = f (y|x) = f (y|ti ) (5.35)
|t1 ∪ t2 | x∈t ,t |t ∪ t2 |
i=1,2 1
1 2

Finally, the cost of merging two clusters is given by


 |ti |
d(t1 , t2 ) = D(f (y|ti )||f (y|t1 ∪ t2 )) (5.36)
i=1,2
|X|

where |X| is the size of the image database. From this formulation, AIB can
be applied as explained in previous section. In Fig. 5.17, the loss of mutual
information for a given database image is shown. In the same figure, part of
the cluster tree is also represented.
Mutual information is used in this work as a measure of quality. For in-
stance, the clustering quality may be computed as I(X; Y ), X being the unsu-
pervised clustering from AIB and Y a manual labeling. Higher values denote
better clustering quality, as the cluster gives more information about the im-
age classes. The quality of the image representation can also be evaluated by
means of I(X; Y ), in this case X being the set of images and Y the features
extracted from their pixels. Due to the fact that a closed-form expression to
calculate I(X; Y ) for Gaussian Mixtures does not exist, this mutual informa-
tion is approximated from I(T ; X) and I(T ; Y ). In Fig. 5.17, the I(T ; X) vs.
184 5 Image and Pattern Clustering

I(T ; Y ) plot is shown for AIB using Gaussian Mixture Models based on color
and localization (AIB – Color+XY GMM), Gaussian Mixture Models based
only on color (AIB – Color GMM), and without Gaussian Mixtures for AIB
and Agglomerative Histogram Intersection (AIB – Color histogram and AHI –
Color histogram). In all cases, I(X; Y ) is extracted from the first point of the
curves, due to the fact that the sum of all merges cost is exactly I(X; Y ).
Related to this plot, we show in Fig. 5.17 a table that summarizes these val-
ues. The best quality is obtained for color and localization based Gaussian
Mixture Models.
Image retrieval from this clustering is straightforward. First the image
query is compared to all cluster representatives, and then it is compared to
all images contained in the selected cluster. Not only computational efficiency
is increased, compared to an exhaustive search, but also previous experiments
by Goldberger et al. show that clustering increases retrieval performance.
However, this approach has several drawbacks, the main one being that it
is still computationally expensive (training being the most time consuming
phase). Furthermore, no high level information like shape and texture is used,
and as a consequence all images in a category must be similar in shape and
appearance.

5.6 Robust Information Clustering


A recent clustering algorithm based on the principles of Rate Distortion and
information theory is Robust Information Clustering (RIC) [148]. This algo-
rithm is nonparametric, meaning that it is not based on data pdfs. The idea
behind the algorithm is to apply two steps in order to perform a minimax
mutual information approach. During the first step, minimization is achieved
by means of a rate distortion process which leads to a deterministic annealing
(DA) that splits the data into clusters. The annealing process continues until
a maximum number of clusters is reached. Maximization step is intended to
identify data outliers for each different number of clusters. An optimal number
of clusters is estimated in this step based on Structural Risk Minimization.
The Deterministic Annealing step has been described in previous sections
(see Blahut–Arimoto algorithm for details). Thus, we focus on the use of the
channel capacity, an information theory related measure, to detect outliers
and estimate the model order during RIC algorithm. To understand channel
capacity, we must think in terms of communications over a noisy channel.
Messages x ∈ X are sent through this channel, being received at the other
side as y ∈ Y . In this case, mutual information I(X; Y ) represents the quality
of the channel when transmitting messages. Thus, channel capacity can be
described as the maximum of this mutual information, and represents the
maximum information rate that can travel through it:

C = max I(X; Y ) (5.37)


p(x)
5.6 Robust Information Clustering 185

When information rate exceeds the channel capacity, then this communi-
cation is affected by distortion. Properties of channel capacity are:
1. C ≥ 0, due to the fact that I(X|Y ) ≥ 0.
2. C ≤ log |X|, due to the fact that C = max I(X; Y ) ≤ max H(X) = log |X|.
3. C ≤ log |Y |, for the same reason.
4. I(X; Y ) is a continuous function of p(x).
5. I(X; Y ) is a concave function of p(x) (thus, a maximum value exists).
During Robust Information Clustering, and given the sample priors p(X)
and the optimal clustering p̄(W |X) obtained after a Deterministic Annealing
process, p(X) is updated in order to maximize the information that samples
x ∈ X yield about their membership to each cluster w ∈ W (see the last
channel capacity property):

C(D(p(X))) = max C(D(p(X))) = max I(p(X); p̄(W |X)) (5.38)


p(X) p(X)

This channel capacity is subject to a constraint D(p(X)) given by


l 
K
D(p(X)) = p(xi )p̄(wk |xi )d(wk , xi ) (5.39)
i=1 k=1

where d(wk |xi ) is the dissimilarity between sample xi ∈ X and cluster centroid
wk ∈ W . In order to perform the maximization, a Lagrange multiplier λ ≥ 0
may be introduced:

C(D(p(X))) = max[I(p(X); p̄(W |X)) + λ(D(p̄(X)) − D(p(X)))] (5.40)


p(X)

In the latter equation, p̄(X) is a fixed unconditional a priori pmf (usually


a equally distributed probability function). Based on a robust density estima-
tion, with D(p(X)) = D(p̄(X)) for p(xi ) ≥ 0, the maximum of the constrained
capacity for each xi is given by
ci
p(xi ) = p(xi ) l (5.41)
p(xi )ci
i=1

where:
⎡ ⎤
⎢ 
K
p̄(wk |xi ) ⎥
ci = exp ⎣ (p̄(wk |xi ) ln l − λp̄(wk |xi )d(wk , xi )⎦
k=1 p(xj )p̄(wk |xi )
j=1
(5.42)

If p(xi ) = 0 for any xi ∈ X, then xi is considered an outlier. In this regard,


λ may be understood as an outlier control parameter; depending on the value
186 5 Image and Pattern Clustering

Empirical risk
VC dimension
Risk True risk

S3
S2
S1

Model complexity

Fig. 5.18. Left: after choosing a class of functions to fit the data, they are nested
in a hierarchy ordered by increasing complexity. Then, the best parameter config-
uration of each subset is estimated in order to best generalize the data. Right: as
complexity of the functions increases, the VC dimension also increases and the em-
pirical error decreases. The vertical dotted line represents the complexity for which
the sum of VC dimension and empirical error is minimized; thus, that is the model
order. Functions over that threshold overfit the data, while functions below that
complexity underfit it.

of λ, more samples will be considered as outliers. However, it must be noted


that λ = 0 should not be chosen, as the resultant number of outliers is an
essential information during model order selection.
Regarding model order, its estimation is based on Structural Risk Mini-
mization, that involves finding a trade-off between a classifier empirical error
(i.e., its fitting quality) and its complexity. Complexity is measured in terms of
VC (Vapnik and Chervonenkis) dimension. Figure 5.18 illustrates this process,
that may be summarized as
1. From a priori knowledge, choose the class of functions that fit to the data:
n degree polynomials, n layer neural networks, or n cluster data labeling,
for instance.
2. Classify the functions in this class by means of a hierarchy of nested subsets,
ordered by increasing complexity. In the case of polynomials, for instance,
they may be ordered in an increasing degree order.
3. Apply an empirical risk minimization for each subset, that is, find for each
subset the parameters that better fit the data.
4. Select the model for which the sum of empirical risk (that decreases as
complexity increases) and VC dimension (that increases as complexity in-
creases) is minimal.
In the case of RIC algorithm, the set of nested subsets is defined by

S1 ⊂ S2 ⊂ · · · ⊂ SK ⊂ · · · (5.43)

where SK = (QK (xi , W ) : w ∈ ΛK ), ∀i, with a set of functions that indicate


the empirical risk for the fitted model during deterministic annealing:
5.6 Robust Information Clustering 187


K 
K
p(wk ) exp(−d(xi , wk )/T )
QK (xi , W ) = lim p(wk |xi ) = lim K
T →0 T →0
k=1 k=1 p(wk ) exp(−d(wk , xi )/T )
k=1
(5.44)
When T → 0, then p(wk |xi ) may be approximated as the complement of
a step function, that is linear in parameters and assigns a label to each sam-
ple depending on the dissimilarity between sample xi and cluster wk . Thus,
as stated by Vapnik [165], VC dimension may be estimated from parameter
number, being hk = (n + 1)k for each Sk . Then increment in cluster number
leads to an increment of complexity: h1 ≤ h2 ≤ · · · ≤ hk ≤ · · · . From this
starting point, model order is selected minimizing the following VC bound,
similarly to Vapnik application to Support Vector Machines:
  1/2 
ε 4
ps ≤ η + 1+ 1+η (5.45)
2 ε

where
m
η= (5.46)
l
hk (ln h2lk + 1) − ln 4ξ
ε=4 (5.47)
l
l and m being the number of samples and outliers, and ξ < 1 a constant.
The RIC algorithm can be seen in Alg. 10. From the dataset X = {x1 ,
. . . , xl } and a chosen maximum number of clusters Kmax , the algorithm re-
turns these data splitted into a set of clusters with centers W = {w1 , . . . , wk }
and identifies the outliers. Song provides an expression to estimate the param-
eter Kmax , but depending on data nature this expression may not be valid;
thus, we leave it as an open parameter in Alg. 10. Regarding dissimilarity
measure, the euclidean distance was the one chosen for this algorithm. Due
to this fact, RIC tends to create hyperspherical clusters around each cluster
center. Alternative dissimilarity measures may help to adapt the algorithm to
kernel based clustering or data that cannot be linearly separable.
An example of application is shown in Fig. 5.19. In this example, RIC is
applied to a dataset obtained from four different Gaussian distributions, thus,
that is the optimal number of clusters. In this example, we set Tmin to 0.1
and α to 0.9. Parameter ξ, used during order selection, is set to 0.2. Finally,
the parameters ε and λ, that affect the amount of outliers found during the
algorithm, were set to 1 and 0.05, respectively. As can be seen, although
clustering based on five clusters from deterministic annealing (K = 5) yields
no outliers, the algorithm is able to find the optimal number of clusters in
the case of K = 4, for which several outliers are detected. Note that the
model order selection step needs information about noisy data. Therefore, if
λ = 0 is used to solve the same example, the algorithm reaches Kmax without
discovering the optimal model order of the data.
Algorithm 10: Robust Information Clustering

Input: n-dimensional samples X, |X| = l, and Kmax

1. Initialization
   - Priors initialization: p̄(x) = 1/l
   - Temperature initialization: T > 2λmax(Vx), where λmax(Vx) is the largest
     eigenvalue of the covariance matrix Vx of X
   - Cluster initialization: K = 1 and p(w1) = 1 (the algorithm starts with a
     single cluster)

2. Deterministic annealing: the aim of this step is to obtain the optimal soft
   clustering p(w|x) for the current K and T
   while no convergence do
     for i = 1 . . . K do
       p(wi|x) = p(wi) exp(−||x − wi||²/T) / Σ_{j=1}^{K} p(wj) exp(−||x − wj||²/T)
       p(wi) = Σ_x p̄(x) p(wi|x)
       wi = Σ_x p̄(x) p(wi|x) x / p(wi)
     end
   end

3. Cooling: T = αT, with α < 1

4. Cluster partition step: compute λmax(Vx^k) for each cluster k, where
     Vx^k = Σ_{i=1}^{l} p(xi|wk)(xi − wk)(xi − wk)^T
   Let M = min_{k=1...K} λmax(Vx^k), attained at cluster k̄, which may be split
   during this step.
   if T ≤ 2M and K < Kmax then
     Outlier detection and order selection:
       Initialization: p(xi) = 1/l, λ > 0, ε > 0, and p̄(w|x) the optimal
       clustering partition given by step 2
       while |Σ_{i=1}^{l} ln p(xi)ci − ln max_{i=1...l} ci| < ε do
         for i = 1 . . . l do
           ci = exp[ Σ_{k=1}^{K} ( p̄(wk|xi) ln( p̄(wk|xi) / Σ_{j=1}^{l} p(xj)p̄(wk|xj) )
                                   − λ p̄(wk|xi) ||wk − xi||² ) ]
           p(xi) = p(xi) ci / Σ_{i=1}^{l} p(xi) ci
         end
       end
     Order selection: calculate ps (Eq. 5.45) for the current K. If a minimum of
       ps is found, the optimal order has been achieved; in this case, STOP the
       algorithm and return the clustering for T = 0
     Cluster partitioning: cluster k̄ is split into two new clusters:
       wK+1 = wk̄ + δ and wk̄ = wk̄ − δ
       p(wK+1) = p(wk̄)/2, p(wk̄) = p(wk̄)/2
       K = K + 1
   end

5. Temperature check: if T > Tmin and K < Kmax, return to step 2. Otherwise,
   STOP the algorithm and return the clustering for T = 0.

Output: W: K cluster centers, and p(W|X)
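
To make step 2 concrete, the following is a minimal NumPy sketch of the deterministic annealing update for a fixed temperature T. The function name, the convergence test and the iteration cap are our own choices and not part of the original formulation; X is an l × n sample matrix, W a K × n matrix of centers and p_w the vector of priors p(wi).

import numpy as np

def annealing_step(X, W, p_w, T, n_iter=100, tol=1e-6):
    p_x = np.full(len(X), 1.0 / len(X))                       # priors p(x) = 1/l
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)   # ||x - w_i||^2
        num = p_w[None, :] * np.exp(-d2 / T)
        p_w_given_x = num / num.sum(axis=1, keepdims=True)         # p(w_i | x)
        p_w = (p_x[:, None] * p_w_given_x).sum(axis=0)             # updated p(w_i)
        W_new = (p_x[:, None] * p_w_given_x).T @ X / p_w[:, None]  # updated centers
        if np.abs(W_new - W).max() < tol:                          # centers stabilized
            W = W_new
            break
        W = W_new
    return W, p_w, p_w_given_x

Steps 3-5 of Alg. 10 would wrap this routine in the cooling loop, lowering T and splitting a cluster whenever the condition T ≤ 2M is met.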

5.7 IT-Based Mean Shift

Information theory has not only been incorporated into clustering algorithms in
recent years; it can also help to assess, or even to theoretically analyze, this kind
of method. An example is given in this section, where mean shift, a popular
clustering and order selection algorithm, is examined from an information-theoretic
point of view [135].

5.7.1 The Mean Shift Algorithm

Mean shift is a nonparametric clustering and order selection algorithm based
on pdf estimation from Parzen windows. Given x ∈ X = {xi}_{i=1}^{N}, the Parzen
window estimate is defined as

    P(x, σ) = (1/N) Σ_{i=1}^{N} G(||x − xi||², σ)                          (5.48)

where G(t, σ) = exp(−t/(2σ²)) is a Gaussian kernel of width σ. In order to
estimate the pdf modes from a set of samples obtained from a multimodal
Gaussian density, the mean shift algorithm looks for stationary points where
∇P(x, σ) = 0. This problem may be solved by means of an iterative procedure
in which x^{t+1} at iteration t + 1 is obtained from its value x^t at the previous
iteration:

    x^{t+1} = m(x^t) = Σ_{i=1}^{N} G(||x^t − xi||², σ) xi / Σ_{i=1}^{N} G(||x^t − xi||², σ)     (5.49)

In the latter equation, m(x) represents the weighted mean of all samples xi ∈ X
when the Gaussian kernel is centered at x. In order to reach a stationary
point, the kernel centered at x must follow the direction of the mean shift
vector m(x) − x. The effect of the normalization in m(x) is that the kernel moves
Fig. 5.19. Example of application of RIC to a set of 2D samples gathered from four
different Gaussian distributions. From left to right, and from top to bottom: data
samples, optimal hard clustering for K = 1 (ps = 0.4393), optimal hard clustering
for K = 2 (ps = 0.6185), optimal hard clustering for K = 3 (ps = 0.6345), optimal
hard clustering for K = 4 (ps = 0.5140), and optimal hard clustering for K = 5
(ps = 0.5615). In all cases, each cluster is represented by a convex hull containing
all its samples, and a circle marks the cluster center. Outliers are
represented by star symbols. The minimum ps is found at K = 4; thus, although
in other cases the amount of outliers is lower, K = 4 is the optimal number of
clusters returned by the algorithm.

with large steps through low-density regions and with small steps otherwise; no
step size estimation is needed. However, the kernel width σ remains an
important parameter.
The iterative application of Eq. 5.49 until convergence is known as Gaussian
Blurring Mean Shift (GBMS). This process searches for the pdf modes while
blurring the initial dataset. First, the samples evolve toward the modes of the pdf
while mutually approaching. Then, from a given iteration on, the data tend to
collapse quickly, making the algorithm unstable. The blurring effect may be
avoided if the pdf estimation is based on the original configuration of the samples.
This improvement is known as Gaussian Mean Shift (GMS). In GMS, the pdf
estimate is obtained by comparing the samples at the current iteration with the
original ones x_i^o ∈ X^o:

    x^{t+1} = m(x^t) = Σ_{i=1}^{N} G(||x^t − x_i^o||², σ) x_i^o / Σ_{i=1}^{N} G(||x^t − x_i^o||², σ)     (5.50)

The samples in X^o are not modified during this algorithm, which is stable
and always converges with a linear convergence rate. An additional property
of this algorithm is that the trajectory followed by the samples toward the modes
is smooth: the angle between two consecutive mean shift vectors is always
in the range (−π/2, π/2). Although the GMS algorithm is still sensitive to the
parameter σ, this parameter may be dynamically adapted using some of the
methods reviewed in this chapter.
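
As an illustration, one iteration of each variant can be sketched in a few lines of NumPy. This is only a sketch under the notation of Eqs. 5.49 and 5.50; the function names are ours.

import numpy as np

def gaussian_weights(A, B, sigma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)   # squared distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

def gbms_step(X, sigma):
    W = gaussian_weights(X, X, sigma)            # kernel between current samples (Eq. 5.49)
    return (W @ X) / W.sum(axis=1, keepdims=True)

def gms_step(X, X0, sigma):
    W = gaussian_weights(X, X0, sigma)           # kernel against the original samples (Eq. 5.50)
    return (W @ X0) / W.sum(axis=1, keepdims=True)

Iterating gbms_step updates (and blurs) the dataset itself, whereas iterating gms_step(X, X0, sigma) always measures densities against the fixed original samples X0, which is what makes GMS stable.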

5.7.2 Mean Shift Stop Criterion and Examples

An advantage of GMS over GBMS is that the samples are not blurred: the kernels
tend to converge to the pdf modes and remain stable. Therefore, a simple stop
criterion for GMS based on a translation threshold may be applied. GMS ends
when the average mean shift magnitude of the samples is lower than a
given threshold, that is,

    (1/N) Σ_{i=1}^{N} d^t(xi) < δ                                          (5.51)

d^t(xi) being the mean shift magnitude corresponding to sample xi at
iteration t:

    d^t(xi) = ||x_i^t − x_i^{t−1}||²                                       (5.52)
Concerning GBMS, there are two main convergence phases. During the
first one, all samples rapidly collapse to their pdf modes, while the modes
themselves are slowly displaced toward each other. Then, depending on the Parzen
window width, all pdf modes also collapse into a single sample. Thus,
in the case of GBMS, we should apply a criterion that stops the algorithm
at the end of the first convergence phase. The criterion described in the original
work by Rao et al. [135] is a heuristic that relies on information theory.
During the second convergence phase of GBMS, the set d^t = {d^t(xi)}_{i=1}^{N}
contains only k different values, k being the detected number of pdf modes.
Splitting d^t into a histogram with a sufficiently fine binning over the range
[0, max(d^t)], only k of these bins will be nonzero. The transition from
the first convergence phase to the second one can be detected when the dif-
ference between the Shannon entropies estimated from this histogram at
consecutive iterations is approximately equal to zero:

    |H(d^{t+1}) − H(d^t)| < 10^{-8}                                        (5.53)
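
A minimal sketch of this criterion is given below; the number of histogram bins is an assumption on our part, since the text only requires a sufficiently fine binning, and the function names are ours.

import numpy as np

def shannon_entropy_of_shifts(d, n_bins=50):
    hist, _ = np.histogram(d, bins=n_bins, range=(0.0, d.max()))
    p = hist / hist.sum()
    p = p[p > 0]                                  # ignore empty bins
    return -(p * np.log2(p)).sum()

def gbms_should_stop(d_prev, d_curr, tol=1e-8):
    # d_prev, d_curr: arrays of mean shift magnitudes d^t(x_i) at consecutive iterations
    return abs(shannon_entropy_of_shifts(d_curr)
               - shannon_entropy_of_shifts(d_prev)) < tol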

Two examples of GMS and GBMS are shown in Figs. 5.20 and 5.21. The
data in the first example are extracted from a ring of 16 Gaussian pdfs with
the same variance and different a priori probabilities (which is why the
Gaussians are drawn with different heights in that figure). In the
second example, the data were extracted from 10 Gaussian pdfs with
random means and variances. As can be seen, in both cases GMS outperforms
GBMS, correctly locating the exact number of pdf modes and giving a more
accurate prediction of the actual mode locations. An effect of GBMS is the
collapse of several modes if their separation is smaller than the kernel size.


Fig. 5.20. Example of GMS and GBMS application. The samples were obtained
from a ring of Gaussian distributions with equal variance and different a priori
probabilities. From left to right and from top to bottom: input data, pdfs from which
data were extracted, GBMS results, and GMS results.

Fig. 5.21. Example of GMS and GBMS application. The samples were obtained
from 10 Gaussian distributions with random means and variances. From left to right
and from top to bottom: input data, pdfs from which data were extracted, GBMS
results, and GMS results.

5.7.3 Rényi Quadratic and Cross Entropy from Parzen Windows

This section studies the relation between Rényi entropy and pdf estimation
based on Parzen windows, a relation that will be exploited in the next section. Let
us first recall the expression of the Rényi quadratic entropy:

    H(X) = − log ∫ P(x)² dx                                                (5.54)

Estimating P(x) with a Parzen window turns the integrand into a sum of products
of pairs of Gaussian kernels. An interesting property is that the integral of the
product of two Gaussian kernels is itself a Gaussian kernel whose variance is the
sum of the original variances. Thus, the previous expression can be written as

    H(X) = − log(V(X))                                                     (5.55)


where

    V(X) = (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} G(||xi − xj||², σ)               (5.56)

It must be noted that this last expression considers every pair of samples.
The contribution of each sample xi ∈ X is given by

    V(xi) = (1/N²) Σ_{j=1}^{N} G(||xi − xj||², σ)                          (5.57)

In the original paper by Rao et al. [135], the samples are described as infor-
mation particles that interact with each other by means of forces analogous to
the laws of physics. From this analogy emerges the association of V(xi) with the
concept of information potential: V(xi) may be understood as the information
potential of xi with respect to the rest of the samples in the dataset (see
Fig. 5.22). Its derivative is given by

    ∂V(xi)/∂xi = (1/N²) Σ_{j=1}^{N} G(||xi − xj||², σ) (xj − xi)/σ²        (5.58)

Following the particle analogy, this derivative F(xi) represents the net infor-
mation force that all the samples exert on xi:

    F(xi) = ∂V(xi)/∂xi = Σ_{j=1}^{N} F(xi|xj)                              (5.59)

where the information force that xj exerts on xi is

    F(xi|xj) = (1/N²) G(||xi − xj||², σ) (xj − xi)/σ²                      (5.60)
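
The following NumPy sketch gathers these quantities, the information potential V(X), the Rényi quadratic entropy and the information forces F(xi), for a whole dataset at once; the function name is ours.

import numpy as np

def renyi_quantities(X, sigma):
    N = len(X)
    diff = X[:, None, :] - X[None, :, :]                          # x_i - x_j
    G = np.exp(-(diff ** 2).sum(axis=2) / (2 * sigma ** 2))       # Gaussian kernel matrix
    V = G.sum() / N ** 2                                          # information potential V(X), Eq. 5.56
    H = -np.log(V)                                                # Rényi quadratic entropy, Eq. 5.55
    # F(x_i) = (1/N^2) sum_j G(||x_i - x_j||^2, sigma) (x_j - x_i) / sigma^2, Eqs. 5.58-5.60
    F = (G[:, :, None] * (-diff)).sum(axis=1) / (N ** 2 * sigma ** 2)
    return V, H, F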
All of this information particle formulation can be extended to the case of
interaction between two different datasets X = {xi}_{i=1}^{N} and Y = {yj}_{j=1}^{M}.
Given the Parzen window estimates of the X and Y pdfs, P_X(x, σ_X) and
P_Y(x, σ_Y), the analysis in this case is based on the Rényi cross entropy:

    H(X; Y) = − log ∫ P_X(t) P_Y(t) dt                                     (5.61)

Then, using the Gaussian kernel property previously introduced:

    H(X; Y) = − log(V(X; Y))                                               (5.62)

where

    V(X; Y) = (1/(NM)) Σ_{i=1}^{N} Σ_{j=1}^{M} G(||xi − yj||², σ)          (5.63)

Fig. 5.22. Representation of the information force within a dataset (top) and be-
tween two different datasets (bottom).

and σ² = σ_X² + σ_Y². Finally, the information force that Y exerts on a sample
xi ∈ X (see Fig. 5.22) is

    F(xi; Y) = ∂V(xi; Y)/∂xi
             = Σ_{j=1}^{M} F(xi|yj)
             = (1/(NM)) Σ_{j=1}^{M} G(||xi − yj||², σ) (yj − xi)/σ²

The opposite force, F(X; yj), is simply derived from F(xi; Y) by swapping
M ↔ N and X ↔ Y.
5.7.4 Mean Shift from an IT Perspective

Studying mean shift from an information theory perspective gives insight into
what mean shift is actually optimizing, and also provides additional evidence
of the stability of GMS over GBMS. From the original dataset X^o, which is
not modified during the process, and an initial X = X^o, we may define
mean shift as an energy minimization problem. The cost function to minimize
as X is updated is given by

    J(X) = min_X H(X)
         = min_X − log(V(X))
         = max_X V(X)
         = max_X (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} G(||xi − xj||², σ)

In the latter expression the logarithm can be removed: since the logarithm is a
monotonic function, optimizing it is equivalent to optimizing its argument,
V(X) in this case. As stated before, the samples xk, k = 1, . . . , N, are modified
in each iteration. In order to find a stable configuration of X, J(X) must be
differentiated with respect to each xk and equated to zero:

    2F(xk) = (2/N²) Σ_{j=1}^{N} G(||xk − xj||², σ) (xj − xk)/σ² = 0        (5.64)

After rearranging Eq. 5.64, we obtain exactly the GBMS iterative sample
update of Eq. 5.49. The conclusion extracted from this derivation is that GBMS
directly minimizes the dataset's overall Rényi quadratic entropy. The cause of
the instability of GBMS is the infinite support of the Gaussian kernel: the only
fully stable configuration is the one in which all samples have collapsed onto a
single point, for which H(X) = 0. In order to avoid this issue, we may choose to
minimize the Rényi cross entropy rather than the Rényi quadratic entropy. The
new cost function is given by

    J(X) = max_X V(X; X^o) = max_X (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} G(||xi − x_j^o||², σ)     (5.65)

which, when differentiated with respect to each xk ∈ X and equated to zero,
yields

    ∂J(X)/∂xk = F(xk; X^o) = 0                                             (5.66)

In this case, the update equation is exactly Eq. 5.50. Figure 5.23 shows
an example of the evolution of the GMS and GBMS cost functions during 59
iterations, for the example in Fig. 5.20. As can be seen, the GMS cost function
smoothly decreases and converges after a few iterations. Conversely, the GBMS

Fig. 5.23. Evolution of GMS and GBMS cost function during 59 mean shift itera-
tions applied to data shown in Fig. 5.20.

cost function keeps decreasing throughout the whole process. The GMS stop
criterion is more intuitive, since it is based directly on its cost function, whereas
GBMS must stop before H(X) = 0 is reached, or all samples would have collapsed
into a single one. The Rényi cross entropy cost function could also be applied
to GBMS: a minimum in the corresponding curve would indicate the transition
from the first convergence phase to the second one, but this approach is only
valid when the pdf modes do not overlap.
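
For monitoring purposes, the cross information potential of Eq. 5.63 (and hence the GMS cost of Eq. 5.65 when both sets have the same size) can be computed with a few lines of NumPy; tracking −log of this value per iteration reproduces curves such as the GMS one in Fig. 5.23. This is only a sketch and the function name is ours.

import numpy as np

def cross_information_potential(X, X0, sigma):
    # V(X; X^o) of Eq. 5.63; equals Eq. 5.65 when |X| = |X0|
    d2 = ((X[:, None, :] - X0[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2)).sum() / (len(X) * len(X0))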

5.8 Unsupervised Classification and Clustering Ensembles

In previous sections of this chapter we have explained the problems of cluster-


ing and we have presented different algorithms which use IT for performing an
unsupervised partition of the data. In this section we address the problem of
using different clustering algorithms and combining their results into a single
clustering result. Such a strategy is intended to reuse the outputs yielded by
different clustering algorithms, producing a new, more robust clustering. Some
of the advantages are the possibility to use available clustering solutions and
parallelization. One way to generate different clustering results is to perform
several clustering procedures by varying the feature sets of the data, as well
as by varying the parameters of the clustering algorithms. For example, an
EM algorithm can generate different clusterings on the same data, if different
initial parameters are used. This is not the case in the EBEM algorithm as it
does not depend on initialization; in this case, the Gaussianity deficiency pa-
rameter could be varied for producing different outputs. For other algorithms
the number of desired clusters in the partition can be changed. Also, very
different clustering algorithms could be run in order to combine their results.
Therefore, using clustering ensembles can be useful for creating a partition
with little hand-tuning or when there is little knowledge about the data.
In clustering ensembles additional problems arise, apart from those inher-


ent to the original clustering problem. The most obvious one is the labeling
problem. Suppose we run four different clustering algorithms on the data:

X = {X1 , X2 , . . . , XN } (5.67)

Suppose we have N = 6 samples and the algorithms yield four different par-
titions C = {C1 , C2 , C3 , C4 } which contain the following labelings:

C1 = {1, 1, 2, 2, 3, 3}
C2 = {2, 2, 3, 3, 1, 1}
(5.68)
C3 = {2, 1, 1, 4, 3, 3}
C4 = {2, 1, 4, 1, 3, 3}

These partitions are the result of hard clustering, which is to say that the
different clusters of a partition are disjoint sets. Conversely, in soft clustering
each sample can be assigned to several clusters to different degrees.
How should these labels be combined into a clustering ensemble? In the first
place, the labels are nominal values and bear no relation across the different Ci
clusterings; for example, C1 and C2 correspond to identical partitions despite
the values of their labels. Secondly, some clusterings may agree on some samples
but disagree on others. Also, the fact that some clusterings may yield a
different number of clusters adds extra complexity. Therefore, some kind of
consensus is needed.
At first glance, the consensus clustering C∗ should share as much informa-
tion as possible with the original clusterings Ci ∈ C. This could be formulated
as a combinatorial optimization problem in terms of mutual information [150].
A different strategy is to summarize the clustering results in a co-association
matrix whose values represent the degree of association between objects. Then,
some kind of voting strategy can be applied to obtain a final clustering. An-
other formulation of the problem [131] states that the clusterings Ci ∈ C are,
again, the input of a new clustering problem in which the different labels as-
signed to each sample become its features. In Section 5.8.1 we will explain
this formulation.
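
As an illustration of the co-association strategy mentioned above, here is a minimal NumPy sketch that builds the co-association matrix and extracts a consensus by thresholding it and grouping connected components; the 0.5 threshold and the component-based voting rule are our own simplifications, not a prescribed method.

import numpy as np

def co_association(partitions):
    # partitions: list of 1-D label arrays, all of length N
    P = np.asarray(partitions)                      # H x N
    agree = (P[:, :, None] == P[:, None, :])        # H x N x N agreement indicators
    return agree.mean(axis=0)                       # N x N co-association matrix

def consensus_by_voting(partitions, threshold=0.5):
    M = co_association(partitions) >= threshold
    N = M.shape[0]
    labels = -np.ones(N, dtype=int)
    current = 0
    for i in range(N):                              # connected components of the thresholded matrix
        if labels[i] < 0:
            stack = [i]
            labels[i] = current
            while stack:
                j = stack.pop()
                for k in np.where(M[j] & (labels < 0))[0]:
                    labels[k] = current
                    stack.append(k)
            current += 1
    return labels

With the four partitions of Eq. 5.68 and a threshold of 0.5, this sketch groups the samples as {X1, X2}, {X3, X4}, {X5, X6}.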

5.8.1 Representation of Multiple Partitions

In the previous example we showed four different clusterings for the data
X = {X1 , X2 , · · · , XN } with N = 6 samples. Suppose each Xj ∈ X has
D = 2 features:
X1 = {x11 , x12 }
X2 = {x21 , x22 }
X3 = {x31 , x32 }
(5.69)
X4 = {x41 , x42 }
X5 = {x51 , x52 }
X6 = {x61 , x62 }
The samples Xj can be represented together with the labels of their clusterings
in C. Also, let us use different labels for each one of the partitions, to make
explicit that they have no numerical relation. Now each sample has H = |C|
new features. There are two ways for performing a new clustering: (a) using
both original and new features or (b) using only new features. Using only
the new features makes the clustering independent of the original features. At
this point the problem is transformed into a categorical clustering problem
in a new space of features and it can be solved using various statistical and
IT-based techniques. The resulting clustering C∗ is known as a consensus
clustering or a median partition and it summarizes the partitions defined by
the set of new features.

5.8.2 Consensus Functions

Different consensus functions and heuristics have been designed in the litera-
ture. Co-association methods, re-labeling approaches, the mutual information
approach, and mixture model of consensus are some relevant approaches [131].
Among the graph-based formulations [150] there are the instance-based, the
cluster-based and the hypergraph methods. Some of the characteristics of
these approaches are:

• The co-association method evaluates the similarity between objects, based


on the number of clusters shared by two objects in all the partitions. It has
a quadratic complexity in the number of patterns and it does not provide
good co-association value estimation if the number of clusterings is small.
• Re-labeling approaches solve the label correspondence problem heuristically,
  looking for agreement with some reference partition. Then, a simple voting
  procedure can associate objects with their clusters; this requires a known
  number of clusters in the target consensus partition.
• The Mutual Information approach has as objective function the mutual
information between the empirical probability distribution of labels in the
consensus partition and the labels in the ensemble.
• The Mixture Model of consensus has a maximum-likelihood formulation
based on a finite mixture model, explained in the next subsection.
• The instance-based graph formulation models pairwise relationships
among samples with a fully connected graph, resulting in a partition-
ing problem of size N 2 . Its computational complexity is expensive, and
may vary depending on the algorithm used to partition the graph.
• The cluster-based graph formulation constructs a graph to model the simi-
larities among clusters in the ensemble, then partitions this graph to group
clusters which correspond to one another.
• Hypergraph methods are heuristics for solving the cluster ensemble prob-
lem, by representing the set of clusterings as a hypergraph. In hypergraphs
a hyperedge is a generalization of an edge so that it can connect more than
two vertices. In [150] several hypergraph-based approaches are presented,


such as a cluster similarity partitioning, a minimum cut objective for hy-
pergraph partitioning, and a cluster correspondence problem.
Next we will explain the mixture model of consensus and two mutual
information based objective functions for consensus.

A Mixture Model of Consensus

In this approach for consensus the probabilities of the labels for each pattern
are modeled with finite mixtures. Mixture models have already been described
in this chapter. Suppose each component of the mixture is described by the
parameters θ m , 1 ≤ m ≤ M where M is the number of components in the
mixture. Each component corresponds to a cluster in the consensus clustering,
and each cluster also has a prior probability πm , 1 ≤ m ≤ M . Then the
parameters to be estimated for the consensus clustering are:

Θ = {π1 , . . . , πM , θ 1 , . . . , θ M } (5.70)

In our toy-example we will use M = 2 components in the mixture.1 The data


to be described by the mixture are the “new features,” which are the labels
of all the partitions, obtained from C. We will denote these labels as

Y = {Y1 , . . . , Y N }
(5.71)
Yi = {c1i , c2i , . . . , cHi }

where H = |C|. In other words, in matrix form, we have Y = C'.
It is assumed that all the labels Yi, 1 ≤ i ≤ N, are random and have the
same distribution, which is described by the mixture


    P(Yi|Θ) = Σ_{m=1}^{M} πm Pm(Yi|θm)                                     (5.72)

The labels are also assumed to be independent, so the log-likelihood function for
the parameters Θ given Y is

    log L(Θ|Y) = Σ_{i=1}^{N} log P(Yi|Θ)                                   (5.73)

1
However, the number of final clusters is not obvious, neither in the example, nor
in real experiments. This problem called model order selection has already been
discussed in this chapter. A minimum description length criterion can be used for
selecting the model order.
Now the problem is to estimate the parameters Θ which maximize the likeli-
hood function (Eq. 5.73). For this purpose some densities have to be modeled.
In the first place, a model has to be specified for the component-conditional
densities which appear in Eq. 5.72. Although the different clusterings of the
ensemble are not really independent, it could be assumed that the compo-
nents of the vector Yi (the new features of each sample) are independent,
therefore their probabilities will be calculated as a product. In the second place,
a probability density function (PDF) has to be chosen for the components of
Yi. Since they consist of cluster labels in Ci, the PDF can be modeled as a
multinomial distribution.
A distribution of a set of random variates X1, X2, . . . , Xk is multinomial if

    P(X1 = x1, . . . , Xk = xk) = ( n! / Π_{i=1}^{k} xi! ) Π_{i=1}^{k} ϑi^{xi}     (5.74)

where the xi are non-negative integers such that

    Σ_{i=1}^{k} xi = n                                                     (5.75)

and the ϑi > 0 are constants such that

    Σ_{i=1}^{k} ϑi = 1                                                     (5.76)

Putting this together with the product form of the conditional probability of Yi
(for which conditional independence was assumed), the probability density for
the components of the vectors Yi is expressed as

    Pm(Yi|θm) = Π_{j=1}^{H} Π_{k=1}^{K(j)} [ϑjm(k)]^{δ(yij,k)}             (5.77)

where ϑjm(k) are the probabilities of each label and δ(yij, k) returns 1 when
k is the same as the position of the label yij in the labeling order; otherwise
it returns 0. K(j) is the number of different labels existing in the partition
j. For example, for the labeling of the partition C2 = {b, b, c, c, a, a}, we
have K(2) = 3, and the function δ(yi2, k) evaluates to 1 only for the arguments
δ(a, 1), δ(b, 2), and δ(c, 3), and to 0 for any other arguments.
Also note that in Eq. 5.77, for a given mixture component the probabilities for
each clustering sum to 1:

    Σ_{k=1}^{K(j)} ϑjm(k) = 1,   ∀j ∈ {1, . . . , H}, ∀m ∈ {1, . . . , M}     (5.78)
Table 5.1. Original feature space and transformed feature space obtained from the
clustering ensemble. Each partition of the ensemble is represented with different
labels in order to emphasize the label correspondence problem.

  Sample   Orig. features X   New features C       Consensus
  Xi                          C1   C2   C3   C4    C*
  X1       x11  x12           1    b    N    β     ?
  X2       x21  x22           1    b    M    α     ?
  X3       x31  x32           2    c    M    δ     ?
  X4       x41  x42           2    c    K    α     ?
  X5       x51  x52           3    a    J    γ     ?
  X6       x61  x62           3    a    J    γ     ?

In our toy example, taking the values of Y from Table 5.1 and using
Eq. 5.77, the probability of the vector Y1 = {1, b, N, β} being described
by the 2nd mixture component would be:

    P2(Y1|θ2) = Π_{k=1}^{3} [ϑ12(k)]^{δ(1,k)} · Π_{k=1}^{3} [ϑ22(k)]^{δ(b,k)}
                · Π_{k=1}^{4} [ϑ32(k)]^{δ(N,k)} · Π_{k=1}^{4} [ϑ42(k)]^{δ(β,k)}
              = ϑ12(1)¹ ϑ12(2)⁰ ϑ12(3)⁰
                · ϑ22(1)⁰ ϑ22(2)¹ ϑ22(3)⁰
                · ϑ32(1)⁰ ϑ32(2)¹ ϑ32(3)⁰ ϑ32(4)⁰
                · ϑ42(1)⁰ ϑ42(2)¹ ϑ42(3)⁰ ϑ42(4)⁰
              = ϑ12(1) ϑ22(2) ϑ32(2) ϑ42(2)                                (5.79)


This probability refers to the membership of the first sample to the second
component of the mixture, corresponding to the second cluster of the final
clustering result. The parameters ϑjm (k) which describe the mixture are still
unknown and have to be estimated.
The likelihood function (Eq. 5.73) can be optimized with the EM algo-
rithm, already explained at the beginning of the chapter. It would not be
possible to solve it analytically, as all the parameters in Θ are unknown. The
EM is an iterative algorithm for estimation when there are hidden variables.
In the present problem we will denote these hidden variables with Z. Given
the complete set of data (Y, Z), their distributions have to be consistent, then
the log-likelihood function for this set is given by

log P (Y|Θ) = log P (Y, z|Θ) (5.80)
z∈Z

The EM algorithm iterates two main steps. In the E step the expected val-
ues of the hidden variables are estimated, given the data Y and the current
estimation of the parameters Θ. The following equations result from the
derivation of the EM update equations:

    E[zim] = πm Π_{j=1}^{H} Π_{k=1}^{K(j)} [ϑjm(k)]^{δ(yij,k)}
             / Σ_{n=1}^{M} πn Π_{j=1}^{H} Π_{k=1}^{K(j)} [ϑjn(k)]^{δ(yij,k)}     (5.81)

In the M step, the parameters of the multinomial distribution are updated in
the following way:

    πm = Σ_{i=1}^{N} E[zim] / Σ_{i=1}^{N} Σ_{m=1}^{M} E[zim]               (5.82)

    ϑjm(k) = Σ_{i=1}^{N} δ(yij, k) E[zim]
             / Σ_{i=1}^{N} Σ_{k=1}^{K(j)} δ(yij, k) E[zim]                  (5.83)
After the convergence of the EM algorithm, E[zim] contains the probability
that the pattern Yi was generated by the mth component of the mixture. The
cluster to which the ith sample is assigned is the component of the mixture
with the largest expected value E[zim], for 1 ≤ i ≤ N.
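
The whole EM consensus can be sketched compactly. The following NumPy code assumes that the label matrix Y has already been re-coded so that each column (partition) uses integer labels 0, . . . , K(j) − 1; the random initialization, the iteration count and the small smoothing constant are our own choices, not part of the formulation above.

import numpy as np

def em_consensus(Y, M, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, H = Y.shape
    K = [int(Y[:, j].max()) + 1 for j in range(H)]
    pi = np.full(M, 1.0 / M)
    # theta[j][m, k] = probability of label k in partition j under consensus component m
    theta = [rng.dirichlet(np.ones(K[j]), size=M) for j in range(H)]
    for _ in range(n_iter):
        # E step (Eq. 5.81): responsibilities E[z_im]
        log_r = np.tile(np.log(pi), (N, 1))                    # N x M
        for j in range(H):
            log_r += np.log(theta[j][:, Y[:, j]]).T            # pick theta_jm(y_ij)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M step (Eqs. 5.82 and 5.83)
        pi = r.sum(axis=0) / r.sum()
        for j in range(H):
            counts = np.array([r[Y[:, j] == k].sum(axis=0) for k in range(K[j])])
            counts += 1e-9                                     # smoothing to avoid zero probabilities
            theta[j] = (counts / counts.sum(axis=0, keepdims=True)).T
    return r.argmax(axis=1), pi, theta

Applied to the re-coded label columns of Table 5.1 with M = 2, the returned r.argmax(axis=1) gives the hard consensus assignment of each sample to one of the two components.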

Mutual Information Based Consensus

The mutual information approach to the clustering consensus refers to the


maximization of the mutual information between the empirical probability
distribution of labels in the consensus partition C∗ and the labels in the
ensemble {C1 , C2 , C3 , C4 }. One possibility is to use Shannon’s definition of
mutual information, as do Strehl and Ghosh in [150].
    I(C*; Ci) = Σ_{k=1}^{K*} Σ_{j=1}^{K(i)} P(L*_k, L^i_j) log [ P(L*_k, L^i_j) / ( P(L*_k) P(L^i_j) ) ]     (5.84)

where K(i) is the number of different labels in the partition Ci and K* is the
number of different labels in the consensus partition C* (which is the objective
partition). The sets L^i_j contain the samples assigned to cluster j of clustering i.
Thus, the probabilities P(L*_k), P(L^i_j), and P(L*_k, L^i_j) are calculated as

    P(L*_k) = |L*_k| / N
    P(L^i_j) = |L^i_j| / N                                                 (5.85)
    P(L*_k, L^i_j) = |L*_k ∩ L^i_j| / N

where N is the total number of samples (6 in the example). This equation
calculates the mutual information between only two partitions. In order to
maximize the amount of information that the hypothetical partition C* shares
with all the partitions in the ensemble C, the sum of all the mutual informations
has to be maximized:

    C* = arg max_C Σ_{i=1}^{H} I(C; Ci)                                    (5.86)

Let us see an example with the data in Table 5.1. Suppose we want to
calculate the mutual information between the fourth clustering C4 and a
consensus clustering C* = {1, 1, 1, 2, 2, 2}. We would have the following cluster
labels:

    C4 = {β, α, δ, α, γ, γ},  whose clusters L^4_j are {X1}, {X2, X4}, {X3},
                              and {X5, X6} (one per distinct label)
    C* = {1, 1, 1, 2, 2, 2},  with L*_1 = {X1, X2, X3} and L*_2 = {X4, X5, X6}     (5.87)

Using Eq. 5.84 we calculate the mutual information between the two partitions.
The logarithm log(x) is taken in base 2:

    I(C*; C4) = Σ_{k=1}^{2} Σ_{j=1}^{4} ( |L*_k ∩ L^4_j| / N ) log ( |L*_k ∩ L^4_j| · N / ( |L*_k| · |L^4_j| ) )
              = (1/6) log (1·6/(3·2)) + (1/6) log (1·6/(3·1)) + (1/6) log (1·6/(3·1)) + 0
                + (1/6) log (1·6/(3·2)) + 0 + 0 + (2/6) log (2·6/(3·2))
              = 2/3                                                        (5.88)
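
The following sketch computes Eq. 5.84 directly from two label vectors (the labels may be of any type); applied to C* = {1, 1, 1, 2, 2, 2} and the fourth partition of Table 5.1 it reproduces the value 2/3 obtained in Eq. 5.88. The function name is ours.

import numpy as np

def mutual_information(c_star, c_i):
    c_star, c_i = np.asarray(c_star), np.asarray(c_i)
    N = len(c_star)
    mi = 0.0
    for a in np.unique(c_star):
        for b in np.unique(c_i):
            n_ab = np.sum((c_star == a) & (c_i == b))          # |L*_k ∩ L^i_j|
            if n_ab > 0:
                n_a, n_b = np.sum(c_star == a), np.sum(c_i == b)
                mi += (n_ab / N) * np.log2(n_ab * N / (n_a * n_b))
    return mi

# mutual_information([1, 1, 1, 2, 2, 2], ["b", "a", "d", "a", "g", "g"])  ->  0.666...
# (the strings stand in for the labels beta, alpha, delta, gamma of C4)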
The same calculation should be performed for the other partitions in C in
order to obtain a measure of the amount of information which C∗ shares with
the ensemble.
There is another definition of entropy, the entropy of degree s, which leads
to the formulation of a generalized mutual information, used in [131] for clus-
tering consensus evaluation. The entropy of degree s is defined as

    H^s(P) = (2^{1−s} − 1)^{−1} ( Σ_{i=1}^{n} p_i^s − 1 )                  (5.89)

where P = (p1, . . . , pn) is a discrete probability distribution, s > 0, and s ≠ 1.
In the limit s → 1 it converges to the Shannon entropy. The entropy of degree
s permits a simpler characterization than Rényi's entropy of order r. In partic-
ular, as a probabilistic measure of interclass distance, the quadratic entropy
(entropy of degree s = 2) is closely related to the classification error. With s = 2,
the generalized mutual information is defined as
    I²(C*; Ci) = H²(C*) − H²(C*|Ci)
               = −2 ( Σ_{k=1}^{K*} P(L*_k)² − 1 )
                 + 2 Σ_{j=1}^{K(i)} P(L^i_j) ( Σ_{k=1}^{K*} P(L*_k|L^i_j)² − 1 )     (5.90)

where the probability P(L*_k|L^i_j) can be calculated as

    P(L*_k|L^i_j) = |L*_k ∩ L^i_j| / |L^i_j|                               (5.91)

The formulation of mutual information presented in Eq. 5.90 corresponds to a


clustering consensus criterion which has already been used for clustering, and
is known as a “category utility function.” Fisher used it in COBWEB [60] for
conceptual clustering.
These mutual information functions are objective functions which have
to be maximized. For this purpose, the generation and evaluation of different
hypothetical clusterings C* is necessary. Trying all possible partitions
would be impracticable, so some kind of heuristic or simulated annealing
has to be used. For example, if we are performing hill climbing in the space
of possible consensus clusterings, and we are in the state C* = {1, 2, 1, 2, 2, 2},
there is a set of possible moves to new states. Each one of the six labels could
be changed to l = l ± 1, so there are at most 2N possible states
to evaluate. Once all of them are evaluated, the best one is selected and the
process is repeated. Such an evaluation involves calculating the information
shared with each one of the partitions in the ensemble. For example, let us
evaluate the possible moves to other states from the state {1, 2, 1, 2, 2, 2}.
The notation L^i_j is used in the same way as illustrated in Eq. 5.87.

+ , H
+ ,
I 2 {1, 2, 1, 2, 2, 2}; C = I 2 {1, 2, 1, 2, 2, 2}; Ci
i=1
+ , = 0.2222 + 0.2222 + 0.5556 + 0.8889 = 1.8889
I 2 +{2, 2, 1, 2, 2, 2}; C , = 0.2222 + 0.2222 + 0.2222 + 0.5556 = 1.2222
I 2 +{1, 1, 1, 2, 2, 2}; C , = 0.6667 + 0.6667 + 1.0000 + 0.6667 = 3.0000
I 2 +{1, 3, 1, 2, 2, 2}; C , = 0.5556 + 0.5556 + 0.8889 + 0.8889 = 2.8889
I 2 +{1, 2, 2, 2, 2, 2}; C , = 0.2222 + 0.2222 + 0.5556 + 0.5556 = 1.5556 (5.92)
I 2 +{1, 2, 1, 1, 2, 2}; C , = 0.6667 + 0.6667 + 0.6667 + 0.6667 = 2.6667
I 2 +{1, 2, 1, 3, 2, 2}; C , = 0.5556 + 0.5556 + 0.8889 + 0.8889 = 2.8889
I 2 +{1, 2, 1, 2, 1, 2}; C , = 0.0000 + 0.0000 + 0.3333 + 0.6667 = 1.0000
I 2 +{1, 2, 1, 2, 3, 2}; C , = 0.2222 + 0.2222 + 0.5556 + 0.8889 = 1.8889
I 2 +{1, 2, 1, 2, 2, 1}; C , = 0.0000 + 0.0000 + 0.3333 + 0.6667 = 1.0000
I 2 {1, 2, 1, 2, 2, 3}; C = 0.2222 + 0.2222 + 0.5556 + 0.8889 = 1.8889

The maximal amount of mutual information shared with the partitions of
the ensemble is achieved if the next step is to move to C* = {1, 1, 1, 2, 2, 2}.
Note that we have allowed the exploration of a new class label, "3."
Different constraints regarding the space of label vectors and the possible moves
can be imposed. A hill climbing algorithm has been used in this example; however,
it would stop at the first local maximum. Simulated annealing or a different
search algorithm has to be used to overcome this problem.
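
A minimal sketch of such a hill-climbing search, using the quadratic-entropy mutual information of Eq. 5.90 as the objective, could look as follows; the move generation mirrors the l ± 1 rule of the example, and everything else (names, step limit) is our own choice.

import numpy as np

def i2(c_star, c_i):
    # Generalized mutual information of Eq. 5.90 between two labelings
    c_star, c_i = np.asarray(c_star), np.asarray(c_i)
    p_star = np.array([np.mean(c_star == a) for a in np.unique(c_star)])
    h_star = -2.0 * (np.sum(p_star ** 2) - 1.0)                 # H^2(C*)
    h_cond = 0.0                                                # H^2(C* | C_i)
    for b in np.unique(c_i):
        mask = (c_i == b)
        p_cond = np.array([np.mean(c_star[mask] == a) for a in np.unique(c_star)])
        h_cond += np.mean(mask) * (-2.0) * (np.sum(p_cond ** 2) - 1.0)
    return h_star - h_cond

def hill_climb(ensemble, start, n_steps=20):
    state = list(start)
    score = sum(i2(state, c) for c in ensemble)
    for _ in range(n_steps):
        best_state, best_score = state, score
        for pos in range(len(state)):
            for new_label in (state[pos] - 1, state[pos] + 1):  # the l +/- 1 move rule
                if new_label < 1:
                    continue
                cand = state[:pos] + [new_label] + state[pos + 1:]
                s = sum(i2(cand, c) for c in ensemble)
                if s > best_score:
                    best_state, best_score = cand, s
        if best_score <= score:                                 # local maximum reached
            break
        state, score = best_state, best_score
    return state, score

Starting from {1, 2, 1, 2, 2, 2} with the ensemble of Eq. 5.68, the first move selected by this sketch is {1, 1, 1, 2, 2, 2}, in agreement with Eq. 5.92.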

Problems

5.1 The entropy based EM algorithm and Gaussianity


The EBEM algorithm performs clustering by fitting Gaussian mixtures to the
data; the components of the mixture are the resulting clusters. We presented
an experiment with color image segmentation in the RGB color space. Do
you think that the colors of the pixels of natural images follow a Gaussian
distribution? If the data we want to cluster do not exactly follow a
mixture of Gaussian distributions, which parameter of EBEM would you tune?
Think of the modifications that you would have to make to the EBEM
algorithm in order to make it work for some other distribution model.

5.2 The entropy based EM algorithm for color image segmentation


In the experiment of EBEM with color image segmentation which we pre-
sented, the samples of the data have three dimensions: the dimensions of the
RGB color space. However, the positions of the pixels are not taken into ac-
count. It would be good to add the criterion of spatial continuity (regions).
Think of the modifications that would be necessary.

5.3 The entropy based EM algorithm and MDL


The EBEM algorithm can use the MDL as model order selection criterion. In
this case, when the clustering algorithm is applied to image segmentation it
has been experimentally observed that the algorithm usually progresses to a
high-order model. This behavior is due to the fact that MDL underweights the
complexity of the models, because it does not consider how many pixels are
described by a given component of the mixture. Some authors increase the 1/2
factor of the complexity term of the MDL formula (see Eq. 5.15). This is done
to give more weight to complexity and thus penalize higher-order models. However,
the chosen factor depends on the application. Consider the idea of reweighting
the complexity term by including not only the number of parameters of the
mixture but also the coding efficiency of each component. This is meant to
take into account both the number of parameters and the amount of pixels
encoded by them.

5.4 Information Theory and clustering


Different uses of information theory in the field of data clustering have been
introduced through this chapter. Enumerate the information theory concepts
applied to Agglomerative Information Bottleneck, Robust Information Clus-
tering and IT Mean Shift. Summarize how these concepts are applied.
5.5 Rate Distortion Theory


Given the samples:

x1 2 4
x2 5.1 0
x3 9.9 1.7
x4 12.3 5.2

and the representatives of two clusters

t1 5 2.5
t2 8.5 3.5

if p(xi) = p(xj) ∀i ≠ j and p(t1|xi) = p(t2|xi) ∀i, estimate the distortion
Ed(X, T) and the model complexity. Think of how to decrease the distortion
without varying the number of clusters. How does this change affect the model
complexity?

5.6 Information Bottleneck


Recall Alg. 8 (Blahut–Arimoto Information Bottleneck clustering). Under which
circumstances is the inequality I(T; Y) > I(X; T) true? And the inequality
I(X; T) > I(T; Y)? Is it possible, in general, to state that either of the two
alternatives is better than the other?

5.7 Blahut–Arimoto Information Bottleneck


Consider the data given in Prob. 5.5 and two classes Y = {y1, y2}
for which p(y1|x1) = p(y2|x2) = 1 and p(y2|x3) = p(y2|x4). Estimate p(y|t)
(see Alg. 8). Apply an iteration of Alg. 8 using β = 0.01. From these results,
calculate ∂I(T; Y) and ∂I(X; T), and explain why these values
are obtained.

5.8 Deterministic annealing


Modify Alg. 8 in order to make use of the deterministic annealing principles.

5.9 Agglomerative Information Bottleneck


Let X = {x1 , x2 , x3 , x4 } be a set of samples that may be labeled as Y =
{y1 , y2 }. Given the p(X, Y = y1 ) distribution:

X p(xi , Y = y1 )
x1 0.9218
x2 0.7382
x3 0.1763
x4 0.4057

apply the first iteration of Alg. 5.16, determining the first two samples that
are joined and the new posteriors.
Fig. 5.24. Example channel: the inputs X ∈ {0, 1} and outputs Y ∈ {0, 1} are
connected by edges labeled with the conditional probabilities p(y|x); input 0 is
always received as 0, whereas input 1 is received as 1 with probability p and
flipped to 0 with probability 1 − p.

5.10 Model order selection in AIB


The AIB algorithm starts with a high number of clusters (a cluster for each
pattern) and proceeds by fusing cluster pairs until there is only one cluster left.
Think about the effectiveness of the model order selection criterion described
in Eq. 5.31. Hint: explore the information plane (Fig. 5.16).
5.11 Channel capacity
Channel capacity is defined in Eq. 5.37. It is defined in terms of mutual information,
which may be written as I(X; Y) = H(Y) − H(Y|X) (see Chapter 4). Calculate
the capacity of the channel in Fig. 5.24, in which the possible inputs (X) and
possible outputs (Y) are related by means of a conditional distribution p(y|x)
represented by the edges. Draw a plot of the channel capacity vs. p and
interpret it.
5.12 RIC and alternative model order selection
The Robust Information Clustering (RIC) algorithm relies on a VC dimension
criterion to estimate the optimal order of the model. Suggest an alternative
IT-based criterion for this algorithm. Think about expressing the structural
risk concept in terms of information theory.
5.13 Mean Shift and Information Theory
The interpretation of the Mean Shift algorithm in terms of information theory
relies on the Rényi quadratic entropy minimization (α = 2). Following a
similar rationale, it could be interesting to extend this framework to other α-
entropies. Explain the impact of such extension into the modified algorithm.
5.14 X-means clustering and Bayesian Information Criterion
The X-means algorithm is an extension of the classic K-means algorithm.
X-means [124] automatically finds the optimal number of clusters by exploit-
ing the Bayesian Information Criterion (BIC) [94, 143]. Such a criterion is de-
fined by considering the log-likelihood in the first term and −K/2 log n as
in MDL (K being the number of parameters and n the number of samples).
Actually, the general MDL converges to BIC when n → ∞. The probability
model assumes one Gaussian distribution per cluster. This assumption is plugged
into the definition of the log-likelihood. X-means proceeds by running K-means
for a given K and deciding whether or not to split each of the K clusters,
according to BIC. This process is performed recursively. Discuss the role of
MDL/BIC in the process and its expected behavior on high-dimensional
data. Think about data sparseness and model complexity issues.
5.15 Clustering ensembles


The consensus based on mixture models has a maximum-likelihood formula-
tion based on a finite mixture model. In this approach, which kinds of as-
sumptions about the data are made? Compare them with the consensus based
on mutual information.

5.9 Key References


• A. Peñalver, F. Escolano, and J.M. Sáez. “EBEM: An Entropy-Based EM
Algorithm for Gaussian Mixture Models”. International Conference on
Pattern Recognition, Hong Kong (China) (2006)
• A. Peñalver, F. Escolano, and J.M. Sáez. “Two Entropy-Based Methods for
Learning Unsupervised Gaussian Mixture Models”. SSPR/SPR – LNCS
(2006)
• M. Figueiredo and A.K. Jain. “Unsupervised Learning of Finite Mixture
Models”. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence 24(3): 381–396 (2002)
• N. Srebro, G. Shakhnarovich, and S. Roweis. “An Investigation of Com-
putational and Informational Limits in Gaussian Mixture Clustering”. In-
ternational Conference on Machine Learning (2006)
• N.Z. Tishby, F. Pereira, and W. Bialek. “The Information Bottleneck
method”. 37th Allerton Conference on Communication, Control and Com-
puting (1999)
• N. Slonim, N. Friedman, and N. Tishby. “Multivariate Information Bot-
tleneck”. Neural Computation 18: 1739–1789 (2006)
• J. Goldberger, S. Gordon, and H. Greenspan. “Unsupervised Image-Set
Clustering Using an Information Theoretic Framework”. IEEE Transac-
tions on Image Processing 15(2): 449–458 (2006)
• N. Slonim and N. Tishby. “Agglomerative Information Bottleneck”. In
Proceeding of Neural Information Processing Systems (1999)
• W. Punch, A. Topchy, and A. Jain. “Clustering Ensembles: Models of
Consensus and Weak Partitions”. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 27(12): 1866–1881 (2005)
6
Feature Selection and Transformation

6.1 Introduction
A fundamental problem in pattern classification is to work with a set of fea-
tures which are appropriate for the classification requirements. The first step
is the feature extraction. In image classification, for example, the feature set
commonly consists of gradients, salient points, SIFT features, etc. High-level
features can also be extracted. For example, the detection of the number of
faces and their positions, the detection of walls or surfaces in a structured envi-
ronment, or text detection are high-level features which also are classification
problems in and of themselves.
Once the set of features has been designed, it is convenient to select the most in-
formative of them. The reason for this is that the feature extraction process
does not necessarily yield the best features for a concrete problem. The original fea-
ture set usually contains more features than it is necessary. Some of them
could be redundant, and some could introduce noise, or be irrelevant. In some
problems the number of features is very high and their dimensionality has to
be reduced in order to make the problem tractable. In other problems feature
selection provides new knowledge about the data classes. For example, in gene
selection [146] a set of genes (features) are sought in order to explain which
genes cause some disease. On the other hand, a properly selected feature set
significantly improves classification performance. However, feature selection is
a challenging task.
There are two major approaches to dimensionality reduction: feature se-
lection and feature transform. Feature selection reduces the feature set by
discarding features. A good introduction to feature selection can be found
in [69]. Feature transform refers to building a new feature space from the
original variables, therefore it is also called feature extraction.
Some well-known feature transform methods are principal component anal-
ysis (PCA), linear discriminant analysis (LDA), and independent component
analysis (ICA), among others. PCA transform relies on the eigenvalues of the
covariance matrix of the data, disregarding the classes. It represents the data

optimally in terms of minimal mean-square error between the transformed rep-


resentation and the original one. Therefore, it is useful for separating noise,
but not for finding those features which are discriminative in a classification
problem. There is an extension called generalized principal component anal-
ysis (GPCA), which is essentially an algebraic geometric approach to clus-
tering. While PCA finds projection directions that maximize the Gaussian
variance, GPCA finds subspaces passing through the data. Another feature
transform method is LDA, which uses the between-class covariance matrix
and the transformed features are discriminative for classification. However, it
produces strictly less features than the number of classes of the data. Also,
the second-order statistics (covariances) are useful for classes which are rep-
resented by well-separated unimodal Gaussian data distributions. The third
feature transform mentioned, ICA, finds a maximally clustered projection of
the data, maximizing the divergence to the Gaussian density function. There-
fore, there is an emphasis on non-Gaussian clusters.
In the feature selection literature, there are three different approaches: fil-
ter methods [69], wrapper methods [24, 96], and online methods [127]. Filter
feature selection does not take into account the properties of the classifier
(it relies on statistical tests over the variables), while wrapper feature selec-
tion tests different feature sets by building the classifier. Finally, online fea-
ture selection incrementally adds or removes new features during the learning
process. There is also a notable difference between the supervised and the
unsupervised feature selection methods. In supervised classification the class
labels of the training samples are present. Contrarily, in the unsupervised
classification problems, the class labels have to be decided by identifying clus-
ters in the training data. This makes the unsupervised feature selection an
ill-posed problem, because the clusters depend on the selected features, while
the features are selected according to the clusters. An additional problem in
clustering is the model order selection, or: which is the optimal number of
clusters?
Feature selection (FS) is a combinatorial computational complexity prob-
lem. FS methods are usually oriented to find suboptimal solutions in a feasible
number of iterations. One problem is how to explore the feature space so that
the search is as straightforward as possible. The other problem is the crite-
rion which evaluates the feature subsets. It is a delicate problem, as it has to
estimate the usefulness of a subset accurately and inexpensively. Most of the
feature selection research is focused on the latter topic, and it is the subject
of study of the following sections as well.

6.2 Wrapper and the Cross Validation Criterion


6.2.1 Wrapper for Classifier Evaluation
Wrapper feature selection consists of selecting features according to the classi-
fication results that these features yield. Therefore, wrapper feature selection
is a classifier-dependent approach. Contrarily, filter feature selection is classi-


fier independent, as it is based on statistical analysis on the input variables
(features), given the classification labels of the samples. In filter feature se-
lection the classifier itself is built and tested once the features are selected.
Wrappers build classifiers each time a feature set has to be evaluated. This
makes them more prone to overfitting than filters. It is also worth mentioning
that wrappers are usually applied as a multivariate technique, which means
that they test whole sets of features.
Let us see a simple example of wrapping for feature selection. For a super-
vised1 classification problem with four features and two classes, suppose we
have the following data set containing nine samples:
Features Class
Sample 1 = ( x11 x12 x13 x14 ), C1
Sample 2 = ( x21 x22 x23 x24 ), C1
Sample 3 = ( x31 x32 x33 x34 ), C1
Sample 4 = ( x41 x42 x43 x44 ), C1
Sample 5 = ( x51 x52 x53 x54 ), C2
Sample 6 = ( x61 x62 x63 x64 ), C2
Sample 7 = ( x71 x72 x73 x74 ), C2
Sample 8 = ( x81 x82 x83 x84 ), C2
Sample 9 = ( x91 x92 x93 x94 ), C2
A wrapper feature selection approach could consist of evaluating different
combinations of features. In the previous table each single feature is repre-
sented by a column: Fj = (x1j , x2j , . . . , x9j ), j ∈ {1, . . . , 4}. The evaluation
of a feature set involves building a classifier with the selected features and
testing it, so we have to divide the data set into two disjoint sets: the train
set for building the classifier and the test set for testing it. For large data sets
a good proportion is 75% for the train set and 25% for the test set. Usually
it is adequate to perform the partition on randomly ordered samples. On the
other hand, for small data sets there is a strategy which consists of taking only
one sample for testing and repeating the process for all the samples, as
explained later in this section.
It is very important to note that, even if we are provided with a separate
test set, we cannot use it for the feature selection process. In other words,
during the wrapper feature selection we use the train set which has to be
divided into subtrain and subtest sets in order to build classifiers and evaluate
them. Once this process is finished, there is a need to test the final results
with a data set which has not been used during the feature selection process.
For example, for the wrapper evaluation of the feature sets (F1 , F2 ) and
(F1 , F3 ), two classifiers have to be built and tested. Let us take as train set

1
In supervised classification, a classifier is built given a set of samples, each one
of them labeled with the class to which it belongs. In this section the term
classification always refers to supervised classification.
the samples {S1 , S3 , S6 , S8 , S9 } and the rest {S2 , S4 , S5 , S7 } as test set. Then,
the classifiers C1 and C2 have to be built with the following data:
C1 : Features Class C2 : Features Class
( x11 x12 ), C1 ( x11 x13 ), C1
( x31 x32 ), C1 ( x31 x33 ), C1
( x61 x62 ), C2 ( x61 x63 ), C2
( x81 x82 ), C2 ( x81 x83 ), C2
( x91 x92 ), C2 ( x91 x93 ), C2
and tested with the following data:
T estC1 : Features T estC2 : Features Output : Class
( x21 x22 ) ( x21 x23 ) C1
( x41 x42 ) ( x41 x43 ) C1
( x51 x52 ) ( x51 x53 ) C2
( x71 x72 ) ( x71 x73 ) C2
Denoted as Output is the set of labels that the classifiers are expected to return
for the selected samples. The accuracy of each classifier is
evaluated based on the similarity between its actual output and the desired
output (here C1, C1, C2, C2). For example, if the classifier C1 returned C1, C2,
C2, C2, while the classifier C2 returned C1, C2, C1, C1, then C1 would be more
accurate. The conclusion would be that the feature set (F1, F2) works better
than (F1, F3). This wrapper example is overly simple: drawing such a conclusion
from just one classification test would be statistically unreliable, and cross vali-
dation techniques have to be applied in order to decide which feature set is
better than another.

6.2.2 Cross Validation


Cross validation (CV), also known as rotation estimation, is a validation tech-
nique used in statistics and, particularly in machine learning, which consists
of partitioning a sample of data in several subsets, and performing statisti-
cal analysis on different combinations of these subsets. In machine learning
and pattern recognition, CV is generally used for estimating the error of a
classifier, given a sample of the data. The two most frequently used CV
methods are 10-fold cross validation and leave-one-out cross validation
(LOOCV).
The 10-fold cross validation (10-fold CV) method consists of dividing the
training set into 10 equally sized partitions and performing 10 classification
experiments for calculating their mean error. For each classification, nine of
the partitions are put together and used for training (building the classifier),
and the other one partition is used for testing. In the next classification,
another partition is designed for testing the classifier built with the rest of the
partitions. Ten classification experiments are performed so that each partition
is used for testing a classifier. Note that the partitioning of the data set is
performed only once. It is also important to have a random sample order
before partitioning, in order to distribute the samples homogeneously among
the partitions.
Leave-one-out cross validation is used in the case of very small training
sets. It is equivalent to K-fold CV with K equal to the total number of samples
in the data set. For example, in the well-known NCI60 microarray data
set there are 60 samples and 14 classes, where some classes are represented
by only two samples. With 10-fold CV there could be cases in which all the
samples representing a class fall in the test set, leaving the class
unrepresented in the training set. Instead, LOOCV performs 60
experiments; in each one, a single sample tests the classifier
built with the remaining 59 samples. The LOOCV error is the mean
of the 60 classification errors. Also, when the data set is so small
that no separate test set is available, the classification results are usually
reported in terms of the CV error over the training set.
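
For completeness, here is a minimal NumPy sketch of K-fold CV with a 1-NN classifier, the combination used in the experiments of this section; setting n_folds equal to the number of samples turns it into LOOCV. It assumes the samples have already been randomly shuffled, and the function names are ours.

import numpy as np

def nn_predict(X_train, y_train, X_test):
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]                  # label of the nearest neighbor

def cv_error(X, y, n_folds=10):
    folds = np.array_split(np.arange(len(X)), n_folds)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        y_pred = nn_predict(X[train_idx], y[train_idx], X[test_idx])
        errors.append(np.mean(y_pred != y[test_idx]))
    return np.mean(errors)                              # mean error over the folds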

6.2.3 Image Classification Example

Having explained the basic concepts of wrapper feature selection and the cross
validation process, let us present an example in the field of computer vision.
The problem is the supervised classification of indoor and outdoor images
that come from two different sequences taken with a camera mounted on a
walking person. Both sequences are obtained along the same path. The first
one contains 721 images and is used as a training set. The second sequence,
containing 470 images, is used as a test set, so it does not take part in the
feature selection process. The images have a 320 × 240 resolution and they
are labeled with six different classes: an office, two corridors, stairs, entrance,
and a tree avenue, as shown in Fig. 6.1. The camera used is a stereo camera,
so range information (depth) is another feature in the data sets.
One of the most important decisions in a classification problem is how to
extract the features from the available data. However, this is not the topic of
this chapter and we will just explain a way to extract global low-level features
from an image. The technique consists of applying a set of basic filters to each
image and taking the histograms of their responses as the features that char-
acterize each image. Then, the feature selection process decides which features
are important for the addressed classification problem. The selected feature
set depends on the classifier, on the data, and on the labels of the training set.
The filters we use in this example are applied to the whole image and they
return a histogram of responses. This means that for each filter we obtain
information about the number of pixels in the image which do not respond to
it, the number of pixels which completely respond to it, and the intermediate
levels of response, depending on the number of bins in the histogram.
There are 18 filters, each one of which has a number of bins in the his-
togram. The features themselves are given by the bins of the histograms. An
example is shown in Fig. 6.2. The filters are the following:
• Nitzberg
• Canny
• Horizontal gradient
Fig. 6.1. A 3D reconstruction of the route followed during the acquisition of the
data set, and examples of each one of the six classes. Image obtained with 6-DOF
SLAM. Figure by F. Escolano, B. Bonev, P. Suau, W. Aguilar, Y. Frauel, J.M. Sáez
and M.A. Cazorla (© 2007 IEEE). See Color Plates.

• Vertical gradient
• Gradient magnitude
• Twelve color filters Hi , 1 ≤ i ≤ 12
• Depth information
Some of them are redundant, for example the magnitude and the gradients.
Others are similar, like Canny and gradient’s magnitude. Finally, some fil-
ters may overlap, for example the color filters. The color filters return the
probability distribution of some definite color H (from the HSB color space).
The feature selection criterion of the wrapper method is based on the
performance of the classifier for a given subset of features, as already ex-
plained. For a correct evaluation of the classification error, the cross valida-
tion method is used. Tenfold CV is suitable for larger data sets, while LOOCV
is useful when the data set is very small. For the following experiments the
classification error reported is calculated using 10-fold cross validation. The
experiments are performed with a K-Nearest Neighbor (K-NN) classifier with
K = 1.
There are different strategies for generating feature combinations. The
only way to ensure that a feature set is optimum is the exhaustive search
Fig. 6.2. Responses of some filters applied to an image. From top-bottom and left-
right: input image, depth, vertical gradient, gradient magnitude, and four color fil-
ters. The rest of the filters are not represented as they yield null output for this
input image. Figure by B. Bonev, F. Escolano and M.A. Cazorla (© 2008 Springer).

among feature combinations. Exhaustive search is condemned to the curse of


dimensionality, which refers to the prohibitive cost of searching
in a high-dimensional space. The complexity of an exhaustive search is

    O(n) = Σ_{i=1}^{n} C(n, i)                                             (6.1)

where C(n, i) denotes the binomial coefficient "n choose i". The fastest way to
select from a large set of features is a greedy strategy. Its computational
complexity is

    O(n) = Σ_{i=1}^{n} i                                                   (6.2)

The algorithm is described in Alg. 11. At the end of each iteration a new
feature is selected and its CV error is stored. The process is also outlined in
Fig. 6.3.
Algorithm 11: Wrapper Feature Selection

Input: M samples, NF features
Initialize
  DATA_{M×NF} ← vectors of all (M) samples
  F_S = ∅
  F = {feature_1, feature_2, ..., feature_NF}
while F ≠ ∅ do
  forall i | feature_i ∈ F do
    D_S = DATA(F_S ∪ {feature_i})
    E_i = 10FoldCrossValid(D_S)
  end
  selected = arg min_i E_i    /* also store E_selected */
  F_S = F_S ∪ {feature_selected}
  F = F − {feature_selected}
end
Output: F_S
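As a concrete illustration, the following is a minimal Python sketch of Alg. 11. It assumes that scikit-learn is available for the 1-NN classifier and the 10-fold cross validation (any other classifier and CV routine would serve equally well), and that the M × NF data matrix and the label vector are already loaded; all names are illustrative.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_forward_selection(data, labels):
    # Greedy forward wrapper selection (Alg. 11): in each iteration, add the
    # feature whose inclusion yields the lowest 10-fold CV error of a 1-NN classifier.
    clf = KNeighborsClassifier(n_neighbors=1)
    remaining = list(range(data.shape[1]))
    selected, cv_errors = [], []
    while remaining:
        errors = {}
        for f in remaining:
            cols = selected + [f]
            accuracy = cross_val_score(clf, data[:, cols], labels, cv=10).mean()
            errors[f] = 1.0 - accuracy
        best = min(errors, key=errors.get)
        selected.append(best)
        remaining.remove(best)
        cv_errors.append(errors[best])   # CV error of each nested feature set
    return selected, cv_errors

The stored errors trace one curve like those of Fig. 6.4, and the feature set finally retained is the nested set with the lowest CV error; the number of classifier evaluations grows as in Eq. 6.2.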

Fig. 6.3. The wrapper feature selection process.

The number of bins used to generate features is an important factor for the
feature selection results. A larger number of bins can (but does not necessarily)
improve the classification (Fig. 6.4). An excessive number of bins overfits the
classifier.
Another important factor is the number of classes. As the number of classes
increases, the classification performance decays, and more features are needed.
In Fig. 6.5 we compare feature selection performance on the same data set,
labeled with different numbers of classes.
Fig. 6.4. Comparison of feature selection using 2 and 12 bin histograms, on the
eight-class indoor experiment. The 10-fold CV error (%) is plotted against the
number of selected features.

Fig. 6.5. Evolution of the 10-fold CV error (%) against the number of selected
features, for different numbers of classes (2, 4, 6, and 8 classes, 4-bin histograms).
Figure by B. Bonev, F. Escolano and M.A. Cazorla (© 2008 Springer).

6.2.4 Experiments

With the wrapper feature selection process presented, we obtain a sequence


of CV classification errors and we can select the feature set with the lowest
error. Then, it is time for testing with the separate test set. The test simply
consists of querying the classifier for the class of each image from the test set,
and comparing this result with the actual class of the image.
The classifier used in this experiment is the K-nearest neighbors (K-NN).
An example with test images and their five nearest neighbors from the training
set is shown in Fig. 6.6. In Fig. 6.7 we have represented the number of the train-
ing image obtained for each one of the test images, ordered in the sequence
of acquisition. Ideally the plot would be a straight diagonal line where the
first and the last test images correspond to the first and the last train images,
Fig. 6.6. The nearest neighbors (1st to 5th NN) of different test images. The training
set from which the neighbors are extracted contains 721 images taken during an
indoor–outdoor walk. The amount of low-level filters selected for building the
classifier is 13, out of 48 in total. Note that the test images of the first column
belong to a different set of images. Figure by B. Bonev, F. Escolano and
M.A. Cazorla (© 2008 Springer).

respectively. This figure illustrates the applicability of this approach to localization
tasks, as well as the performance of feature selection on a large data set.

6.3 Filters Based on Mutual Information


6.3.1 Criteria for Filter Feature Selection

In most feature selection approaches there are two well-differentiated issues:


the search algorithm and the selection criterion. Another important issue is
the stopping criterion used to determine whether an algorithm has achieved a
good maximum in the feature space. This section is centered on some IT-based
feature selection criteria. A different issue is the way that feature combinations
are generated. An exhaustive search among the feature set combinations
would have a combinatorial complexity with respect to the total number of
features. In the following sections we assume the use of a greedy forward
feature selection algorithm, which starts from a small feature set, and adds
one feature in each iteration.

Fig. 6.7. Confusion trajectory for fine localization (P = 10 NN): the nearest
neighbor (from among the train images, Y axis) of each one of the test images
(X axis). In the ideal case the line should be almost straight, as the trajectories
of both train and test sets are similar.

In the presence of thousands of features the wrapper approaches are


unfeasible because the evaluation of large feature sets is computationally
expensive. In filter feature selection the feature subsets are statistically eval-
uated. Univariate filter methods evaluate a single feature. A way to measure
the relevance of a feature for the classification is to evaluate its mutual in-
formation (MI) with the classification labels [43]. This is usually suboptimal
for building predictors [69] due to the possible redundancy among variables.
Peng et al. [125] use not only information about the relevance of a variable
but also an estimation of the redundancy among the selected variables. This
is the min-Redundancy Max-Relevance (mRMR) feature selection criterion.
For their measures they estimate mutual information between pairs of vari-
ables. Another feature selection approach estimates the mutual information
between a whole set of features and the classes for using the infomax crite-
rion. The idea of maximizing the mutual information between the features
and the classes is similar to the example illustrated in Fig. 7.9, where we can
see that in the first plot the classifier is the optimal, as well as the mutual
information between the two dimensions of the data and the classes (black
and white) is maximum. The mutual information is a mathematical measure
which captures the dependencies which provide information about the class
labels, disregarding those dependencies among features which are irrelevant
to the classification.

6.3.2 Mutual Information for Feature Selection

The primary problem of feature selection is the criterion which evaluates a


feature set. It must decide whether a feature subset is suitable for the classi-
fication problem, or not. The optimal criterion for such purpose would be the
Bayesian error rate for the subset of selected features:
$$E(S) = \int_S p(S)\left[1 - \max_i p(c_i\,|\,S)\right] dS \quad (6.3)$$

where S is the vector of selected features and ci ∈ C is a class from all the
possible classes C existing in the data.
The Bayesian error rate is the ultimate criterion for discrimination; how-
ever, it is not useful as a cost, due to the nonlinearity of the max(·) function.
Then, some alternative cost function has to be used. In the literature there are
many bounds on the Bayesian error. An upper bound obtained by Hellman
and Raviv (1970) is
$$E(S) \leq \frac{H(C|S)}{2}$$
This bound is related to mutual information, because mutual information can
be expressed as
I(S; C) = H(C) − H(C|S)

and H(C) is the entropy of the class labels, which does not depend on the feature
subspace S. Therefore, maximizing the mutual information is equivalent to
minimizing the upper bound on the Bayesian error of Eq. 6.3. There is a lower
bound on the Bayesian error as well, obtained by Fano (1961), which is also
related to mutual information.
The relation of mutual information with the Kullback–Leibler (KL) diver-
gence also justifies the use of mutual information for feature selection. The
KL divergence is defined as

$$KL(P\,||\,Q) = \int_x p(x)\,\log\frac{p(x)}{q(x)}\; dx$$

for the continuous case and

$$KL(P\,||\,Q) = \sum_x p(x)\,\log\frac{p(x)}{q(x)}$$

for the discrete case. From the definition of mutual information, and given
that the conditional density can be expressed as p(x|y) = p(x, y)/p(y), we
have that
$$I(X;Y) = \sum_{y\in Y}\sum_{x\in X} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \quad (6.4)$$
$$= \sum_{y\in Y} p(y) \sum_{x\in X} p(x|y)\,\log\frac{p(x|y)}{p(x)} \quad (6.5)$$
$$= \sum_{y\in Y} p(y)\, KL\left(p(x|y)\,||\,p(x)\right) \quad (6.6)$$
$$= E_Y\left( KL\left(p(x|y)\,||\,p(x)\right)\right) \quad (6.7)$$

Then, maximizing mutual information2 is also equivalent to maximizing the


expectation of the KL divergence between the class-conditional densities
P (S|C) and the density of the feature subset P (S). In other words, the density
over all classes has to be as distant as possible from the density of each class
in the feature subset. Mutual information maximization provides a trade-off
between discrimination maximization and redundancy minimization.
There are some practical issues involved in the maximization of the mutual
information between the features and the classes. Nowadays feature selection
problems involve thousands of features, in a continuous feature space. Es-
timating the mutual information between a high-dimensional continuous set of
features and the class labels is not straightforward, due to the entropy esti-
mation. There exist graph-based methods which do not need the density esti-
mation of the data, thus making it possible to estimate the entropy of
high-dimensional data with a feasible computational complexity (see Chapter 5).
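As an illustration of this kind of bypass estimator, the sketch below computes a Rényi entropy estimate from the length of the Euclidean minimum spanning tree of the samples, in the spirit of the entropic spanning graph estimators referred to above. The additive bias constant, which depends only on γ and the dimension and is usually calibrated by Monte Carlo simulation on uniform samples, is omitted here, so the returned value is meaningful only up to that constant; this is a rough sketch rather than the exact estimator used in the experiments cited later.

import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def renyi_entropy_mst(X, alpha=0.5):
    # Renyi entropy estimate (up to an additive constant) of the n x d samples X,
    # based on the total MST length with edges raised to gamma = d * (1 - alpha).
    # Assumes no duplicate points (zero entries of the distance matrix are
    # interpreted by scipy as missing edges).
    n, d = X.shape
    gamma = d * (1.0 - alpha)
    D = distance_matrix(X, X)                      # pairwise Euclidean distances
    mst = minimum_spanning_tree(D)                 # sparse matrix with the n-1 MST edges
    L = np.power(mst.data, gamma).sum()            # gamma-weighted MST length
    return np.log(L / n ** alpha) / (1.0 - alpha)  # + constant depending on gamma and d

# e.g., entropy estimate of 200 ten-dimensional Gaussian samples:
X = np.random.default_rng(0).standard_normal((200, 10))
print(renyi_entropy_mst(X, alpha=0.5))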

6.3.3 Individual Features Evaluation, Dependence


and Redundancy

Some works on feature selection avoid the multidimensional data entropy es-
timation by working with single features. This, of course, is not equivalent to
the maximization of I(S; C). In the approach of Peng et al. [125] the feature
selection criterion takes into account the mutual information of each sepa-
rate feature and the classes, but also subtract the redundance of each separate
feature with the already selected ones. It is explained in the next section.
A simpler approach is to limit the cost function to evaluate only the mutual
information between each selected feature xi ∈ S and the classes C:

$$I(S^*;C) \approx \sum_{x_i \in S^*} I(x_i;C) \quad (6.8)$$

Such a cost is effective in some concrete cases, as Vasconcelos et al. reason
in [166]. The expression of the mutual information of the optimal feature

² Some authors refer to the maximization of the mutual information between the
features and the classes as the infomax criterion.

subset $S^* = \{x_1^*, x_2^*, \ldots, x_N^*\}$ of size $N$ can be decomposed into the
following sum:

$$I(S^*;C) = \sum_{i=1}^{N} I(x_i^*;C) - \sum_{i=2}^{N}\left[I(x_i^*;S^*_{1,i-1}) - I(x_i^*;S^*_{1,i-1}\,|\,C)\right] \quad (6.9)$$

where x∗i is the ith most important feature and S∗1,i−1 = {x∗1 , . . . , x∗i−1 } is the
set of the first i − 1 best features, which have been selected before selecting
x∗i . This expression is obtained by applying the chain rule of mutual informa-
tion. For the mutual information between N variables X1 , . . . , XN , and the
variable Y , the chain rule is

$$I(X_1, X_2, \ldots, X_N; Y) = \sum_{i=1}^{N} I(X_i; Y \,|\, X_{i-1}, X_{i-2}, \ldots, X_1)$$
The property from Eq. 6.9 is helpful for understanding the kind of trade-
off between discriminant power maximization and redundancy minimization
which is achieved by I(S∗ ; C). The first summation measures the individual
discriminant power of each feature belonging to the optimal set. The second
summation penalizes those features x∗i which, together with the already se-
lected ones S∗1,i−1 , are jointly informative about the class label C. This means
that if S∗1,i−1 is already informative about the class label, the informativeness
of the feature x∗i is the kind of redundancy which is penalized. However, those
features which are redundant, but do not inform about the class label, are
not penalized.
Given this property, Vasconcelos et al. [166] focus the feature selection
problem on visual processing with low level features. Several studies report
that there exist universal patterns of dependence between the features of bi-
ologically plausible image transformations. These universal statistical laws
of dependence patterns are independent of the image class. This conjecture
implies that the second summation in Eq. 6.9 would probably be close to
zero, because of the assumption that the redundancies which carry informa-
tion about the class are insignificant. In this case, only the first summation
would be significant for the feature selection process, and the approximation
in Eq. 6.8 would be valid. This is the most relaxed feature selection cost,
in which the discriminant power of each feature is individually measured.
An intermediate strategy was introduced by Vasconcelos et al. They se-
quentially relax the assumption that the dependencies are not informative
about the class. By introducing the concept of l-decomposable feature sets
they divide the feature set into disjoint subsets of size l. The constraint is
that any dependence which is informative about the class label has to be
between the features of the same subset, but not between subsets. If S∗ is
the optimal feature subset of size N and it is l-decomposable into the subsets
T1 , . . . , TN/l , then


$$I(S^*;C) = \sum_{i=1}^{N} I(x_i^*;C) - \sum_{i=2}^{N}\sum_{j=1}^{\lceil (i-1)/l\rceil}\left[I(x_i^*;\tilde{T}_{j,i}) - I(x_i^*;\tilde{T}_{j,i}\,|\,C)\right] \quad (6.10)$$

where $\tilde{T}_{j,i}$ is the subset of $T_j$ containing the features of index smaller than $i$.
This cost function makes possible an intermediate strategy which is not as
relaxed as Eq. 6.8, and not as strict as Eq. 6.9. Gradually increasing the
size of the subsets Tj allows finding the l at which the assumption of
noninformative dependencies between the subsets becomes plausible.
The assumption that the redundancies between features are independent of
the image class is not realistic in many feature selection problems, even in the
visual processing field. In the following section, we analyze some approaches
which do not make the assumption of Eq. 6.8. Instead they take into consid-
eration the interactions between all the features.

6.3.4 The min-Redundancy Max-Relevance Criterion

Peng et al. present in [125] a Filter Feature Selection criterion based on mutual
information estimation. Instead of estimating the mutual information I(S; C)
between a whole set of features and the class labels (also called prototypes),
they estimate it for each one of the selected features separately. On the one
hand they maximize the relevance I(xj ; C) of each individual feature xj ∈ F.
On the other hand they minimize the redundancy between xj and the rest of
selected features xi ∈ S, i = j. This criterion is known as the min-Redundancy
Max-Relevance (mRMR) criterion and its formulation for the selection of the
mth feature is
$$\max_{x_j \in F - S_{m-1}} \left[ I(x_j;C) - \frac{1}{m-1}\sum_{x_i \in S_{m-1}} I(x_j;x_i) \right] \quad (6.11)$$

This criterion can be used by a greedy algorithm, which in each iteration


takes a single feature and decides whether to add it to the selected feature
set, or to discard it. This strategy is called forward feature selection. With
the mRMR criterion each evaluation of a new feature consists of estimating
the mutual information between a feature and the prototypes, as well as the
MI between that feature and each one of the already selected ones (Eq. 6.11).
An interesting property of this criterion is that it is equivalent to first-order
incremental selection using the Max-Dependency (MD) criterion. The MD
criterion, presented in the next section, is the maximization of the mutual
information between all the selected features (together) and the class, I(S, C).
First-order incremental selection consists of starting with an empty feature
set and add, incrementally, a single feature in each subsequent iteration. This

implies that by the time the mth feature xm has to be selected, there already
are m−1 selected features in the set of selected features Sm−1 . By defining the
following measure for the x1 , x2 , . . . , xn scalar variables (i.e., single features):

J(x1 , x2 , . . . , xn )
 
p(x1 , x2 , . . . , xn )
= · · · p(x1 , x2 , . . . , xn ) log dx1 · · · dxn ,
p(x1 )p(x2 ) · · · p(xn )

it can be seen that selecting the mth feature with mRMR first-order incre-
mental search is equivalent to maximizing the mutual information between
Sm and the class C. Equations 6.12 and 6.13 represent the simultaneous max-
imization of their first term and minimization of their second term. We show
the equivalence with mutual information in the following equation (Eq. 6.14):

$$I(S_m;C) = J(S_m, C) - J(S_m) \quad (6.12)$$
$$= J(S_{m-1}, x_m, C) - J(S_{m-1}, x_m) \quad (6.13)$$
$$= J(x_1,\ldots,x_{m-1},x_m,C) - J(x_1,\ldots,x_{m-1},x_m)$$
$$= \int\!\!\cdots\!\!\int p(x_1,\ldots,x_m,C)\,\log\frac{p(x_1,\ldots,x_m,C)}{p(x_1)\cdots p(x_m)\,p(C)}\; dx_1\cdots dx_m\, dC$$
$$\quad - \int\!\!\cdots\!\!\int p(x_1,\ldots,x_m)\,\log\frac{p(x_1,\ldots,x_m)}{p(x_1)\cdots p(x_m)}\; dx_1\cdots dx_m$$
$$= \int\!\!\cdots\!\!\int p(x_1,\ldots,x_m,C)\,\log\left[\frac{p(x_1,\ldots,x_m,C)}{p(x_1)\cdots p(x_m)\,p(C)} \cdot \frac{p(x_1)\cdots p(x_m)}{p(x_1,\ldots,x_m)}\right] dx_1\cdots dx_m\, dC$$
$$= \int\!\!\cdots\!\!\int p(x_1,\ldots,x_m,C)\,\log\frac{p(x_1,\ldots,x_m,C)}{p(x_1,\ldots,x_m)\,p(C)}\; dx_1\cdots dx_m\, dC$$
$$= \int p(S_m,C)\,\log\frac{p(S_m,C)}{p(S_m)\,p(C)}\; dS_m\, dC = I(S_m;C). \quad (6.14)$$

This reasoning can also be expressed in terms of entropies. We can write J(·) as

$$J(x_1, x_2, \ldots, x_n) = H(x_1) + H(x_2) + \cdots + H(x_n) - H(x_1, x_2, \ldots, x_n),$$

therefore

$$J(S_{m-1}, x_m) = J(S_m) = \sum_{x_i \in S_m} H(x_i) - H(S_m)$$

and

$$J(S_{m-1}, x_m, C) = J(S_m, C) = \sum_{x_i \in S_m} H(x_i) + H(C) - H(S_m, C)$$

which, substituted in Eq. 6.13, results in

$$J(S_{m-1}, x_m, C) - J(S_{m-1}, x_m) = \left[\sum_{x_i\in S_m} H(x_i) + H(C) - H(S_m, C)\right] - \left[\sum_{x_i\in S_m} H(x_i) - H(S_m)\right]$$
$$= H(C) - H(S_m, C) + H(S_m) = I(S_m; C)$$

There is a variant of the mRMR criterion. In [130] it is reformulated using


a different representation of redundancy. They propose to use a coefficient of
uncertainty which consists of dividing the MI between two variables xj and
xi by the entropy H(xi), xi ∈ Sm−1:

$$\frac{I(x_j;x_i)}{H(x_i)} = \frac{H(x_i) - H(x_i|x_j)}{H(x_i)} = 1 - \frac{H(x_i|x_j)}{H(x_i)}$$
This is a nonsymmetric definition which quantifies the redundancy with a
value between 0 and 1. The highest value possible for the negative term
H(xi |xj )/H(xi ) is 1, which happens when xi and xj are independent, then
H(xi |xj ) = H(xi ). The lowest value is 0, when both variables are completely
dependent, regardless of their entropies. With this redundancy definition the
mRMR criterion expression 6.11 becomes
$$\max_{x_j\in F - S_{m-1}}\left[I(x_j;C) - \frac{1}{m-1}\sum_{x_i\in S_{m-1}}\frac{I(x_j;x_i)}{H(x_i)}\right] \quad (6.15)$$
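For a concrete picture, here is a minimal sketch of greedy forward selection under both formulations. It assumes discrete (pre-binned) features and relies on scikit-learn's mutual_info_score for the pairwise MI estimates (natural logarithms); with normalized=True it applies the uncertainty-coefficient variant of Eq. 6.15, otherwise the plain mRMR criterion of Eq. 6.11.

import numpy as np
from sklearn.metrics import mutual_info_score

def mrmr_forward(X, C, n_select, normalized=False):
    # Greedy forward mRMR selection (Eq. 6.11), or its uncertainty-coefficient
    # variant (Eq. 6.15) when normalized=True. X: n_samples x n_features array
    # of discrete (binned) features; C: class labels. Assumes no constant
    # features when normalized=True, so that H(x_i) > 0.
    n_feat = X.shape[1]
    relevance = np.array([mutual_info_score(X[:, j], C) for j in range(n_feat)])
    entropy = np.array([mutual_info_score(X[:, j], X[:, j])      # I(x; x) = H(x)
                        for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]        # first feature: maximum relevance
    while len(selected) < min(n_select, n_feat):
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_score(X[:, j], X[:, i])
                                  / (entropy[i] if normalized else 1.0)
                                  for i in selected])
            if relevance[j] - redundancy > best_score:
                best, best_score = j, relevance[j] - redundancy
        selected.append(best)
    return selected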

In the following section we present another mutual-information-based cri-


terion which estimates the MI between the selected features and the class.
However, multidimensional data are involved in the estimation, which requires
some alternative MI estimation method.

6.3.5 The Max-Dependency Criterion

The Max-Dependency (MD) criterion consists of maximizing the mutual in-
formation between the set of selected features S and the class labels C:

$$\max_{S\subseteq F}\; I(S;C) \quad (6.16)$$

Then, the mth feature is selected according to

$$\max_{x_j\in F - S_{m-1}}\; I(S_{m-1}, x_j; C) \quad (6.17)$$

Whilst in mRMR the mutual information is always estimated between


two variables of one dimension, in MD the estimation of I(S; C) is not trivial
because S could consist of a large number of features. In [25] such estimation

is performed with the aid of Entropic Spanning Graphs for entropy estimation
[74], as explained in Chapter 4. This entropy estimation is suitable for data
with a high number of features and a small number of samples, because its
complexity depends on the number ns of samples (O(ns log(ns ))) but not
on the number of dimensions. The MI can be calculated from the entropy
estimation in two different ways, with the conditional entropy and with the
joint entropy:

$$I(S;C) = \sum_{x\in S}\sum_{c\in C} p(x,c)\,\log\frac{p(x,c)}{p(x)\,p(c)} \quad (6.18)$$
$$= H(S) - H(S|C) \quad (6.19)$$
$$= H(S) + H(C) - H(S,C) \quad (6.20)$$
where x is a feature from the set of selected features S and c is a class label
belonging to the set of prototypes C.
Provided that entropy can be estimated for high-dimensional data sets,
different IT-based criteria can be designed, depending on the problem. For
example, the Max-min-Dependency (MmD) criterion (Eq. 6.21), in addition
to the Max-Dependency maximization, also minimizes the mutual information
between the set of discarded features and the classes:
$$\max_{S\subseteq F}\; \left[I(S;C) - I(F - S;C)\right] \quad (6.21)$$

Then, for selecting the mth feature, Eq. 6.22 has to be maximized:

$$\max_{x_j\in F - S_{m-1}}\; \left[I(S_{m-1}\cup\{x_j\};C) - I(F - S_{m-1} - \{x_j\};C)\right] \quad (6.22)$$

The aim of the MmD criterion is to avoid leaving out features which have infor-
mation about the prototypes. In Fig. 6.8 we show the evolution of the criterion
as the number of selected features increases, as well as the relative values of
the terms I(S; C) and I(F − S; C), together with the 10-fold CV and test
errors of the feature sets.

6.3.6 Limitations of the Greedy Search

The term “greedy” refers to the kind of searches in which the decisions cannot
be undone. In many problems, the criterion which guides the search does not
necessarily lead to the optimal solution and usually falls into a local maximum
(minimum). This is the case of forward feature selection. In the previous
sections we presented different feature selection criteria. With the following toy
problem we show an example of incorrect (or undesirable) feature selection.
Suppose we have a categorical data set. The values of categorical variables
are labels and these labels have no order: the comparison of two categorical
values can just tell whether they are the same or different. Note that if the
data are not categorical but they are ordinal, regardless if they are discrete
Fig. 6.8. MD and MmD criteria on image data with 48 features. Figure by B. Bonev,
F. Escolano and M.A. Cazorla (© 2008 Springer).
or continuous, then a histogram has to be built for the estimation of the
distribution. For continuous data, a number of histogram bins have to be cho-
sen necessarily, and for some discrete, but ordinal data, it is also convenient.
For example, the distribution of the variable x = {1, 2, 1,002, 1,003, 100} could
be estimated by a histogram with 1,003 bins (or more) where only five bins
would have a value of 1. This kind of histogram is too sparse. A histogram
with 10 bins offers a more compact representation, though less precise, and
the distribution of x would look like (2/5, 1/5, 0, 0, 0, 0, 0, 0, 0, 2/5). There also are
entropy estimation methods which bypass the estimation of the probability
distribution, as detailed in Chapter 4. These methods, however, are not suit-
able for categorical variables. For simplicity we present an example with cat-
egorical data, where the distribution of a variable x = {A, B, Γ, A, Γ} is
Pr(x = A) = 2/5, Pr(x = B) = 1/5, Pr(x = Γ) = 2/5.
The data set of the toy-example contains five samples defined by three
features, and classified into two classes.
x1 x2 x3 C
A Z Θ C1
B Δ Θ C1
Γ E I C1
A E I C2
Γ Z I C2
The mutual information between each single feature xi , 1 ≤ i ≤ 3 and the
class C is
I(x1 , C) = 0.1185
I(x2 , C) = 0.1185 (6.23)
I(x3 , C) = 0.2911

Therefore, both mRMR and MD criteria would decide to select x3 first. For the
next feature which could be either x1 or x2 , mRMR would have to calculate
the redundancy of x3 with each one of them:

I(x1 , C) − I(x1 , x3 ) = 0.1185 − 0.3958 = −0.2773
I(x2 , C) − I(x2 , x3 ) = 0.1185 − 0.3958 = −0.2773     (6.24)

In this case both values are the same and it does not matter which one to
select. The feature sets obtained by mRMR, in order, would be: {x3 }, {x1 , x3 },
{x1 , x2 , x3 }.
To decide the second feature (x1 or x2 ) with the MD criterion, the mutual
information between each one of them with x3 , and the class C, has to be es-
timated. According to the definition of MI, in this discrete case, the formula is
$$I(x_1,x_3;C) = \sum_{C}\sum_{x_3}\sum_{x_1} p(x_1,x_3,C)\,\log\frac{p(x_1,x_3,C)}{p(x_1,x_3)\,p(C)}$$

The joint probability p(x1 , x3 , C) is calculated with a 3D histogram where the


first dimension has the values of x1 , i.e., A, B, Γ , the second dimension has
the values Θ, I, and the third dimension has the values C1 , C2 . The resulting
MI values are
I(x1 , x3 ; C) = 0.3958
(6.25)
I(x2 , x3 ; C) = 0.3958

Then MD would also select either of them, since first-order forward feature se-
lection with MD and with mRMR are equivalent. However, MD can show us that
selecting x3 in first place was not a good decision, given that the combination
of x1 and x2 has much higher mutual information with the class:

I(x1 , x2 ; C) = 0.6730 (6.26)

Therefore, in this case, we should have used MD with a higher-order forward


feature selection, or another search strategy (like backward feature selection
or some nongreedy search). A second-order forward selection would have first
yielded the set {x1 , x2 }. Note that the mutual information of all the features
and C does not outperform it:

I(x1 , x2 , x3 ; C) = 0.6730, (6.27)

and if two feature sets provide the same information about the class, the
preferred one is that with fewer features: x3 is not informative about the class,
given x1 and x2 .

The MmD criterion would have selected the features in the right order
in this case, because it not only calculates the mutual information about the
selected features, but it also calculates it for the nonselected features. Then, in
the case of selecting x3 and leaving unselected x1 and x2, MmD would prefer not
to leave together an unselected pair of features which jointly inform so much
about the class. However, MmD faces the same problem in other general cases.
Some feature selection criteria could be more suitable for one case or another.
However, there is not a criterion which can avoid the local maxima when used
in a greedy (forward or backward) feature selection. Greedy searches with a
higher-order selection, or algorithms which allow both addition and deletion
of features, can alleviate the local minima problem.

6.3.7 Greedy Backward Search

Even though greedy searches can fall into local maxima, it is possible to
achieve the highest mutual information possible for a feature set, by means of
a greedy backward search. However, the resulting feature set, which provides
this maximum mutual information about the class, is usually suboptimal.
There are two kinds of features which can be discarded: irrelevant features
and redundant features. If a feature is simply irrelevant to the class label, it
can be removed from the feature set and this would have no impact on the
mutual information between the rest of the features and the class. It is easy
to see that removing other features from the set is not conditioned by the
removal of the irrelevant one.
However, when a feature xi is removed due to its redundancy given other
features, it is not so intuitive whether we can continue removing from the remaining
features, as some subset of them made xi redundant. By using the mutual
information chain rule we can easily see the following. We remove a feature xi
from the set Fn with n features, because that feature provides no additional
information about the class, given the rest of the features Fn−1 . Then we
remove another feature xj because, again, it provides no information about
the class given the subset Fn−2 . In this case, the previously removed one, xi ,
will still not be necessary, even after the removal of xj . This process can
continue until it is not possible to remove any feature because otherwise the
mutual information would decrease. Let us illustrate it with the chain rule of
mutual information:

$$I(S;C) = I(x_1,\ldots,x_n;C) = \sum_{i=1}^{n} I(x_i;C\,|\,x_{i-1},x_{i-2},\ldots,x_1) \quad (6.28)$$

With this chain rule the mutual information of a multidimensional variable


and the class is decomposed into a sum of conditional mutual information. For
simplicity, let us see an example with four features:

I(x1 , x2 , x3 , x4 ; C) = I(x1 ; C)
+ I(x2 ; C|x1 )
+ I(x3 ; C|x1 , x2 )
+ I(x4 ; C|x1 , x2 , x3 )

If we decide to remove x4 , it is because it provides no information about C,


given the rest of the features, that is: I(x4 ; C|x1 , x2 , x3 ) = 0. Once removed, it
can be seen that x4 does not appear in any other terms, so, x3 , for example,
could be removed if I(x3 ; C|x1 , x2 ) = 0, without worrying about the previous
removal of x4 .
This backward elimination of features does not usually lead to the min-
imum feature set. In Fig. 6.9 we have illustrated a sample feature selec-
tion problem with four features. The feature x4 can be removed because
I(x4 ; C|x1 , x2 , x3 ) = 0. Actually, this feature is not redundant given other
features, but it is completely redundant, because I(x4 ; C) = 0. The next fea-
ture which could be removed is either x1 , x2 , or x3 because we have that
I(x1 ; C|x2 , x3 ) = 0, I(x2 ; C|x1 , x3 ) = 0, and I(x3 ; C|x1 , x2 ) = 0. In such sit-
uation, greedy searches choose their way randomly. See Fig. 6.10 to understand
that, if x3 is taken, the search falls into a local minimum, because neither x1
nor x2 can be removed, if we do not want to miss any mutual information with
the class. However, if instead of removing x3 , one of the other two features is re-
moved, the final set is {x3 }, which is the smallest possible one for this example.
The artificial data set “Corral” [87] illustrates well the difference between
forward and backward greedy searches with mutual information. In this data

I(X;C)

C
x1

x4 x2 x3

Fig. 6.9. A Venn diagram representation of a simple feature selection problem where
C represents the class information, and X = {x1 , x2 , x3 , x4 } is the complete feature
set. The colored area represents all the mutual information between the features of
X and the class information. The feature x4 does not intersect this area; this means
that it is irrelevant.

Fig. 6.10. In Fig. 6.9, the features {x1 , x2 } (together) do not provide any further
class information than x3 provides by itself, and vice versa: I(x1 , x2 ; C|x3 ) = 0 and
I(x3 ; C|x1 , x2 ) = 0. Both feature sets, {x1 , x2 } (left) and x3 (right), provide the
same information about the class as the full feature set.

set there are six binary features, {x1 , x2 , x3 , x4 , x5 , x6 }. The class label is also
binary and it is the result of the operation:

C = (x1 ∧ x2 ) ∨ (x3 ∧ x4 )

Therefore x1 , x2 , x3 , and x4 fully determine the class label C. The feature x5


is irrelevant, and x6 is a feature highly (75%) correlated with the class label.
Some samples could be the following:
x1 x2 x3 x4 x5 x6 C
0 1 1 0 1 0 0
0 1 1 1 1 1 1
1 0 0 0 0 1 0
1 0 1 1 0 1 1
Most feature selection approaches, and in particular those which perform
a forward greedy search, first select the highly correlated feature, which is an
incorrect decision. Contrarily, when evaluating the mutual information (the
MD criterion) in a backward greedy search, the first features to discard are
the irrelevant and the correlated ones. Then only the four features defining
the class label remain selected.
In practice, the mutual information estimations are not perfect; moreover,
the training set usually does not contain enough information to perfectly
define the distribution of the features. Then, rather than requiring a zero
decrease of the mutual information when discarding features, the objective is
to keep it as high as possible, accepting small decreases.
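The following sketch illustrates this behaviour on sampled "Corral"-like data. Since no sampling recipe is given in the text, the sketch assumes that x1–x5 are independent uniform binary variables and builds x6 so that it agrees with C 75% of the time; the tolerance on the accepted decrease of the plug-in MI estimate is hand-picked.

import numpy as np

def H(A):
    # Plug-in entropy (nats) of the rows of a 2-D discrete array.
    _, counts = np.unique(A, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log(p)).sum()

def MI(X, y):
    y = y.reshape(-1, 1)
    return H(X) + H(y) - H(np.hstack([X, y]))

def backward_md(X, y, tol=0.05):
    # Greedy backward elimination under the MD criterion: repeatedly drop the
    # feature whose removal least decreases I(S; C), while the loss stays below tol.
    selected = list(range(X.shape[1]))
    while len(selected) > 1:
        base = MI(X[:, selected], y)
        loss, worst = min((base - MI(X[:, [g for g in selected if g != f]], y), f)
                          for f in selected)
        if loss > tol:
            break
        selected.remove(worst)
    return selected

rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 2, size=(n, 6))
C = (X[:, 0] & X[:, 1]) | (X[:, 2] & X[:, 3])
X[:, 5] = np.where(rng.random(n) < 0.75, C, 1 - C)     # x6: 75% correlated with C
print([f"x{j + 1}" for j in backward_md(X, C)])        # expected: x1, x2, x3, x4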
Another way of proving that greedy backward feature elimination is the-
oretically feasible is presented in the next section.

6.3.8 Markov Blankets for Feature Selection

Markov blankets provide a theoretical framework for proving that some


features can be successively discarded (in a greedy way) from the feature
set without losing any information about the class. The Markov blanket of a
random variable xi is a minimal set of variables, such that all other variables
conditioned on them are probabilistically independent of the target xi . (In a
Bayesian network, for example, the Markov blanket of a node is represented
by the set of its parents, children, and the other parents of the children.)
Before formally defining a Markov blanket, let us define the concept of
conditional independence. Two variables A and B are conditionally indepen-
dent, given a set of variables C, if P (A|C, B) = P (A|C). From this definition
some properties of conditional independence can be derived. Let us denote
the conditional independence between A and B given C as A ⊥ B | C. The
properties are:

Symmetry: A ⊥ B|C =⇒ B ⊥ A|C


Decomposition: A, D ⊥ B|C =⇒ A ⊥ B|C and D ⊥ B|C
Weak union: A ⊥ B, D|C =⇒ A ⊥ B|C, D (6.29)
Contraction: A ⊥ D|B, C and A ⊥ B|C =⇒ A ⊥ D, B|C
Intersection: A ⊥ B|C, D and A ⊥ D|C, B =⇒ A ⊥ B, D|C,

where the Intersection property is only valid for strictly positive distributions,
that is, distributions that assign nonzero probability to every joint configuration
of the variables.
Markov blankets are defined in terms of conditional independence. The set
of variables (or features) M is a Markov blanket for the variable xi , if xi is
conditionally independent of the rest of the variables F − M − {xi }, given M:

P (F − M − {xi }|M, xi ) = P (F − M − {xi }|M)

or
xi ⊥ F − M − {xi } | M (6.30)
where F is the set of features {x1 , . . . , xN }. Also, if M is a Markov blanket
of xi , then the class C is conditionally independent of the feature given the
Markov blanket: xi ⊥ C | M. Given these definitions, if a feature xi has a
Markov blanket among the set of features F used for classification, then xi
can safely be removed from F without losing any information for predicting
the class.
Once a Markov blanket for xi is found among F = {x1 , . . . , xN } and xi is
discarded, the set of selected (still not discarded) features is S = F − {xi }.
In [97] it is proven that, if some other feature xj has a Markov blanket among
S, and xj is removed, then xi still has a Markov blanket among S − {xj }.
This property of the Markov blankets makes them useful as a criterion for a
greedy feature elimination algorithm. The proof is as follows:
Let Mi ⊆ S be a Markov blanket for xi , not necessarily the same blanket
which was used to discard the feature. Similarly, let Mj ⊆ S be a Markov
6.3 Filters Based on Mutual Information 235

blanket for xj . It can happen that Mi contains xj , so we have to prove that,
after the removal of xj , the set Mi − {xj }, together with the Markov
blanket Mj , is still a Markov blanket for the initially removed xi . Intuitively,
when we remove xj , if it forms part of a Markov blanket of some already
removed feature xi , then the Markov blanket Mj will still provide the
conditional information that xj provided in Mi . By the definition of Markov
blankets in Eq. 6.30, we have to show that, given the blanket Mi ∪ Mj , the
feature xi is conditionally independent of the rest of the features; let us denote
them as X = S − (Mi ∪ Mj ) − {xj }. We have to show that

xi ⊥ X | Mi ∪ Mj (6.31)

In first place, from the assumption about the Markov blanket of xj we have
that
xj ⊥ S − Mj − {xj } | Mj
Using the Decomposition property (Eq. 6.29) we can decompose the set
S − Mj − {xj } and we obtain

xj ⊥ X ∪ Mi | Mj

Using the Weak union property (Eq. 6.29), we can derive from the last
statement:
xj ⊥ X | Mi ∪ Mj (6.32)
For xi we follow the same derivations and we have

xi ⊥ X ∪ (Mj − Mi ) | Mi ∪ {xj }

and, therefore,
xi ⊥ X | Mj ∪ Mi ∪ {xj } (6.33)
From Eqs. 6.32 and 6.33, and using the Contraction property (Eq. 6.29) we
derive that
{xi } ∪ {xj } ⊥ X | Mj ∪ Mi
which, with the Decomposition property (Eq. 6.29), is equivalent to Eq. 6.31;
therefore, it is true that after the removal of xj , the subset Mi ∪ Mj is a
Markov blanket for xi .
In practice it would be very time-consuming to find a Markov blanket for
each feature before discarding it. In [97] they propose a heuristic in which
they fix a size K for the Markov blankets for which the algorithm searches.
The size K depends very much on the nature of the data. If it is too low, it is
not possible to find good Markov blankets. If it is too high, the performance
is also negatively affected. Among other experiments, the authors of [97] also
experiment with the “Corral” data set, already presented in the previous
section. With the appropriate K they successfully achieve the correct feature
selection on it, similar to the result shown in the previous section with the
MD greedy backward elimination.

6.3.9 Applications and Experiments

The applications of filter feature selection are manifold. Usually filter-based


techniques are used in problems where the use of a wrapper is not feasible,
due to computational time or due to the overfitting which a wrapper may
cause with some data sets.
The three filter-based criteria presented are suitable for experiments on
data with many features. Particularly, the MD criterion is capable of compar-
ing very high-dimensional data sets. However, it is not suitable for data sets
with a large number of samples. The microarray data sets are characterized
by a high number of features and a low number of samples, so they are very
appropriate for illustrating the performance of the presented criteria. An ap-
plication of feature selection to microarray data is to identify small sets of
genes with good predictive performance for diagnostic purposes. Traditional
gene selection methods often select genes according to their individual dis-
criminative power. Such approaches are efficient for high-dimensional data
but cannot discover redundancy and basic interactions among genes. Multi-
variate filters for feature selection overcome this limitation as they evaluate
whole sets of features instead of separate features.
Let us see an example with the well-known NCI60 data set. It contains
60 samples (patients), each one containing 6,380 dimensions (features), where
each dimension corresponds to the expression level of some gene. The samples
are labeled with 14 different classes of human tumor diseases. The purpose of
feature selection is to select a set containing those genes which are useful for
predicting the disease.
In Fig. 6.11 we have represented the increase of mutual information
which leads the greedy selection of new features, and the leave-one-out cross

Fig. 6.11. Maximum dependency feature selection performance on the NCI microar-
ray data set with 6,380 features. The LOOCV error (%) and the mutual information
of the selected features are represented against the number of selected features.
Fig. 6.12. Feature selection on the NCI DNA microarray data: gene expression
matrices whose rows are the samples (labeled with their tumor class) and whose
columns are the selected genes. The MD (left) and mRMR (right) criteria were
used. Features (genes) selected by both criteria are marked with an arrow. See
Color Plates.

validation error.3 The error keeps on descending until 39 features are selected,
then it increases, due to the addition of redundant and noisy features. Al-
though there are 6,380 features in total, only feature sets up to size 165 are
represented on the graph.
In Fig. 6.12 we have represented the gene expression matrices of the fea-
tures selected by MD and mRMR. There are only three genes which were
selected by both criteria. This is due to the differences in the mutual infor-
mation estimation and to the high number of different features in contrast to
the small number of samples.
Finally, in Fig. 6.13, we show a comparison of the different criteria in
terms of classification errors. The data set used is extracted from image fea-
tures with a total number of 48 features. Only the errors of the first 20 feature
sets are represented, as for larger feature sets the error does not decrease. Both
the 10-fold CV error and the test error are represented. The latter is calculated
with a separate test set which is not used in the feature selection process, that
is why the test error is higher than the CV error. The CV errors of the feature

3
This error measure is used when the number of samples is so small that a test
set cannot be built. In filter feature selection the LOOCV error is not used as a
selection criterion. It is used after the feature selection process, for evaluating the
classification performance achieved.

Fig. 6.13. Feature selection performance on image histograms data with 48 fea-
tures, 700 train samples, and 500 test samples, labeled with six classes. Comparison
between the Max-Dependency (MD), Max-min-Dependency (MmD) and the min-
Redundancy Max-Relevance (mRMR) criteria. Figure by B. Bonev, F. Escolano and
M.A. Cazorla (© 2008 Springer).

sets yielded by MmD are very similar to those of MD. Regarding mRMR and
MD, in the work of Peng et al. [125], the experimental results given by mRMR
outperformed MD, while in the work of Bonev et al. [25] MD outperformed
mRMR for high-dimensional feature sets. However, mRMR is theoretically
equivalent to first-order incremental MD. The difference in the results is due
to the use of different entropy estimators.

6.4 Minimax Feature Selection for Generative Models


6.4.1 Filters and the Maximum Entropy Principle
The maximum entropy principle has been introduced in this book in
Chapter 3, and it is one of the most widely used IT principles. We re-
mind the reader of the basic idea: when we want to learn a distribution (pdf)
from the data and we have expectation constraints (the expectations of several
statistical models (features) G(·) must match the samples), the most unbiased
(neutral) hypothesis is to take the distribution with maximum entropy which
satisfies the constraints:

$$p^*(\xi) = \arg\max_{p(\xi)} \; -\int p(\xi)\,\log p(\xi)\; d\xi$$
$$\text{s.t.} \quad \int p(\xi)\,G_j(\xi)\; d\xi = E(G_j(\xi)) = \alpha_j, \quad j = 1,\ldots,m$$
$$\int p(\xi)\; d\xi = 1 \quad (6.34)$$

The typical shape of such a pdf is an exponential of a linear
combination of features whose coefficients are the Lagrange multipliers:

$$p^*(\xi) = \frac{1}{Z(\Lambda)}\, e^{\sum_{r=1}^{m}\lambda_r G_r(\xi)} \quad (6.35)$$
Herein we consider the problem of learning a probability distribution of
a given type of texture from image examples of that texture. Given that pdf
we must be able to reproduce or generate images of such textures. Let f (I)
be the true unknown distribution of images I. In pattern recognition such
distribution may represent a set of images corresponding to similar patterns
(similar texture appearance, for instance). As noted by Field [54], f (I) is
highly non-Gaussian. So we have very high-dimensional patterns (images)
belonging to a non-Gaussian distribution. A key point here is the selection
of the features G(·). An interesting approach consists of analyzing the images
by applying a certain set of filters (Gabor filters, Laplacian of Gaussians,
and others like the ones used in Section 6.2) to the images and then take
the histograms of the filtered images as features. Then Gj (I) denotes here the
histogram of the image I after applying the jth filter (e.g. an instance of a
Gabor filter parameterized by a given angle and variance). Such histogram is
quantized into say L bins. The use of the filters for extracting the significant
information contained in the images is an intelligent method of bypassing the
problem of the high dimensionality of the images. However, it is important
to select the most informative filters which is the other side of the coin of the
minimax approach described later in this section. Let $\{I^{obs}_i\}$ with $i = 1,\ldots,N$
be a set of observed images (the training set for learning) and $\{G_j(I)\}$ with
$j = 1,\ldots,m$ a set of histograms, each one derived from the jth filter. Taking
the first-order moment, that is, the average, the statistics of the observations are
given by $\alpha_j = \frac{1}{N}\sum_{i=1}^{N} G_j(I^{obs}_i)$, which are vectors of L dimensions (bins).
These statistics determine the right-hand side of the constraints in Eq. 6.34. The
Lagrange multipliers contained in the vector $\Lambda = (\lambda_1,\ldots,\lambda_m)$ characterize
the log-likelihood of $p(I^{obs};\Lambda) = p^*(I)$:

$$L(\Lambda) = \log p(I^{obs};\Lambda) = \sum_{i=1}^{N} \log p(I^{obs}_i;\Lambda) \quad (6.36)$$

and the log-likelihood has two useful properties, related to its derivatives, for
finding the optimal Λ:
• First derivative (Gradient)
$$\frac{\partial L(\Lambda)}{\partial\lambda_j} = \frac{1}{Z(\Lambda)}\frac{\partial Z(\Lambda)}{\partial\lambda_j} = E(G_j(I)) - \alpha_j \quad \forall j \quad (6.37)$$
• Second derivative (Hessian)
$$\frac{\partial^2 L(\Lambda)}{\partial\lambda_j\,\partial\lambda_k} = E\left((G_j(I)-\alpha_j)(G_k(I)-\alpha_k)^T\right) \quad \forall j,k \quad (6.38)$$

The first property provides an iterative method for obtaining the optimal Λ
through gradient ascent:
$$\frac{d\lambda_j}{dt} = E(G_j(I)) - \alpha_j, \quad j = 1,\ldots,m \quad (6.39)$$
The convergence of the latter iterative method is ensured by the property
associated to the Hessian. It turns out that the Hessian of the log-likelihood
is the covariance matrix of (G1 (I), . . . , Gm (I)), and such a covariance matrix
is definite positive under mild conditions. Definite positiveness of the Hessian
ensures that L(Λ) is concave, and, thus, a unique solution for the optimal
Λ exists. However, the main problem of Eq. 6.39 is that the expectations
E(Gj (I)) are unknown (only the sample expectations αj are known). An
elegant, though computationally intensive, way of estimating E(Gj (I)) is to use
a Markov chain, because Markov chain Monte Carlo methods, like a Gibbs
sampler (see Alg. 13), ensure that in the limit (M → ∞) we approximate the
expectation:

$$E(G_j(I)) \approx \frac{1}{M}\sum_{i=1}^{M} G_j(I^{syn}_i) = \alpha^{syn}_j(\Lambda), \quad j = 1,\ldots,m \quad (6.40)$$

where $I^{syn}_i$ are samples from p(I; Λ), Λ being the current multipliers so far.
Such samples can be obtained through a Gibbs sampler (see Alg. 13, where G
is the number of intensity values) starting from a pure random image. The jth
filter is applied to the ith generated image, yielding the $G_j(I^{syn}_i)$ in the latter
equation. It is interesting to remark here that Λ determines the current,
provisional, solution to the maximum entropy problem p(I; Λ), and, thus, the
synthesized images partially match the statistics of the observed ones; in other
words, the statistics of the observed images are used to generate the synthesized
ones. Therefore, if we have a fixed set of m filters, the synthesizing algorithm
proceeds by computing at iteration t = 1, 2, ...

$$\frac{d\lambda^t_j}{dt} = \Delta^t_j = \alpha^{syn}_j(\Lambda^t) - \alpha_j, \quad j = 1,\ldots,m \quad (6.41)$$
Then we obtain $\lambda^{t+1}_j \leftarrow \lambda^t_j + \Delta^t_j$ and consequently $\Lambda^{t+1}$, and a new
iteration begins. Therefore, as we approximate the expectations of each sub-band
and then integrate all the multipliers in a new $\Lambda^{t+1}$, a new model
$p(I;\Lambda^{t+1})$ is obtained, from which we draw samples with the Markov chains.
As this model matches more and more the statistics of the observations, it is not
surprising that, as the iterations progress, the results resemble more
and more the observations, that is, the images from the target class of textures.
For a fixed number of filters m, Alg. 12 learns a synthesized image from an ob-
served one. Such an algorithm exploits the Gibbs sampler (Alg. 13), attending
to the Markov property (the intensity of a pixel depends on that of its neighbors). After

Algorithm 12: FRAME

Input: I^obs input image (target), m number of filters
Initialize
  Select a group of m filters: S_m = {F_1, F_2, ..., F_m}
  Compute G_j(I^obs) for j = 1,...,m
  Set λ_j ← 0, j = 1,...,m
  Set Λ ← (λ_1,...,λ_m)
  Initialize I^syn as a uniform white noise texture
repeat
  Calculate G_j(I^syn) for j = 1,...,m
  Obtain α^syn_j(Λ) for j = 1,...,m
  Compute Δ_j = α^syn_j(Λ) − α_j for j = 1,...,m
  Update λ_j ← λ_j + Δ_j
  Update p(I;Λ) with the new Λ
  Use a Gibbs sampler to flip I^syn for w sweeps under p(I;Λ)
until (d(G_j(I^obs), G_j(I^syn)) < ε for j = 1,...,m);
Output: I^syn

Algorithm 13: Gibbs sampler

Input: I input image, Λ model
Initialize
  flips ← 0
repeat
  Randomly pick a location x = (x, y) under a uniform distribution
  forall v = 0,...,G − 1 do
    Calculate p(I(x) = v | I(z) : z ∈ N(x)) by evaluating p(I;Λ) at v
  end
  Randomly flip I(x) ← v under the computed conditional probabilities
  flips ← flips + 1
until (flips = w × |I|);
Output: the flipped image I

applying the sampler and obtaining a new image, the conditional probabili-
ties of having each value must be normalized so that they sum to one. This is
key to providing a proper histogram later on. As the FRAME (Filters, Random
Fields and Maximum Entropy) algorithm depends on a Markov chain, it is quite
related to simulated annealing in the sense that it starts with a uniform
distribution (less structure, "hot") and converges to the closest unbiased
distribution (target structure, "cold") satisfying the expectation constraints.
The algorithm converges when the distance between the statistics of the observed
image and those of the synthesized image does not diverge too much (d(·) can be
implemented as the sum of the component-by-component absolute differences of
these vectors).
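As a small illustration of the statistics being matched, the sketch below computes L-bin filter-response histograms G_j(I) and the observed statistics α_j of Eq. 6.34 for a toy bank of three scipy.ndimage filters; the filter choice, the rescaling of the intensities and the response range are illustrative assumptions standing in for the Gabor/LoG bank used in practice.

import numpy as np
from scipy import ndimage

FILTERS = [
    lambda I: ndimage.sobel(I, axis=0),               # horizontal gradient
    lambda I: ndimage.sobel(I, axis=1),               # vertical gradient
    lambda I: ndimage.gaussian_laplace(I, sigma=2.0)  # Laplacian of Gaussian
]

def filter_histograms(image, n_bins=16, value_range=(-4.0, 4.0)):
    # G_j(I): normalized L-bin histograms of the filter responses of one image
    # (intensities rescaled to [0, 1]; the response range is an assumption).
    I = image.astype(float) / 255.0
    feats = []
    for filt in FILTERS:
        hist, _ = np.histogram(filt(I), bins=n_bins, range=value_range)
        feats.append(hist / hist.sum())
    return feats

def observed_statistics(images, n_bins=16):
    # alpha_j = (1/N) sum_i G_j(I_i^obs): the constraint statistics of Eq. 6.34.
    per_image = [filter_histograms(im, n_bins) for im in images]
    return [np.mean([g[j] for g in per_image], axis=0)
            for j in range(len(FILTERS))]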

6.4.2 Filter Pursuit through Minimax Entropy

The FRAME algorithm synthesizes the best possible image given the m fil-
ters used, but: (i) how many filters do you need for having a good result; (ii)
if you constrain the number of filters, which is reasonable for computational
reasons, what are the best filters? For instance, in Fig. 6.14, we show how
the quality estimation improves as we use more and more filters. A perfect

Fig. 6.14. Texture synthesis with FRAME: (a) observed image, (b) initial random
image, (c,d,e,f) synthesized images with one, two, three and six filters. Figure by
S.C. Zhu, Y.N. Wu and D. Mumford (© 1997 MIT Press).

filter selection algorithm should analyze which combinations of the m filters
in a potentially large filter bank $B_m \supseteq S_m$ yield the best synthesized tex-
ture (defining S). But, as we have seen along the present chapter, this is not
feasible, because different textures need different numbers of filters, and m is
unknown beforehand for a given texture type. The usual alternative, which is
the one followed in [182, 183], is an incremental selection of filters (features).
Let $S_s = \{F_{i_1}, F_{i_2}, \ldots, F_{i_s}\}$, where $i_1,\ldots,i_s$ are indexes identifying filters in
$\{1,\ldots,m\}$, be the filters selected at the sth step (obviously $S_0 = \emptyset$). Then,
how to extend $S_s$ to $S_{s+1}$? Let $F_{i_{s+1}}$ be the filter maximizing the divergence
$d(\beta) = D(\alpha^{syn}_\beta(\Lambda_s), \alpha_\beta)$ between the sub-band statistics of the synthe-
sized and the observed image. The arguments of D(·,·) are the marginal dis-
tributions of $p(I;\Lambda_s, S_s)$ and $f(I)$, respectively. Therefore, maximizing D(·,·)
implies maximizing the difference between the latter distributions. Thus, this
is why $F_{i_{s+1}} = \arg\max_{F_\beta\in B/S_s} d(\beta)$. Usually, the $L_p$ norm is used (for instance
with p = 1):

$$F_{i_{s+1}} = \arg\max_{F_\beta\in B/S_s}\; \frac{1}{2}\left|\alpha_\beta - \alpha^{syn}_\beta(\Lambda_s)\right|_p, \quad (6.42)$$

where the operator |·|_p, applied to a vectorial argument, consists of applying it to
each pair of coordinates and summing the results. The filter pursuit procedure
is detailed in Alg. 14.

Algorithm 14: Filter Pursuit

Input: B_m bank of filters, I^obs, I^syn
Initialize
  s = 0, S ← ∅
  p(I) ← uniform distribution
  I^syn ← uniform noise
forall j = 1,...,m do
  Compute I^{obs,j} by applying filter F_j to I^obs
  Compute histogram α_j of I^{obs,j}
end
repeat
  forall F_β ∈ B/S do
    Compute I^{syn,β} by applying filter F_β to I^syn
    Compute histogram α^syn_β of I^{syn,β}
    d(β) = ½ |α_β − α^syn_β|
  end
  Choose F_{i_{s+1}} as the filter maximizing d(β) among those belonging to B/S
  S ← S ∪ {F_{i_{s+1}}}
  s ← s + 1
  Given p(I) and I^syn, run the FRAME algorithm to obtain p*(I) and I^{syn*}
  p(I) ← p*(I) and I^syn ← I^{syn*}
until (d(β) < ε);
Output: S
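The selection step of Alg. 14 reduces to a few lines once the observed and synthesized histograms are available; in the sketch below they are assumed to be lists indexed by filter, as produced in the loops of the algorithm.

import numpy as np

def next_filter(alpha_obs, alpha_syn, selected, p=1):
    # One filter-pursuit step (Eq. 6.42): among the filters not yet selected,
    # return the index of the filter whose synthesized sub-band histogram
    # diverges most from the observed one (L_p sense), together with d(beta).
    best, best_d = None, -np.inf
    for beta in range(len(alpha_obs)):
        if beta in selected:
            continue
        d = 0.5 * np.sum(np.abs(np.asarray(alpha_obs[beta]) -
                                np.asarray(alpha_syn[beta])) ** p)
        if d > best_d:
            best, best_d = beta, d
    return best, best_d

The returned d(β) also provides the stopping test of the pursuit loop (stop when d(β) < ε).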

The rationale of selecting the filter providing more difference (less
redundancy) with respect to the already selected filters is closely connected to
the minimax entropy principle. This principle can be seen as suggesting that
Sm should minimize the Kullback–Leibler divergence between p(I; Λm , Sm )
and f (I) (see Prob. 6.12). Therefore, at each step we should select

$$F_{i_{s+1}} = \arg\max_{F_\beta\in B/S_s}\; H(p(I;\Lambda_s, S_s)) - H(p(I;\Lambda_{s+1}, S_{s+1})) \quad (6.43)$$

Therefore, Alg. 14 finds

$$S_m = \arg\min_{S_m\subset B}\;\max_{p\in\Omega_m}\; H(p(I)) \quad (6.44)$$

where Ωm is the space of probability distributions satisfying the m expectation
constraints imposed by the m filters. In addition, the gain associated with
incorporating a given filter to S is measured by the following decrease of the
Kullback–Leibler divergence:

d(β) = D(f (I)||p(I; Λs , Ss )) − D(f (I)||p(I; Λs+1 , Ss+1 )) (6.45)

and, more precisely,


$$d(\beta) = \frac{1}{2}\left(\alpha_\beta - E_{p(I;\Lambda_m,S_m)}(\alpha_\beta)\right)^T C^{-1}\left(\alpha_\beta - E_{p(I;\Lambda_m,S_m)}(\alpha_\beta)\right) \quad (6.46)$$
C being the covariance matrix of the βth histogram. This is a kind of
Mahalanobis distance.

6.5 From PCA to gPCA


6.5.1 PCA, FastICA, and Infomax

The search for a proper space where to project the data is more related to the
concept of feature transformation than to that of feature selection. How-
ever, both concepts are complementary in the sense that both of them point
towards optimizing/simplifying the classification and/or clustering problems.
It is then not surprising that techniques like PCA (principal component anal-
ysis), originally designed for dimensionality reduction, have been widely used
for face recognition or classification (see the now classic papers of Pentland
et al. [126, 160]). It is well known that in PCA the N vectorized training
images (concatenating rows or columns) xi ∈ R^k are mapped to a new space
(eigenspace) whose origin is the average vector x̂. The differences between
input vectors and the average are di = (xi − x̂) (centered patterns). Let X
be the k × N matrix whose columns are the di . Then, the N eigenvectors
φi of X^T X are the orthonormal axes of the eigenspace, and the meaning of
their respective eigenvalues λ1 ≥ λ2 ≥ · · · ≥ λN is the importance of each

new dimension of the patterns (the variances of a k-dimensional Gaussian
centered on the average). Projecting input patterns on the eigenspace is done
by y = Φ^T(xi − x̂), Φ^T = Φ^{−1} being the transpose of Φ, whose columns are
the eigenvectors (we are only changing the base/space of the data). Deproject-
ing the data without loss of information is straightforward: Φy = (xi − x̂) and
xi = x̂ + Φy. However, it is possible to obtain deprojections with a small loss
of information if we remove from Φ the eigenvectors (columns) correspond-
ing to the less important eigenvalues. The resulting matrix is Φ̃. Then the
approximated projection is ỹ = Φ̃^T(xi − x̂) and has as many dimensions as
eigenvectors retained, say r. The deprojection is x̃i = x̂ + Φ̃ỹ, where the error
||x̃i − xi|| is given by the sum of the magnitudes of the suppressed eigenvalues.
The usual result is that with r ≪ k it is possible to deproject without a human
observer being able to perceive the difference.
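The projection and deprojection just described take only a few lines of numpy. Instead of forming X^T X explicitly, the sketch below uses the SVD of the centered data matrix, whose left singular vectors span the same eigenspace; shapes and names are illustrative.

import numpy as np

def pca_fit(X):
    # X: k x N matrix whose columns are the vectorized training images.
    x_mean = X.mean(axis=1, keepdims=True)
    D = X - x_mean                                          # centered patterns d_i
    Phi, svals, _ = np.linalg.svd(D, full_matrices=False)   # columns of Phi: eigenvectors
    eigvals = (svals ** 2) / X.shape[1]                     # variances along each axis
    return x_mean, Phi, eigvals

def project(x, x_mean, Phi, r):
    return Phi[:, :r].T @ (x - x_mean)                      # approximated projection

def deproject(y, x_mean, Phi, r):
    return x_mean + Phi[:, :r] @ y                          # approximated deprojection

# e.g., 20 random 64-pixel "images", reconstructed from r = 5 components:
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 20))
x_mean, Phi, eigvals = pca_fit(X)
x = X[:, [0]]
print(np.linalg.norm(x - deproject(project(x, x_mean, Phi, 5), x_mean, Phi, 5)))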
The usual way of using PCA for recognition is to project and give pattern
corresponding to the nearest projection as a result. This will be problematic
when we have two classes/distributions of patterns (see Fig. 6.15, left) whose
projections along the main axes are interleaved. This fact has motivated im-
provements of PCA in many directions, ICA (Independent Component Anal-
ysis) being one of them [41]. The underlying rationale of ICA is that relaxing
the orthogonality constraint between axes and optimizing their placement
so that the projections become statistically independent, the performance of
recognition would increase (see Fig. 6.15, center). In other words, given x it is desirable to find a linear transformation W where the dimensions of the transformation y = Wx are as independent as possible. It is when we consider the quantification of statistical independence that information theory comes in naturally: statistical independence implies zero mutual information. That is, given a set of patterns S = {x1 , . . . , xN } where xi ∈ Rk , ICA may be
formulated as finding

W∗ = arg min_{W ∈ Ω} I(y1 , . . . , yk ) : y = Wx, x ∈ S    (6.47)

where Ω is the space of k × k invertible matrices and I is the mutual in-


formation between the transformed variables. However, as multidimensional mutual information cannot be measured for a large k (unless we use a bypass method), initial approximations to the problem assimilated statistical independence to non-Gaussianity (we used a similar trick for defining the


Fig. 6.15. From left to right: PCA axes, ICA axes, and gPCA axes (polynomials).

concept of Gaussian deficiency for mixtures in Chapter 5). Such concept is


derived from the one of neg-entropy: J(y) = H(yGauss ) − H(y), yGauss be-
ing a Gaussian variable with the same covariance matrix as y. Therefore,
as H(·) is the entropy, and the variable with maximum entropy among all
with the same covariance is the Gaussian, we have that J(y) ≥ 0. It is,
thus, interesting to maximize the neg-entropy. However, we have the same
problem: the estimation of a multidimensional entropy. However, if y is a one-
dimensional standardized variable (zero mean and unit variance) we have that
J(y) ≈ [E(G(y)) − E(G(v))]^2, v being a standardized Gaussian variable, G(·)
a non-quadratic function like G1(y) = (1/a) log cosh(ay) or G2(y) = e^{−y²/2},
and 1 ≤ a ≤ 2 a properly fixed constant [79]. Then, as for a linear invertible


transformation y = Wx mutual information may be defined as


I(y1 , . . . , yk ) = Σ_{i=1}^{k} H(yi) − H(y) = Σ_{i=1}^{k} H(yi) − H(x) − log |det W|    (6.48)

However, we may assume that the yi are uncorrelated (if two variables are
independent then they are uncorrelated but the converse is not true in general)
and of unit variance: data are thus white. Given original centered data z, a whitening
transformation, which decorrelates the data, is done by z̃ = ΦD^{−1/2}Φ^T z, Φ being
the matrix of eigenvectors of the covariance matrix of the centered data E(zz^T)
and D = diag{λ1 , . . . , λk }. White data y satisfy E(yy^T) = I (unit variances,
that is, the covariance is the identity matrix). Then

E(yyT ) = WE(xxT )WT = I , (6.49)

and this implies that

1 = det I = det(WE(xx^T)W^T) = (det W) det(E(xx^T)) (det W^T)    (6.50)

which in turn implies that det W must be constant. For whitened yi, entropy and neg-entropy differ by a constant and therefore I(y1 , . . . , yk ) = C − Σ_{i=1}^{k} J(yi), and ICA may proceed by maximizing Σ_{i=1}^{k} J(yi).
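A minimal sketch of whitening and of the one-dimensional neg-entropy approximation J(y) ≈ [E(G(y)) − E(G(v))]² might look as follows; it assumes zero-mean data and a full-rank covariance, and the Monte Carlo size for the Gaussian reference is an arbitrary choice of ours:

```python
import numpy as np

def whiten(Z):
    """ZCA-like whitening of centered data Z (k x N), as described in the text."""
    C = np.cov(Z)                                    # E(z z^T); assumes full-rank covariance
    eigval, Phi = np.linalg.eigh(C)
    W_white = Phi @ np.diag(eigval ** -0.5) @ Phi.T  # Phi D^{-1/2} Phi^T
    return W_white @ Z                               # whitened data: identity covariance

def negentropy_approx(y, a=1.0):
    """J(y) ~ [E(G(y)) - E(G(v))]^2 for standardized y, with G(y) = (1/a) log cosh(a y)."""
    v = np.random.randn(200000)                      # standardized Gaussian reference samples
    G = lambda u: np.log(np.cosh(a * u)) / a
    return (G(y).mean() - G(v).mean()) ** 2
```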
One of the earlier algorithms for ICA is based on the infomax principle [14].
Consider a nonlinear scalar function gi (·) like the transfer function in a neural
network, but satisfying g′i = fi (yi ), fi being the pdfs (the derivatives of the transfer functions coincide with the densities). In these networks there are i = 1, 2, . . . , k neurons, each
one with weight wi . The weights must be chosen to maximize the transferred
information. Thus, the network must maximize H(g1 (wT1 x), . . . , gk (wTk x)).
For a single input, say x, with pdf fx (x) (see Fig. 6.16, top), and a single
neuron with transfer function g(x), the amount of information transferred de-
pends on fy (y ≡ g(x)), and the shape of fy depends on the matching between
the threshold w0 and the mean x̄ and variance of fx and also on the slope of
g(x). The optimal weight is the one maximizing the output information. The
purpose is to find w maximizing I(x, y) = H(y) − H(y|x); as H(y|x) is the

sum of entropies of y given different values of x, it may be assumed to be independent of w. Then, we have

∂I(x, y) ∂H(y)
= (6.51)
∂w ∂w
This is coherent with maximizing the output entropy. We have that
fy(y) = fx(x)/|∂y/∂x|  ⇒  H(y) = −E(ln fy(y)) = E( ln |∂y/∂x| ) − E(ln fx(x))    (6.52)
and a maximization algorithm may focus on the first term (output dependent).
Then
Δw ∝ ∂H/∂w = ∂/∂w ( ln |∂y/∂x| ) = (∂y/∂x)^{−1} ∂/∂w (∂y/∂x)    (6.53)
Then, using as g(·) the usual sigmoid transfer, y = g(x) = 1/(1 + e^{−u}) with u = wx + w0, whose derivative ∂y/∂x is wy(1 − y), it is straightforward to obtain
Δw ∝ 1/w + x(1 − 2y),    Δw0 ∝ 1 − 2y    (6.54)
w
When extending to many units (neurons) we have a weight matrix W to esti-
mate, a bias vector w0 (one bias component per unit), and y = g(Wx + w0 ).
Here, the connection between the multivariate input–output pdfs depends on
the absolute value of the Jacobian J:
fy(y) = fx(x)/|J|,    J = det( ∂yi/∂xj )_{i,j=1,...,k}    (6.55)

Extending to multiple units, we obtain

ΔW ∝ (WT )−1 + (1 − 2y)xT , Δw0 ∝ 1 − 2y (6.56)

which is translated to individual weights wij as follows:

Δwij ∝ cof wij / det W + xj (1 − 2yi)    (6.57)

cof wij being the co-factor of component wij, that is, (−1)^{i+j} times the determinant of the matrix resulting from removing the ith row and jth column of W.
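A hedged sketch of one batch step of the update rules in Eq. 6.56 is given below; the learning rate and the batch-averaging of the anti-Hebbian term (1 − 2y)x^T are our choices, not prescribed by the text:

```python
import numpy as np

def infomax_step(W, w0, X, lr=0.01):
    """One batch update of the Bell-Sejnowski rules of Eq. 6.56.

    W  : k x k unmixing matrix, w0 : k-dimensional bias, X : k x N batch of inputs.
    """
    N = X.shape[1]
    Y = 1.0 / (1.0 + np.exp(-(W @ X + w0[:, None])))      # sigmoidal outputs y = g(Wx + w0)
    dW = np.linalg.inv(W.T) + (1.0 - 2.0 * Y) @ X.T / N   # (W^T)^-1 + (1 - 2y) x^T, averaged
    dw0 = (1.0 - 2.0 * Y).mean(axis=1)                    # 1 - 2y
    return W + lr * dW, w0 + lr * dw0
```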
The latter rules implement an algorithm which maximizes I(x; y) through
maximizing H(y). Considering a two-dimensional case, infomax maximizes H(y1, y2); as H(y1, y2) = H(y1) + H(y2) − I(y1, y2), this is equivalent to minimizing I(y1, y2), which is the purpose of ICA. However, the latter algorithm
does not guarantee the finding of a global minimum unless certain conditions

Fig. 6.16. Infomax principle. Top: nonlinear transfer functions where the threshold
w0 matches the mean of fx (a); selection of the optimal weight wopt attending to the
amount of information transferred (b). Bottom: maximizing the joint entropy does
not always result in minimizing the joint mutual information properly – see details
in text. Figures by A.J. Bell and T.J. Sejnowski (© 1995 MIT Press).

are satisfied. This is exemplified in Fig. 6.16(bottom). In (a) we have as input


two independent variables uniformly distributed, and sigmoidal neurons are
used. If the input pdfs are not well matched to the nonlinearity we obtain a
solution (c) which has more joint entropy than the correct one (b) but it is
clearly worse because it has a larger mutual information (more dependence be-
tween the output variables). This kind of configuration appears only when the
pdfs of the input are sub-Gaussian (negative kurtosis). However, many nat-
ural signals and images are super-Gaussian. Anyway, the transfer functions
may be tuned to avoid the latter problem.
Regarding the meaning of the axes in comparison to PCA (where axes
retain more and more details of the input patterns as they are less important),
in ICA, the axes are more general and represent edge and Gabor filters, and
the coding of the images is sparser [15]. These axes are consistent with Field’s
hypothesis suggesting that cortical neurons with line and edge selectivity form
a sparse (distributed) code of natural images. The infomax results shown in
Fig. 6.17 are consistent with the ones obtained by the sparseness maximization
net presented in [121].
The FastICA algorithm, proposed in [78, 80], produces similar axes as the
ones obtained by infomax. It has also a design for one unit which is extensible
for multiple units. The basic idea for a unit is to find an axis w maximizing

Fig. 6.17. Left: axes derived from PCA, ICA, and other variants. Right: axes derived
from ICA. Figures by A.J. Bell and T.J. Sejnowski (© 1995 MIT Press).

J(wT x) ≈ [E(G(wT x)) − E(G(v))]2 (6.58)

subject to ||w|| = 1. Input data x are assumed to be whitened, and this implies that w^T x has unit variance. Let g1(y) = tanh(ay) and g2(y) = y e^{−y²/2}, with the usual setting a = 1, be the derivatives of the G(·) functions. Then, FastICA
starts with a random w and computes

w+ = E(x g(w^T x)) − E(g′(w^T x)) w    (6.59)

because of the KKT conditions. Next, w+ is normalized: w+ ← w+ /||w+ ||,


and if a fixed point is not reached then Eq. 6.59 is applied again. Convergence
could reach w or −w (same direction). Thus the solution to ICA is up to
sign. Expectations are approximated by sample means. The extension to
more units consists of running the FastICA algorithm for one unit to estimate
weight vectors w1 , . . . , wk and then decorrelate wT1 x, . . . , wTk x after every
iteration. A good solution to this problem is to estimate the first vector with
one-unit FastICA and then exploit the projection of the solution of the second
one onto the first. In general, if we have estimated p < k vectors w1 , . . . , wp ,
then run the one-unit FastICA to obtain wp+1 and after every iteration step
subtract from it the sum of its projections onto the already estimated vectors:

w_{p+1} = w_{p+1} − Σ_{j=1}^{p} (w_{p+1}^T wj) wj    (6.60)

Then, the obtained vector is renormalized: w_{p+1} = w_{p+1} / √(w_{p+1}^T w_{p+1}).
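The one-unit iteration of Eq. 6.59 together with the deflation of Eq. 6.60 can be sketched as follows; whitened input is assumed, and the iteration count, tolerance, and random initialization are arbitrary choices of ours:

```python
import numpy as np

def fastica_deflation(X, n_units, n_iter=200, tol=1e-6):
    """One-unit FastICA with deflation; X is whitened data of shape k x N."""
    k, N = X.shape
    g = np.tanh                                        # g1 with a = 1
    g_prime = lambda u: 1.0 - np.tanh(u) ** 2
    W = np.zeros((n_units, k))
    for p in range(n_units):
        w = np.random.randn(k); w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wx = w @ X
            w_new = (X * g(wx)).mean(axis=1) - g_prime(wx).mean() * w  # Eq. 6.59
            w_new -= W[:p].T @ (W[:p] @ w_new)          # deflation: subtract projections (Eq. 6.60)
            w_new /= np.linalg.norm(w_new)
            if np.abs(np.abs(w_new @ w) - 1.0) < tol:   # fixed point reached (up to sign)
                w = w_new
                break
            w = w_new
        W[p] = w
    return W
```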

6.5.2 Minimax Mutual Information ICA

Infomax and FastICA are well-known ICA algorithms but information


theory has inspired better ones, like the Minimax Mutual Information ICA
Algorithm [52] or Minimax ICA. In this context, the minimax principle stands
for minimizing mutual information within a maximum entropy framework.
The algorithm incorporates in this framework the original ideas of Pierre
Comon [41] related to rotating the axes until finding their optimal placement, the one minimizing mutual information. Then, the starting point is to minimize I(y) = Σ_{i=1}^{k} H(yi) − H(y), with y = (y1 , . . . , yk )^T. Of course, whitening
of input data x is assumed. But, in this case, it is assumed that y is linked
to x through a rotation matrix R. As joint entropy H(y) is invariant under
rotations, the problem is reduced to minimize


J(Θ) = Σ_{i=1}^{k} H(yi),    (6.61)

where Θ are the parameters defining the rotation matrix R. Such parameters
are the k(k − 1)/2 Givens angles θpq of a k × k rotation matrix. The rotation
matrix R^{pq}(θpq) is built by replacing the entries (p, p), (q, q), (p, q), and (q, p) of the identity matrix by cos θpq, cos θpq, − sin θpq, and sin θpq, respectively. Then, R is
computed as the product of all the 2D rotations:


R(Θ) = Π_{p=1}^{k−1} Π_{q=p+1}^{k} R^{pq}(θpq)    (6.62)
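For illustration, R(Θ) of Eq. 6.62 can be assembled from the individual Givens rotations as in the sketch below (0-based indices; the dictionary of angles is a hypothetical data structure of ours):

```python
import numpy as np

def rotation_from_givens(theta, k):
    """Build R(Theta) of Eq. 6.62 as a product of k(k-1)/2 Givens rotations.

    theta : dict mapping (p, q), with p < q, to the angle theta_pq (0-based indices).
    """
    R = np.eye(k)
    for p in range(k - 1):
        for q in range(p + 1, k):
            t = theta[(p, q)]
            G = np.eye(k)
            G[p, p] = np.cos(t); G[q, q] = np.cos(t)   # plane rotation in coordinates (p, q)
            G[p, q] = -np.sin(t); G[q, p] = np.sin(t)
            R = R @ G
    return R
```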

Then, the method proceeds by estimating the optimal θpq minimizing Eq. 6.61.
A gradient descent method over the given angles would proceed by computing

∂J(Θ)/∂θpq = Σ_{i=1}^{k} ∂H(yi)/∂θpq    (6.63)

which implies computing the derivative of the entropy and thus implies en-
tropy estimation. At this point, the maximum entropy principle tells us to
choose the most unbiased distribution (maximum entropy) satisfying the ex-
pectation constraints:
p∗(ξ) = arg max_{p(ξ)} − ∫_{−∞}^{+∞} p(ξ) log p(ξ) dξ
s.t. ∫_{−∞}^{+∞} p(ξ) Gj(ξ) dξ = E(Gj(ξ)) = αj ,  j = 1, . . . , m
     ∫_{−∞}^{+∞} p(ξ) dξ = 1    (6.64)

where the solution has the form


p∗(ξ) = (1/Z(Λ, ξ)) exp( Σ_{r=1}^{m} λr Gr(ξ) )    (6.65)

Paying attention to the constraints we have


αj = ∫_{−∞}^{+∞} p(ξ) Gj(ξ) dξ    (6.66)
 
Integration by parts states that ∫ u dv = uv − ∫ v du. Considering the form of the solution and applying this rule to

u = p(ξ),    dv = Gj(ξ) dξ
du = ( Σ_{r=1}^{m} λr Gr(ξ) ) p(ξ) dξ,    v = Fj(ξ) = ∫ Gj(ξ) dξ    (6.67)

Therefore
  
αj = [ p(ξ)Fj(ξ) ]_{−∞}^{+∞} − ∫_{−∞}^{+∞} Fj(ξ) ( Σ_{r=1}^{m} λr Gr(ξ) ) p(ξ) dξ    (6.68)

When the constraint functions are chosen among the moments of the variables,
the integrals Fj (ξ) do not diverge faster than the exponential decay of the
pdf representing the solution to the maximum entropy problem. Under these
conditions, the first term of the latter equation tends to zero and we have

αj = − Σ_{r=1}^{m} λr ∫_{−∞}^{+∞} Fj(ξ) Gr(ξ) p(ξ) dξ = − Σ_{r=1}^{m} λr E(Fj(ξ)Gr(ξ)) = − Σ_{r=1}^{m} λr βjr    (6.69)

where the βjr may be obtained from the sample means for approximating
E(Fj (ξ)Gr (ξ)). Once we also estimate the αj from the sample, the vector of
Lagrange multipliers Λ = (λ1 , . . . , λm )T is simply obtained as the solution of
a linear system:

Λ = −β −1 α : α = (α1 , . . . , αm )T , β = [βjr ]m×m (6.70)
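Assuming the moment constraints Gr(ξ) = ξ^r of Prob. 6.8, so that Fr(ξ) = ξ^{r+1}/(r+1), the sample-based estimation of the multipliers in Eq. 6.70 could be sketched as follows (a sketch under that assumption, with our own function name):

```python
import numpy as np

def maxent_multipliers(y, m):
    """Estimate the Lagrange multipliers of Eq. 6.70 for one output variable.

    Moment constraints G_r(y) = y^r, r = 1..m, so that F_r(y) = y^(r+1)/(r+1).
    y : (N,) samples of the output variable.
    """
    G = np.stack([y ** r for r in range(1, m + 1)])                  # G_r(y), shape (m, N)
    F = np.stack([y ** (r + 1) / (r + 1) for r in range(1, m + 1)])  # F_r = integral of G_r
    alpha = G.mean(axis=1)                                           # alpha_r = E[G_r]
    beta = (F[:, None, :] * G[None, :, :]).mean(axis=2)              # beta_jr = E[F_j G_r]
    return -np.linalg.solve(beta, alpha)                             # Lambda = -beta^{-1} alpha
```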

In addition to satisfying the standard expectation constraints, the pdf satisfies


the constraints derived from obtaining E(Fj (ξ)Gr (ξ)) from the samples (see
Prob. 6.8). Then, exploiting α and β it is possible to obtain an analytic
expression for Eq. 6.63. Firstly, we have
∂H(ξ)/∂θpq = − ∂/∂θpq ∫_{−∞}^{+∞} p(ξ) log p(ξ) dξ
  = − ∫_{−∞}^{+∞} [ (∂p(ξ)/∂θpq) log p(ξ) + ∂p(ξ)/∂θpq ] dξ
  = − ∫_{−∞}^{+∞} (∂p(ξ)/∂θpq) log p(ξ) dξ − ∫_{−∞}^{+∞} ∂p(ξ)/∂θpq dξ
  = − ∫_{−∞}^{+∞} (∂p(ξ)/∂θpq) ( Z(Λ, ξ) + Σ_{r=1}^{m} λr Gr(ξ) ) dξ − ∫_{−∞}^{+∞} ∂p(ξ)/∂θpq dξ
  = − (1 + Z(Λ, ξ)) ∂/∂θpq ∫_{−∞}^{+∞} p(ξ) dξ − Σ_{r=1}^{m} λr ∫_{−∞}^{+∞} Gr(ξ) ∂p(ξ)/∂θpq dξ
    (the first integral equals 1, so its derivative with respect to θpq vanishes)
  = − Σ_{r=1}^{m} λr ∂/∂θpq ∫_{−∞}^{+∞} p(ξ) Gr(ξ) dξ = − Σ_{r=1}^{m} λr ∂αr/∂θpq    (6.71)

All the derivations between Eqs. 6.64 and 6.71 refer to one generic variable ξ. Let yi = ξ be the random variable associated to the ith dimension of the output (we are solving a maximum entropy problem for each individual output variable). Updating the notation accordingly, we obtain

∂H(yi)/∂θpq = Σ_{r=1}^{m} λ_r^i ∂α_r^i/∂θpq    (6.72)

Therefore, there is a very interesting link between the gradient of the cost function and the expectation constraints related to each output variable. But how to compute ∂α_r^i/∂θpq? We approximate α_r^i by the sample mean, that is, α_r^i = (1/N) Σ_{j=1}^{N} Gr(y_i^j). Then, the latter derivative is expanded using the chain rule:

∂α_r^i/∂θpq = (1/N) Σ_{j=1}^{N} G′r(y_i^j) ∂y_i^j/∂θpq = (1/N) Σ_{j=1}^{N} G′r(y_i^j) ( ∂R/∂θpq )_{i:} x^j    (6.73)

where the subscript i: constrains the derivative to the ith row of the matrix (since y_i^j = R_{i:} x^j). In turn, the derivative of R with respect to θpq is given by
∂R/∂θpq = [ Π_{u=1}^{p−1} Π_{v=u+1}^{k} R^{uv}(θuv) ] [ Π_{v=p+1}^{q−1} R^{pv}(θpv) ] × ∂R^{pq}(θpq)/∂θpq [ Π_{v=q+1}^{k} R^{pv}(θpv) ] [ Π_{u=p+1}^{k−1} Π_{v=u+1}^{k} R^{uv}(θuv) ]    (6.74)

and the final gradient descent rule minimizes the sum of entropies as follows:


Θ_{t+1} = Θ_t − η Σ_{i=1}^{k} ∂H(yi)/∂Θ    (6.75)

and the process stabilizes at R∗ (when spurious minima of entropy are


avoided). Furthermore, a closer analysis of the partial derivatives in Eq. 6.73 reveals that

∂J(Θ)/∂θpq = Σ_{i=1}^{k} ∂H(yi)/∂θpq = Σ_{i=1}^{k} (α^i)^T (β^i)^{−T} ∂α^i/∂θpq    (6.76)

that is, the negative gradient direction depends both on the α^i and the β^i, and thus on the α^i and the multipliers. As the α^i depend on the (typically higher-order) moments used to define the constraints, the update direction depends both on the gradients of the moments and the gradients of
non-Gaussianity: the non-Gaussianity of sub-Gaussian signals is minimized,
whereas the one of super-Gaussian signals is maximized. As we show in
Fig. 6.18 (left), as the entropy (non-Gaussianity) of the input distribution
increases, the same happens with the super-Gaussianity (positive kurtosis)
which is a good property as stated when discussing infomax.

[Figure 6.18 appears here. Left panel: a coefficient plotted against the Generalized Gaussian parameter β. Right panel: SIR (dB) vs. number of samples for Minimax ICA, Jade, Comon MMI, FastICA, and Mermaid.]
Fig. 6.18. Left: increase of super-Gaussianity for different Generalized Gaussian Distributions parameterized by β: β = 1 gives the Laplacian, β = 2 the Gaussian, and β → ∞ the uniform distribution. Right: SIR comparison between different ICA algorithms as the number of samples increases. Figure by D. Erdogmus, K.E. Hild II, Y.N. Rao and J.C. Príncipe (© 2004 MIT Press).

Regarding the performance of minimax ICA, one measure for comparing


ICA algorithms is to compute the so-called signal-to-interference ratio (SIR). Assuming input data resulting from a linear mixing of vectors s with independent components and then whitened with a matrix W, that is, x = WHs, let O = R∗WH be the argument for computing the SIR
SIR(dB) = 10 log10 [ (1/N) Σ_{i=1}^{N} max_q(O_{iq}^2) / ( O_{i:} O_{i:}^T − max_q(O_{iq}^2) ) ]    (6.77)

where, in each row, the source corresponding to the maximum entry is con-
sidered the main signal at that output. Given this measure, it is interesting to
analyze how the SIR performance depends on the number of available sam-
ples independently of the number of constraints. In Fig. 6.18(right) we show
that when generating random mixing matrices, the average SIR of minimax ICA gets better and better as the number of samples increases (such samples are needed to estimate high-order moments), making it the best algorithm (even better than FastICA), and that it shares this position with JADE [35] when the number of samples decreases. This is consistent with the fact that JADE uses fourth-order
cumulants (cumulants are defined by the logarithm of the moment generating
function).
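A direct transcription of Eq. 6.77 (with the mean of the per-output ratios taken inside the logarithm, as written above) is sketched below; the function name is ours:

```python
import numpy as np

def sir_db(O):
    """Signal-to-interference ratio of Eq. 6.77 for the overall matrix O = R* W H."""
    O2 = O ** 2
    signal = O2.max(axis=1)                  # max_q O_iq^2 per output row
    interference = O2.sum(axis=1) - signal   # O_i: O_i:^T - max_q O_iq^2
    return 10.0 * np.log10(np.mean(signal / interference))
```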
The development of ICA algorithms is still in progress. ICA algorithms
are also very useful for solving the blind source separation problem summa-
rized in the cocktail party problem: given several mixed voices, demix them. It turns out that the inverse of W is the mixing matrix [79]. Thus, ICA solves the source separation problem up to the sign. However, regarding pattern classification, recent experimental evidence [167] shows that whitened PCA sometimes compares to ICA: when feature selection is performed in advance, FastICA, whitened PCA, and infomax have similar behaviors. However, when
feature selection is not performed and there is a high number of components,
infomax outperforms the other two methods in recognition rates.
In general, ICA is considered a technique of projection pursuit [10], that
is, a way of finding the optimal projections of the data (here optimal means
the projection which yields the best clarification of the structure of multidi-
mensional data). Projection pursuit is closely related to feature selection and
Linear Discriminant Analysis (LDA) [111]. The main difference with respect to ICA and PCA is that LDA seeks to project the data so as to maximize the discriminant power. In [66], a technique dubbed Adaptive Discriminant Analysis (ADA) proceeds by iteratively selecting the axis yielding the maximum mutual information between the projection, the class label, and the projection on the already selected axes. This is, in general, intractable, as we have seen throughout the chapter. In [66], theoretical results are shown for mixtures of two Gaussians.

6.5.3 Generalized PCA (gPCA) and Effective Dimension


Generalized PCA, or gPCA, is a proper algebraic approach for solving the
chicken-and-egg problem of finding subspaces for the data and finding their

Fig. 6.19. Top: arrangement of three subspaces (one plane V1 and two lines V2 and
V3 ). Bottom: different instances of the problem with increasing difficulty (from left
to right). Figure by Y. Ma, A.Y. Yang, H. Derksen and R. Fossum (© 2008 SIAM).

parametric models [109, 169]. Given a set of vectors in Rk (the ambient space),


suppose that such set can be arranged as a set of n subspaces (clusters)
Z = V1 ∪ · · · ∪ Vn ⊂ Rk . Each of the subspaces may have different dimen-
sions d1 , . . . , dn , and, thus, they may have different bases. Given the two latter
elements, it is then possible to associate points to subspaces (clustering, clas-
sification, segmentation). For instance, in Fig. 6.19, we have an arrangement
of one plane and two lines. Such subspaces are hidden within the data and
the task of determining them may be very hard if outlying samples populate
the space.
As we have considered above, gPCA is an algebraic approach. More pre-
cisely: (i) the number of subspaces is given by the rank of a matrix; (ii) the
subspace basis comes from the derivatives of polynomials; and (iii) cluster-
ing is equivalent to polynomial fitting. Tasting the flavor of gPCA can be
done through a very simple example due to René Vidal for his tutorial in
CVPR’08. It consists of finding clusters in R. Assume we have n groups of
samples: the first one satisfying x ≈ b1 and the second one x ≈ b2 , and
the nth one satisfying x ≈ bn . Therefore, a point x may belong either to
the first group or to the second one, or to the nth one. Here, or means
product: (x − b1 )(x − b2 ) · · · (x − bn ) = 0. Therefore, we have the polyno-
mial: pn (x) = xn + c1 xn−1 + · · · + cn = 0. Suppose that we have N sam-
ples x1 , . . . , xN , then all these points must satisfy the following system of
equations:
Pn c = ( x1^n . . . x1 1
         x2^n . . . x2 1
         ...
         xN^n . . . xN 1 ) c = 0,    c = ( 1 c1 . . . cn )^T    (6.78)

Then, the number of clusters is given by analyzing the rank n = min{i :


rank(Pi ) = i}; the cluster centers are the roots of the polynomial pn (x). Then
the solution to the problem is unique if N > n and has a closed form when n ≤
4. Suppose now that we have n planes, that is, the samples x = (x1 , . . . , xk )T ∈
Rk are divided into n groups where each group fits a plane. A plane, like
V1 , in Fig. 6.19 is defined by b1 x1 + b2 x2 + · · · + bk xk = 0 ≡ bT x = 0.
Therefore, following the or rule, two planes are encoded by a polynomial
p2 (x) = (bT1 x)(bT2 x). An important property of the latter polynomial is that
it can be expressed linearly with respect to the polynomial coefficients. For
instance, if we have planes in R3 :

p2 (x) = (bT1 x)(bT2 x) = (b11 x1 + b21 x2 + b31 x3 )(b12 x1 + b22 x2 + b32 x3 )


= (b11 x1 )(b12 x1 ) + (b11 x1 )(b22 x2 ) + (b11 x1 )(b32 x3 )
+(b21 x2 )(b12 x1 ) + (b21 x2 )(b22 x2 ) + (b21 x2 )(b32 x3 )
+(b31 x3 )(b12 x1 ) + (b31 x3 )(b22 x2 ) + (b31 x3 )(b32 x3 )
= (b11 b12 )x21 + (b11 b22 + b21 b12 )x1 x2 + (b11 b32 + b31 b12 )x1 x3
+(b21 b22 )x22 + (b21 b32 + b31 b22 )x2 x3 + (b31 b32 )x23
= c1 x21 + c2 x1 x2 + c3 x1 x3 + c4 x22 + c5 x2 x3 + c6 x23
= cT νn (x) (6.79)

νn (x) being a vector of monomials of degree n (in the latter case n = 2).
In general, we have M_n^[k] = C(n + k − 1, n) = (n + k − 1)!/(n!(k − 1)!) monomials, each one with a coefficient ch. In the latter example, M_2^[3] = C(4, 2) = 4!/(2! 2!) = 6 monomials and coefficients. Therefore, we have a mechanism to map or embed a point x ∈ Rk to a space of M_n^[k] dimensions where the basis
is given by monomials of degree n and the coordinates are the coefficients
ch for these monomials. Such embedding is known as the Veronese map of
degree n [71]:
νn : R^k → R^{M_n^[k]},  where  νn(x1, x2, . . . , xk)^T = (x1^n, x1^{n−1}x2, . . . , xk^n)^T

Given the Veronese maps or mappings for each element in Z (sample)


x1 , . . . , xN it is then possible to obtain the vector of common coefficients c
as the left null space of the M_n^[k] × N matrix of mappings Ln (embedded data matrix):

c^T Ln = c^T (νn(x1), . . . , νn(xN)) = 0_N    (6.80)
This null space may be determined by performing Singular Value De-
composition (SVD). The SVD factorization of Ln = UΣVT decomposes the
matrix into: U, whose columns form an orthonormal input basis of

Ln (left singular vectors), V, whose columns form the output orthonormal


basis vectors (right singular vectors), and Σ, a diagonal matrix which con-
tains the singular values. Singular values σ are non-negative real numbers.
The singular values of Ln satisfy Ln v = σu and Ln^T u = σv, u and v being unit vectors called, respectively, the left and right singular vectors for σ. The right singular vectors corresponding to zero (or close to zero – vanishing) singular values span the right null space of Ln. The left singular vectors associated to nonzero singular values span the range of Ln. The rank of Ln is the number of its nonzero singular values. Thus, the number of subspaces can be obtained by observing the rank of the embedded data matrix: m = min{i : rank(Li) = M_i^[k] − 1}.
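A sketch of the Veronese embedding and of the left null space computation described above is given below; the monomial ordering and the rank tolerance are arbitrary choices of ours:

```python
import numpy as np
from itertools import combinations_with_replacement

def veronese(x, n):
    """Veronese map of degree n: all monomials of degree n in the entries of x."""
    return np.array([np.prod([x[i] for i in idx])
                     for idx in combinations_with_replacement(range(len(x)), n)])

def embedded_data_matrix(X, n):
    """L_n with one Veronese-mapped sample per column; X is k x N."""
    return np.stack([veronese(X[:, j], n) for j in range(X.shape[1])], axis=1)

def null_space_coeffs(L, tol=1e-8):
    """Left null space of L_n: coefficient vectors c with c^T L_n ~ 0."""
    U, s, Vt = np.linalg.svd(L)
    rank = int((s > tol * s[0]).sum())
    return U[:, rank:]          # one column per fitting polynomial
```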
Let C be the matrix whose columns are the singular vectors associated to
vanishing singular values, that is, C = (c1 , . . . , cm ) forms a basis of the null
space of Ln . Let qj (x) = cTj νn (x), and Q(x) = (q1 (x), . . . , qm (x))T a set of
polynomials. Then Q(x) = CT νn (x) is a matrix of dimension m × k, which
is key to compute the basis of the subspaces. We can express a polynomial in terms of q(x) = Π_{j=1}^{n} (b_j^T x) = (b_1^T x) · · · (b_n^T x), and each vector bj is
orthogonal to the subspace Vj (in the case of a plane this is obvious because
it is the vector defining that plane). For a sample xi ∈ Vi (semi-supervised
learning) we have that bTi xi = 0 by definition.
The derivative of q with respect to x is
Dpn(x) = ∂q(x)/∂x = Σ_j (bj) Π_{l≠j} (b_l^T x)    (6.81)

When evaluated at xi ∈ Vi, the derivative vanishes at all addends j ≠ i because these addends include b_i^T xi = 0 in the factorization. Then, we have

Dpn(xi) = bi Π_{l≠i} (b_l^T xi)    (6.82)

and consequently the normal direction to Vi is given by [168]


Dpn (xi )
bi = (6.83)
||Dpn (xi )||
All the derivatives may be grouped in the Jacobian associated to the col-
lection of polynomials. Let
J(Q(x)) = ( ∂qi/∂xj )_{i=1,...,m; j=1,...,k}    (6.84)

be the m×k Jacobian matrix of the collection of Q. Then, it turns out that the
rows of J (Q(xi )) evaluated at xi span an orthogonal space to Vi . This means
that the right null space of the Jacobian yields a basis Bi = (b1 , . . . , bdi )
of Vi , where di = k − rank(J (Q(xi ))T ) and usually di << k. If we repeat

the latter rationale for each Vi we obtain the basis for all subspaces. Finally,
we assign a sample x to subspace Vi if BTi x = 0 or we choose the subspace
Vi minimizing ||BTi x|| (clustering, segmentation, classification).
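For a single subspace, the basis extraction from the Jacobian and a point-to-subspace assignment can be sketched as follows; note that, for the assignment, we measure how far x is from Vi through its components along the directions orthogonal to Vi (the row space of the Jacobian), which is the geometric content of the rule stated above. The function names and the rank tolerance are ours:

```python
import numpy as np

def subspace_bases(jacobian_at_xi, tol=1e-8):
    """From the m x k Jacobian J(Q(x_i)): basis of V_i (right null space)
    and basis of its orthogonal complement (row space)."""
    _, s, Vt = np.linalg.svd(jacobian_at_xi)
    rank = int((s > tol * (s[0] if s[0] > 0 else 1.0)).sum())
    basis = Vt[rank:].T            # columns span V_i
    normals = Vt[:rank].T          # columns span the space orthogonal to V_i
    return basis, normals

def assign_to_subspace(x, normals_list):
    """Assign x to the subspace whose orthogonal component of x is smallest."""
    return int(np.argmin([np.linalg.norm(Nm.T @ x) for Nm in normals_list]))
```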
Until now we have assumed that we know an example xi belonging to
Vi , and then it is straightforward to obtain the basis of the corresponding
subspace. However, in general, such assumption is unrealistic and we need an
unsupervised learning version of gPCA. In order to do that we exploit the
Sampson distance. Assuming that the polynomials in Q are linearly indepen-
dent, given a point x in the arrangement, that is Q(x) = 0, if we consider the
first-order Taylor expansion of Q(x) at x, the value at x̃ (the closest point to
x) is given by

Q(x̃) = Q(x) + J (Q(x))(x̃ − x) (6.85)

which implies

||x̃ − x||2 ≈ Q(x)T (J (Q(x))J (Q(x))T )† Q(x̃) (6.86)

where (·)† denotes the pseudo-inverse. Then, ||x̃−x||2 (square of the Euclidean
distance) approximates the Sampson distance. Then, we will choose the point lying in, say, the nth subspace, xn, as the one minimizing the latter Sampson
distance but having a non-null Jacobian (points having null derivatives lie at
the intersection of subspaces and, thus, yield noisy estimations of the normals).
This allows us to find the basis Bn . Having this basis for finding the point xn−1
it is useful to exploit the fact that if we have a point xi ∈ Vi , points belonging
to ∪nl=i Vl satisfy ||BTi x|| · · · ||BTn x|| = 0. Therefore, given a point xi ∈ Vi ,
(for instance i = 1) a point xi−1 ∈ Vi−1 can be obtained as

Q(x)T (J (Q(x))J (Q(x))T )† Q(x̃) + δ
xi−1 = arg min (6.87)
x∈S:J (Q(x))=0 ||BTi x|| · · · ||BTn x|| + δ

where δ > 0 is designed to avoid the 0/0 indeterminacy (perfect data).


The above description is the basic ideal approach of gPCA. It is ideal in
the following regard: (i) there are no noisy data which may perturb the re-
sult of the SVD decompositions and, consequently, the estimations of both
the collection of polynomials and the basis; (ii) the number of subspaces
is known beforehand. Therefore, robustness and model-order selection are
key elements in gPCA. Both elements (robustness and model-order selection)
are considered in the Minimum Effective Dimension gPCA algorithm (MED
gPCA) [75] where information theory plays a fundamental role. Throughout this book we have considered several information-theoretic model-order selection criteria and their applications in grouping problems: e.g., MDL for
segmentation/clustering and BIC for clustering. In gPCA we have a mixture
of subspaces and we want to identify both these subspaces and the optimal
number of them. Furthermore, we need to do that robustly. The application
of these criteria to gPCA is not straightforward, mainly due to the fact that
there is not necessarily a consensus between information-theoretic criteria and

algebraic/geometric structure of data [75]. Thus, the MED gPCA algorithm


relies on a new concept known as Effective Dimension (ED) and modifies the
Akaike Information Criterion (AIC) [2] in order to include the ED. This new
criterion yields a useful bridge between information theory and geometry for
the purposes of model-order selection. Let us start by defining ED. Given an
arrangement Z = V1 ∪ · · · ∪ Vn ⊂ Rk of n subspaces, each one of dimension
di < k, and Ni sample points Xi = {x_i^j}_{j=1}^{Ni} belonging to each subspace Vi, the effective dimension of the complete set of sample points S = ∪_{i=1}^{n} Xi, with N = Σ_{i=1}^{n} Ni = |S|, is defined as

ED(S, Z) = (1/N) [ Σ_{i=1}^{n} di(k − di) + Σ_{i=1}^{n} Ni di ]    (6.88)

Thus the ED is the average of two terms. The first one di (k − di ) is the total
number of reals needed to specify a di dimensional space in Rk (the product
of the dimension and its margin with respect to the dimension of ambient
space). This product is known as the dimension (Grassman dimension) of
the Grassmanian manifold of di dimensional subspaces of Rk . On the other
hand, the second term Ni di is the total number of reals needed to code the
Ni samples in Vi . A simple quantitative example may help to understand the
concept of ED. Having three subspaces (two lines and a plane in R3 as shown
in Fig. 6.19) with 10 points lying on each line and 30 ones on the plane,
we have 50 samples. Then, if we consider having three subspaces, the ED is (1/50)(1 × (3 − 1) + 1 × (3 − 1) + 2 × (3 − 2) + 10 × 1 + 10 × 1 + 30 × 2), that is, (1/50)(6 + 80) = 1.72. However, if we decide to consider that the two lines define a plane and consider only two subspaces (planes), the ED is given by (1/50)(2 × (3 − 2) + 2 × (3 − 2) + 20 × 2 + 30 × 2) = (1/50)(4 + 100) = 2.08, which is higher than the previous choice, and, thus, more complex. Therefore, the better choice is having two lines and one plane (less complexity for the given data). We are
having two lines and one plane (less complexity for the given data). We are
minimizing the ED. Actually the Minimum Effective Dimension (MED) may
be defined in the following terms:

MED(S) = min_{V : S⊂V} ED(S, V)    (6.89)

that is, as the minimum ED among the arrangements fitting the samples.
Conceptually, the MED is a kind of information theoretic criterion adapted
to the context of finding the optimal number of elements (subspaces) of an
arrangement, considering that such elements may have different dimensions.
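The ED of Eq. 6.88 is easy to reproduce numerically; the short sketch below (ours) recovers the two values, 1.72 and 2.08, of the two-lines-and-one-plane example:

```python
def effective_dimension(k, dims, counts):
    """ED of Eq. 6.88 for an arrangement of subspaces of R^k.

    dims   : list of subspace dimensions d_i
    counts : list of the numbers of samples N_i lying on each subspace
    """
    N = sum(counts)
    model_cost = sum(d * (k - d) for d in dims)             # Grassmannian dimensions
    coding_cost = sum(n * d for d, n in zip(dims, counts))  # N_i * d_i reals for the samples
    return (model_cost + coding_cost) / N

# The numeric example of the text: two lines and one plane in R^3
print(effective_dimension(3, [1, 1, 2], [10, 10, 30]))   # 1.72
print(effective_dimension(3, [2, 2], [20, 30]))          # 2.08
```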
It is particularly related to the AIC. Given the data S : |S| = N , the model
parameters M, and the free parameters kM for the considered class of model,
the AIC is defined as
AIC = −(2/N) log P(S|M) + 2 kM/N ≈ E(−log P(S|M))    (6.90)
when N → ∞. For instance, for the Gaussian noise model with equal variance
in all dimensions (isotropy) σ 2 , we have that

log P(S|M) = −(1/(2σ²)) Σ_{i=1}^{N} ||xi − x̃i||²    (6.91)

x̃i being the best estimate of xi given the model P (S|M). If we want to adapt
the latter criterion to a context of models with different dimensions, and we
consider the isotropic Gaussian noise model, we obtain that AIC minimizes

AIC(dM) = (1/N) Σ_{i=1}^{N} ||xi − x̃i||² + 2 (dM/N) σ²    (6.92)

selecting the model class M∗ , dM being the dimension of a given model class.
AIC is related to BIC (used in the XMeans algorithm, see Prob. 5.14) in
the sense that in the latter one, one must simply replace the factor 2 in the
second summand by log(N ) which results in a higher amount of penalization
for the same data and model. In the context of gPCA, we are considering
Grassmanian subspaces of Rk with dimension d(k − d) which implies that
AIC minimizes

AIC(d) = (1/N) Σ_{i=1}^{N} ||xi − x̃i||² + 2 (d(k − d)/N) σ²    (6.93)

The AIC is extended in [92] in order to penalize subspaces of higher dimen-


sions for encoding the data. This can be easily done by adding N d to the
Grassmannian dimensions. This is the Geometric AIC (GAIC):

GAIC(d) = (1/N) Σ_{i=1}^{N} ||xi − x̃i||² + 2 ((d(k − d) + N d)/N) σ²    (6.94)

When applied to encoding the samples S with a given arrangement of, say n,
subspaces Z the GAIC is formulated as

GAIC(S, Z) = (1/N) Σ_{i=1}^{N} ||xi − x̃i||² + 2 Σ_{j=1}^{n} ((dj(k − dj) + Nj dj)/N) σ²
           = (1/N) Σ_{i=1}^{N} ||xi − x̃i||² + 2σ² ED(S, Z)    (6.95)

which reveals a formal link between GAIC and the ED (and the MED when
GAIC is minimized). However, there are several limitations that preclude
the direct use of GAIC in the context of gPCA. First, having subspaces of different dimensions makes it almost impossible to choose a prior model. Second, the latter rationale cannot be easily extended to assuming a distribution for the sample data. And third, even when the variance is known, the GAIC minimizes the average residual ||xi − x̃i||², which is not an effective measure against
outliers. The presence of outliers is characterized by a very large residual.

This is a very important practical problem where one can find two extremes in the context of high noise rates: samples are considered either as belonging to a unique subspace or to N one-dimensional spaces, each one defined by a sample. Thus, it seems very convenient in this context to find the optimal trade-off between dealing with noise and obtaining a good model fitness. For instance, in the
SVD version of PCA, the k×N data matrix X = (x1 , x2 , . . . , xN ) is factorized
through SVD X = UΣVT , the columns of U give a basis and the rank
provides the dimension of the subspace. As in the eigenvectors/eigenvalues
formulation of PCA, the singular values left represent the squared error of the
representation (residual error). Representing the ordered singular values vs.
the dimension of the subspace (see Fig. 6.20, left) it is interesting to note the
existence of a knee point after the optimal dimension. The remaining singular
values represent the sum of square errors, and weights w1 , w2 > 0, derived from
the knee points, represent the solution to an optimization problem consisting of minimizing


JPCA(Z) = w1 Σ_{i=1}^{N} ||xi − x̃i||² + w2 dim(Z)    (6.96)

Z being the subspace, x̃i the closest point to xi in that subspace and dim(Z)
the dimension of the subspace. The latter objective function is closely related
to MDL/AIC-like principles: the first term is data likelihood (residual error in the terminology of least squares) and the second one is complexity. In order to translate this idea into the gPCA language it is essential to first redefine the MED properly. In this regard, noisy data imply that the ED is highly dependent on the maximum allowable residual error, known as error tolerance τ. The higher τ, the lower the optimal dimension of the subspace (even zero-dimensional). However, when noisy data arise and we set a τ lower than the horizontal coordinate of the knee point, samples must be considered either as independent subspaces or as a unique one-dimensional subspace. The consequence is

[Figure 6.20 appears here. Left panel: ordered singular values vs. dimension, with a knee point at the optimal dimension k. Right panel: effective dimension vs. error tolerance τ, with a knee point at (τ*, MED*).]

Fig. 6.20. Optimal dimensions. In PCA (left) and in gPCA (right). In both cases
K is the dimension of the ambient space and k is the optimal dimension. Figures by
K. Huang, Y. Ma and R. Vidal (© 2004 IEEE).

that the optimal dimension (MED in gPCA) approaches the dimension of


the ambient space. Thus, a redefinition of the MED must consider the error
tolerance:
MED(X, τ) = min_{Z : ||X−X̃||∞ ≤ τ} ED(X̃, Z)    (6.97)

being ||X − X̃||∞ = Σ_{i=1}^{N} ||xi − x̃i||∞ and ||x − x̃||∞ = max_{x∈X} ||x − x̃||. In
Fig. 6.20 we plot the MED vs. τ . With the new definition of MED we must
optimize functions of the type:


JgPCA(Z) = w1 Σ_{i=1}^{N} ||xi − x̃i||∞ + w2 MED(X, τ)    (6.98)

Adopting a criterion like the one in Eq. 6.93 in gPCA by preserving the ge-
ometric structure of the arrangement can be done by embedding this criterion
properly in the algebraic original formulation of gPCA. As we have explained
above, an arrangement of n subspaces can be described by a set of polynomi-
als of degree n. But when subspaces with different dimensions arise (e.g., lines and planes), polynomials of degree less than n can also fit the data. In the classical example of two lines and one plane (Fig. 6.19) a polynomial of second degree may also fit the data. The intriguing and decisive question is whether this plane may then be partitioned into two lines. If so, we will reduce the ED, and this will happen until no subdivision is possible. In an ideal noise-free setting, as M_n^[k] decides the rank of matrix Ln and thus bounds the number of subspaces, it is interesting to start with n = 1, and then increase n until we find a polynomial of degree n which fits all the data (Ln has lower rank). Then, it is possible to separate all the n subspaces following the gPCA steps described above. Then, we can try to apply the process recursively to each of the n groups until there are no lower dimensional subspaces in each group or there are too many groups. In the real case of noisy data, we must fix n and find the subspaces (one at a time) with a given error tolerance. For a fixed n: (i) find the first subspace and assign to it the points with an error less than τ;
(ii) repeat the process to fit the remaining subspaces to the data points (find
points associated to subspaces and then the orthogonal directions of the sub-
space). However, the value of τ is key to know whether the rank of Ln can be
identified correctly. If we under-estimate this rank, the number of subspaces
is not enough to characterize all the data. If we over-estimate the rank we
cannot identify all the subspaces because all points have been assigned to a
subspace in advance. We can define a range of ranks (between rmin and rmax )
and determine whether the rank is under or over-estimated in this range. If
none of the ranks provides a segmentation in n subspaces within a tolerance
of τ , it is quite clear that we must increase the number of subspaces.
In Alg. 15, α = ⟨·, ·⟩ denotes the subspace angle, which is an estimation
of the amount of dependence between two subspaces (low angle indicates high
dependence and vice versa). Orthogonal subspaces have a π/2 angle. If the

Algorithm 15: Robust-gPCA
Input: X samples matrix, τ tolerance
Initialize n = 1, success = false
repeat
  Ln(X) ← (νn(x1), . . . , νn(xN))^T ∈ R^{M_n^[k] × N}
  rmax = M_n^[k] − 1,  rmin = arg min_i { σi(Ln) / Σ_{j=1}^{i−1} σj(Ln) ≤ 0.02 }
  while (rmin ≤ rmax) and (not success) do
    r = (rmin + rmax)/2
    over-estimated = false, under-estimated = false
    Compute the last M_n^[k] − r eigenvectors {bi} of Ln
    Obtain the polynomials {qj(x) = bj^T νn(x)} : q(x) = Π_{j=1}^{n} (bj^T x)
    Find n samples xj, xk with αjk = ⟨Bj, Bk⟩ > 2τ, where Bz = span{J(Q(xz))}
    Otherwise (no samples satisfying the constraint) over-estimated = true
    if over-estimated then rmax ← r − 1
    else
      Assign each point in X to its closest subspace considering error tolerance τ
        and obtain n groups Xk
      if fail then under-estimated = true end
      if under-estimated then rmin ← r + 1
      else success = true end
    end
  end
  if success then
    forall k = 1 : n do Robust-gPCA(Xk, τ) end
  else n ← n + 1 end
until success or (n ≥ nmax)
Output: MED gPCA

highest angle we can find is too small, this indicates that the rank is over-estimated. The workings of the algorithm on the classic example of two lines and one plane are illustrated in Fig. 6.21. The algorithm starts
with n = 1, enters the while loop and there is no way to find a suitable rank
for that group (always over-estimated). Then, we increase n to n = 2, and
it is possible to find two independent subspaces because we have two groups
(the plane and the two lines). Then, for each group a new recursion level

Fig. 6.21. Right: result of the iterative robust algorithm for performing gPCA recur-
sively while minimizing the effective dimension. Such minimization is derived from
the recursive partition. Left: decrease of ED with τ. Figure by Y. Ma, A.Y. Yang, H. Derksen and R. Fossum (© 2008 SIAM).

Fig. 6.22. Unsupervised segmentation with gPCA. Figure by K. Huang, Y. Ma and


R. Vidal (© 2004 IEEE). See Color Plates.

is started, and for each of them we start by setting n = 1. The recursion


level associated to the plane stops after finding a suitable rank. This implies succeeding, entering a second level of recursion that starts without succeeding and ends with rmin > rmax; then n increases until nmax is reached and this branch of the algorithm stops. On the other hand, the first level of recursion for the two lines increases
until n = 2 where it is possible to find independent subspaces and a third level
of recursion starts. For each line, the rationale for the plane is applicable. As
it is expected, entering new levels of recursion decreases the ED. Finally, in
Fig. 6.22 we show the results of unsupervised segmentation. Each pixel is
associated to a 16 × 16 window. As each pixel has color information (RGB)
we have patterns of 16×16×3 = 768 dimensions for performing PCA. The first
12 eigenvectors are used for gPCA segmentation. Only one level of recursion is
needed in this case, because for each of the groups there is no need to re-partition the group again.

Problems

6.1 Filters and wrappers


There are two major approaches to feature selection: filter and wrapper. Which
one is more prone to overfitting? Does filter feature selection always improve
the classification results?

6.2 Filter based on mutual information – Estimation


of mutual information
The mutual information I(S; C) can be calculated in two different ways, with
the conditional entropy 6.19 and with the joint entropy 6.19. Do you think
that the conditional form is suitable for the feature selection problem? Note
that C is discrete. Would the conditional form be suitable for a problem with
a continuous C?

6.3 Mutual information calculation


When using mutual information as a criterion for feature selection, if the
features are continuous, some estimation technique has to be used. However,
if the data are categorical, the mutual information formula for the discrete
case can be applied. In the following toy example the data are categorical.
How many feature sets are possible? Calculate the mutual information with
the class, for each one of them.
Data set with four samples defined by three features and classified into
two classes:
x1 x2 x3 C
A Δ Θ C1
B Z I C1
A E I C2
Γ Z I C2
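For categorical data such as the toy set above, the discrete mutual information can be computed directly from co-occurrence counts; the following helper (ours, not part of the problem statement) illustrates one way to do it, where a feature subset S is represented as a tuple of symbols per sample:

```python
from collections import Counter
from math import log2

def mutual_information(feature_tuples, labels):
    """I(S; C) for a categorical feature subset S (tuples) and class labels C."""
    N = len(labels)
    p_s = Counter(feature_tuples)
    p_c = Counter(labels)
    p_sc = Counter(zip(feature_tuples, labels))
    return sum((n / N) * log2((n / N) / ((p_s[s] / N) * (p_c[c] / N)))
               for (s, c), n in p_sc.items())
```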

6.4 Markov blankets for feature selection


The Markov blanket criterion for feature selection removes those features for
which a Markov blanket is found among the set of remaining features. What
happens in the case of completely irrelevant features? Which Markov blanket
is found for them?

6.5 Conditional dependence in feature selection


Markov blankets and the MD criteria do not assume independence between
features, and can take into account complex dependence patterns. Remember
that both of them are filters for feature selection, so they are independent
of the classifier. Are they of benefit to the classification rates of any kind
of classifier? A hint: the “Corral” data set (already explained in previous
sections), when used with the Naive Bayes classifier, performs better with all
the features, than with just the first four of them (which completely determine
the class label). Why?

6.6 Mutual information and conditional dependence


Venn diagrams are useful for the intuitive understanding of mutual information.
Draw a Venn diagram for the already described “Corral” data set.

6.7 Filter based on mutual information – complexity


The mRMR and the MD criteria have different complexity issues. The esti-
mation of entropy is an important complexity factor. The following classic
data sets have different dimensionalities and numbers of samples. State your
reasons for using one criterion or another on each one of them. Apart from
complexity, consider some implementation issues, for example, storing the re-
sults of calculations which would be repeated many times.
• NCI data set. Contains 60 samples (patients), each one of them with 6,380
features (genes). The samples are labeled with 14 different classes of human
tumor diseases.
• Census Income, also known as Adult data set. Contains 48,842 samples
(people from the census), described by 14 features. The labels indicate
whether income exceeds $50,000/year based on census data.
The MD criterion can be used both in forward and backward searches. Which
one would you use for each one of the previous data sets? Which data set
would be more feasible for an exhaustive search?

6.8 Satisfying additional constraints in maximum entropy


Prove that, when computing the Lagrange multipliers by exploiting Eq. 6.69,
the resulting maximum entropy pdf satisfies also the extra constraints:

E(Fj(x)Gr(x)) = (1/N) Σ_{i=1}^{N} Fj(xi) Gr(xi)

for j = 1, . . . , m and r = 1, . . . , m, N being the number of samples xi , and


Gr (x) = xr . Obtain the β variables and the Lagrange multipliers for that
case.

6.9 MML for feature selection and Gaussian clusters


In [101], the authors propose a method for selecting features while estimating
clusters through Gaussian mixtures. The basic IT criterion used in such work
is the Minimum Message Length (MML) [173] to measure the saliency of the
data dimensions (features). Such criterion can be formulated in the following
terms:
 
MML(Θ) = − log p(Θ) − log p(D|Θ) + (1/2) log |I(Θ)| + (c/2)(1 + log(1/12))    (6.99)

where Θ are the parameters of the model (means, variances), D are the data (so log p(D|Θ) is the log-likelihood), c the number of free parameters, and I(Θ) is the Fisher information ma-
trix.

Irrelevant features have a near-zero saliency (MML). Consider, for


instance, the paradigmatic case in the positive quadrant of R², with samples defined as (xi, yi), where two different vertically oriented Gaussians have the same covariance matrix Σ with horizontal variance clearly smaller than the vertical one: σx² << σy². The mean in y of the first Gaussian is clearly greater than that of the second (by an amount dy); the same happens in the x-dimension with dx. This means that the Gaussians may be considered independent. Show then, in this case, that both x and y dimensions are relevant. However, when dy tends to 0, and dx is small but not zero, and we replace both
Gaussians by a unique one using PCA, it turns out that only the y dimension
is relevant. Quantify such relevances through MML. Study the change of
relevance as dx grows (always using the PCA approximation for the sake of
simplicity).

6.10 Complexity of the FRAME algorithm


What is the computational complexity of Alg. 12? Set a given image size,
intensity range, sweeps, and filters and give an estimation of its execution
time on a current laptop.

6.11 FRAME and one-dimensional patterns


The method described in Alg. 12 can be used also for synthesizing/reproducing
one-dimensional patterns (like sounds coming from speech or spike trains).
Conjecture what should be the role of filters like edge-detectors (contrast),
averaging filters (smoothing), or simply intensity ones (use directly the his-
togram of the signal intensity). Think about using an adequate number of
filters depending on the number of examples available.

6.12 Minimax entropy and filter pursuit


When describing Alg. 14 we have presented a connection between selecting
the optimal filter and minimizing the Kullback–Leibler divergence between
f (I) and p(I; Λm , Sm ). Prove that such divergence can be expressed in terms
of the differences of the entropies of the latter distributions. If so, as H(f (I))
is fixed, in order to minimize such Kullback–Leibler divergence we only need
to minimize H(p(I; Λm , Sm )). Hint: use the fact that

Ep(I;Λm ,Sm ) (αj ) = Ef (I) (αj ), j = 1, . . . , m

6.13 Kullback–Leibler gradient


In Alg. 14 prove that if we have two alternatives of Lagrange multipliers Λ
and the optimal Λ∗

D(f (I)||p(I; Λ)) = D(f (I)||p(I; Λ∗ )) + D(p(I; Λ∗ )||p(I; Λ))

which means that, given the positiveness of the Kullback–Leibler divergence,


the optimal choice Λ∗ has always the minimal divergence.

6.14 Arrangements, Veronese maps, and gPCA


In gPCA, an arrangement of n subspaces can be expressed by a set of poly-
nomials of degree n. Consider the following configuration arrangement in
R3 (each sample is of the form x = (x1 , x2 , x3 )T : Z = V1 ∪ V2 ), where
V1 = {x : x1 = x2 = 0} (a vertical line passing through the origin) and
V2 = {x : x3 = 0} (the horizontal plane passing through the origin). The
union may be found as following:

V1 ∪ V2 = {x : (x1 = x2 = 0) ∨ (x3 = 0)}


= {x : (x1 = 0 ∨ x3 = 0) ∧ (x2 = 0 ∨ x3 = 0)}
= {x : (x1 x3 = 0) ∧ (x2 x3 = 0)}

and then the union is represented by q1 (x) = (x1 x3 ) and q2 (x) = (x2 x3 ).
Then, the generic Jacobian matrix is given by
J(Q(x)) = ( ∂q1/∂x1  ∂q1/∂x2  ∂q1/∂x3 ; ∂q2/∂x1  ∂q2/∂x2  ∂q2/∂x3 ) = ( x3  0  x1 ; 0  x3  x2 )

where the first row is Dp1(x) and the second is Dp2(x). Then we choose a point z1 = (0 0 1)^T with x3 = 1 for the line, and a point z2 = (1 1 0)^T with x1 = x2 = 1 for the plane. We have

J(Q(z1)) = ( 1 0 0 ; 0 1 0 ),    J(Q(z2)) = ( 0 0 1 ; 0 0 1 )

where it is clear that the null space of the transpose of the Jacobian yields the basis of the line (a unique vector B1 = {(0 0 1)^T} because the Jacobian has rank 2) and of the plane (with rank 1) B2 = {(0 1 0)^T, (−1 0 0)^T}. Check that the
obtained basis vectors are all orthogonal to their corresponding Dp1 (zi ) and
Dp2 (zi ). Given the latter explanations, compute these polynomials from the
[k]
Veronese map. Hint: for n = 2 in R3 (k = 3) we have Mn = 6 monomials and
coefficients. To that end suggest 10 samples for each subspace. Then compute
the Q collection of vectors and then the Jacobian assuming that we have a
known sample per subspace. Show that the results are coherent with the ones
described above. Find also the subspaces associated to each sample (segmen-
tation). Repeat the problem unsupervisedly (using the Sampson distance).
Introduce noise samples and reproduce the estimation and the segmentation
to test the possible change of rank and the deficiencies in the segmentation.

6.15 gPCA and Minimum Effective Dimension


Consider the following configuration in R3 : two orthogonal planes, one fitting
500 samples and the other plane fitting 100, and two lines, each one associated
to 200 samples. Compute the MED and the optimal arrangement.

6.6 Key References

• I. Guyon and A. Elisseeff. “An Introduction to Variable and Feature Se-


lection”. Journal of Machine Learning Research 3:1157–1182 (2003)
• K. Torkkola. “Feature Extraction by Non-Parametric Mutual Information
Maximization”. Journal of Machine Learning Research 3:1415–1438 (2003)
• H. Peng, F. Long, and C. Ding. “Feature Selection Based on Mutual
Information: Criteria of Max-Dependency, Max-Relevance, and Min-
Redundancy”. IEEE Transactions on Pattern Analysis and Machine
Intelligence 27(8):1226–1238 (2005)
• B. Bonev, F. Escolano, and M. Cazorla. “Feature Selection, Mutual In-
formation, and the Classification of High-Dimensional Patterns”. Pattern
Analysis and Applications 1433–7541 (2008)
• A. Vicente, P.O. Hoyer, and A. Hyvärinen. “Equivalence of Some Com-
mon Linear Feature Extraction Techniques for Appearance-Based Object
Recognition Tasks”. IEEE Transactions on Pattern Analysis and Machine
Intelligence 29(5):896–900 (2007)
• N. Vasconcelos and M. Vasconcelos. “Scalable Discriminant Feature Selec-
tion for Image Retrieval and Recognition”. Computer Vision and Pattern
Recognition Conference, Washington, DC (USA) (2004)
• D. Koller and M. Sahami. “Toward Optimal Feature Selection”. ICML-
96: Proceedings of the Thirteenth International Conference on Machine
Learning, pp. 284–292, San Francisco, CA: Morgan Kaufmann, Bari (Italy)
(1996)
• M. Law, M. Figueiredo, and A.K. Jain. “Simultaneous Feature Selection
and Clustering Using a Mixture Model”. IEEE Transactions on Pattern
Analysis and Machine Intelligence 26(9):1154–1166 (2004)
• S.C. Zhu, Y.N. Wu, and D.B. Mumford. “FRAME: Filters, Random field
And Maximum Entropy: Towards a Unified Theory for Texture Modeling”.
International Journal of Computer Vision 27(2):1–20 (1998)
• A. Hyvärinen and E. Oja. “Independent Component Analysis: Algorithms
and Applications”. Neural Networks 13(4–5):411–430 (2000)
• T. Bell and T. Sejnowski. “An Information-Maximization Approach to
Blind Separation and Blind Deconvolution”. Neural Computation 7:1129–
1159 (1995)
• D. Erdogmus, K.E. Hild II, Y.N. Rao, and J.C. Príncipe. “Minimax Mu-
tual Information Approach for Independent Component Analysis”. Neural
Computation 16:1235–1252 (2004)
• Y. Ma, A.Y. Yang, H. Derksen, and R. Fossum. “Estimation of Sub-
space Arrangements with Applications in Modeling and Segmenting Mixed
data”. SIAM Review 50(3):413–458 (2008)
7
Classifier Design

7.1 Introduction
The classic information-theoretic classifier is the decision tree. It is well known
that one of its drawbacks is that it tends to overfit, that is, it yields large
classification errors with test data. This is why the typical optimization of
this kind of classifiers is some type of pruning. In this chapter, however, we
introduce other alternatives. After reminding the basics of the incremental
(local) approach to grow trees, we introduce a global method which is ap-
plicable when a probability model is available. The process is carried out by
a dynamic programming algorithm. Next, we present an algorithmic frame-
work which adapts decision trees for classifying images. Here it is interesting
to note how the tests are built, and the fundamental lesson is that the large
amount of possible tests (even when considering only binary relationships
between parts of the image) recommends avoiding building unique but deep
trees, in favor of a bunch of shallow trees. This is the key point of the chapter: the emergence of ensemble classification methods, complex classifiers built by the aggregation/combination of simpler ones, and the role of IT in their de-
sign. In this regard, the method adapted to images is particularly interesting
because it yields experimental results in the domain of OCRs showing that
tree-averaging is useful. This work inspired the now classical random forests approach, for which we analyze the generalization error, show applications in
different domains like bioinformatics, and present links with Boosting. Follow-
ing the ensemble focus of this chapter, next section introduces two approaches
to improve Boosting. After introducing the Adaboost algorithm we show how
boosting can be driven both by mutual information maximization (infomax)
and by maximizing Jensen–Shannon divergence (JBoost). The main difference
between the two latter IT-boosting approaches lies in feature selection (lin-
ear or nonlinear features). Then, we introduce to the reader the world of
maximum entropy classifiers, where we present the basic iterative-scaling al-
gorithm for finding the Lagrange multipliers. Such algorithm is quite different
from the one described in Chapter 3, where we estimate the multipliers in

a different way (coupled with a generative model). We also establish connec-


tions among iterative scaling, the family of exponential distributions and in-
formation projection. Finally we cannot close this chapter and the book itself
without paying special attention to the extension of information projection
(Bregman divergences), which has recently inspired new methods for building
linear classifiers.

7.2 Model-Based Decision Trees

7.2.1 Reviewing Information Gain

Let X = {x} be the training set (example patterns) and Y = {y} the class
labels which are assigned to each element of X in a supervised manner. As it
is well known, a classification tree, like CART [31] or C4.5 [132], is a model
extracted from the latter associations which tries to correctly predict the
most likely class for unseen patterns. In order to build such model, one must
consider that each pattern x ∈ X is a feature vector x = (x1 , x2 , . . . , xN ), and
also that each feature xt has associated a test I(xi > ci ) ∈ {0, 1}, whose value
depends on whether the value xi is above (1) or below (0) a given threshold ci .
Thus, a tree T (binary when assuming the latter type of test) consists of a set
of internal nodes Ṫ , each one associated to a test, and a set of terminal nodes
(leaves) ∂T , each one associated to a class label. The outcome hT (x) ∈ Y
relies on the sequence of outcomes of the tests/questions followed for reaching
a terminal.
Building T implies establishing a correspondence π(t) = i, with i ∈
{1, 2, . . . , N }, between each t ∈ Ṫ and each test Xt = I(xi > ci ). Such
correspondence induces a partial order between the tests (nodes) in the tree
(what tests should be performed first and what should be performed later),
and the first challenging task here is how to choose the proper order. Let Y a
random variable encoding the true class of the examples and defined over Y.
The uncertainty about such true class is encoded, as usual, by the entropy

H(Y ) = P (Y = y) log2 P (Y = y) (7.1)
y∈Y

P (Y = y) being the proportion of examples in X labeled as y. In the classical


forward greedy method the test associated to the root of the tree, that is,
π(t) = i, with t = 1, is the one gaining more information about Y , that is,
the one maximizing
H(Y ) − Ht (Y |Xt ) (7.2)
being a sort of conditional entropy Ht (Y |Xt ) based on the possible outcomes
of the test:

Ht (Y |Xt ) ≡ P (Xt = 0)Ht0 (Y ) + P (Xt = 1)Ht1 (Y ) (7.3)


7.2 Model-Based Decision Trees 273

where P (Xt = 1) = |X t|
|X | is the fraction of examples in X satisfying the test
Xt and P (Xt = 0) = 1 − P (Xt = 1). Moreover, Ht0 and Ht1 are the entropies
associated to the two descents of t:

Ht0 (Y ) ≡ H(Y |Xt = 0) = − P (Y = y|Xt = 0) log2 P (Y = y|Xt = 0)
y∈Y

Ht1 (Y ) ≡ H(Y |Xt = 1) = − P (Y = y|Xt = 1) log2 P (Y = y|Xt = 1)
y∈Y

where P (Y = y|Xt = 1) is the fraction of Xt of class y, and P (Y = y|Xt = 0)


is the fraction of X ∼ Xt of class y. Consequently, Ht (Y |Xt ) is consistent with
the definition of conditional entropy:

Ht (Y |Xt ) = P (Xt = 0)H(Y |Xt = 0) + P (Xt = 1)H(Y |Xt = 1)



= P (Xt = k)H(Y |Xt = k)
k∈{0,1}

Once i = π(t) is selected for t = 1 it proceeds to select recursively π(t0 ) and


π(t1 ). Let us assume, for simplicity, that X ∼ Xt is so homogeneous that
Ht0 (Y ) <  and  ≈ 0. This means that, after renaming t0 = l, we have
l ∈ ∂T , that is, a leaf, and all examples in X ∼ Xt will be labeled with the
most frequent class, say yl . On the other hand, if Ht1 (Y ) > , after renaming
t = t1 , we should select a feature j = π(t) maximizing

Ht (Y ) − Ht (Y |X1 = 1, Xt ) = H(Y |X1 = 1) − Ht (Y |X1 = 1, Xt ) . (7.4)



Ht (Y |X1 = 1, Xt ) = P (X1 = 1, Xt = k)H(Y |X1 = 1, Xt = k)
k∈{0,1}
1
where, for instance, P (X1 = 1, Xt = 1) = |X1|X |Xt | is the fraction of examples
satisfying tests X1 and Xt . In addition, if we denote by Qt the set of outcomes
(test results) preceding Xt , the feature j = π(t) should minimize

Ht (Y ) − Ht (Y |Qt , Xt ) = H(Y |Qt ) − Ht (Y |Qt , Xt ) (7.5)

Such a process should be applied recursively until it is not possible to refine


any node, that is, until all leaves are reached. As the nature of the process is
greedy, it is very probable to obtain a suboptimal tree in terms of classification
performance.

7.2.2 The Global Criterion

Alternatively to the greedy method, consider that1 the path 1 for 1 reaching
a leaf l ∈ ∂T has length P and let Ql = Xπ(1) Xπ(2) . . . Xπ(P −1) .
Then P (Qt ) = P (Xπ(1) = kπ(1) , . . . , Xπ(P −1) = kπ(P −1) ) is the probability
274 7 Classifier Design

of reaching that leaf. A global error is partially given by minimizing the av-
erage terminal entropy
 
H(Y |T ) = P (Ql )Hl (Y ) = P (Ql )H(Y |Ql ) (7.6)
l∈∂T l∈∂T

which means that it is desirable to either have small entropies at the leaves,
when the P (Ql ) are significant, or to have small P (Ql ) when H(Y |Ql ) becomes
significant.
The latter criterion is complemented by a complexity-based one, which is
compatible with the MDL principle: the minimization of the expected depth

Ed(T ) = P (Ql )d(l) (7.7)
l∈∂T

d(l) being the depth of the leaf (the depth of the root node is zero).
When combining the two latter criteria into a single one it is key to high-
light that the optimal tree depends on the joint distribution of the set of
possible tests X = {X1 , . . . , XN } and the true class Y . Such joint distribu-
tion (X, Y ) is the model M. It has been pointed out [8, 62] that a model
is available in certain computer vision applications like face detection where
there is a “rare” class a, corresponding to the object (face) to be detected,
and a “common” one b corresponding to the background. In this latter case
we may estimate the prior density p0 (y), y ∈ Y = {a, b} and exploit this
knowledge to speed up the building of the classification tree.
Including the latter considerations, the global optimization problem can be
posed in terms of finding

T ∗ = arg min C(T , M) = H(Y |T ) + λEd(T ) (7.8)


T ∈Ω

where Ω is the set of possible trees (partial orders) depending on the available
features (tests) and λ > 0 is a control parameter. Moreover, the maximal depth
of T ∗ is bounded by D = maxl∈∂T d(l).
What is interesting in the latter formulation is that the cost C(T , M) can
be computed recursively. If we assume that the test Xt is associated to the
root of the tree we have

C(T , M) = λ + P (Xt =0)C(T0 , {M|Xt =0}) + P (Xt = 1)C(T1 , {M|Xt =1})



= λ+ P (Xt = k)C(Tk , {M|Xt = k})
k∈{0,1}

and therefore

C(Tk , {M|Xt = k}) = H(Y |Tk , Xt = k) + λEd(Tk )


7.2 Model-Based Decision Trees 275

Tk being the k-th subtree and {M|Xt = k} the resulting model after fixing
Xt = k. This means that, once Xt is selected for the root, we have only N − 1
tests to consider. Anyway, finding C ∗ (M, D), the minimum value of C(T , M)
over all trees with maximal depth D requires a near O(N !) effort, and thus is
only practical for a small number of tests and small depths.
In addition, considering that C ∗ (M, D) is the minimum value of C(T , M)
over all trees with maximal depth D, we have that for D > 0:

⎨ H(p0 ) 
C ∗ (M, D) = min λ + min P (Xt = k)C ∗ ({M|Xt = k}, D − 1)
⎩ t∈X
k∈{0,1}
(7.9)

and obviously C (M, 0) = H(p0 ). As stated above, finding such a minimum
is not practical in the general case, unless some assumptions were introduced.

7.2.3 Rare Classes with the Greedy Approach

For instance, in a two-class context Y = {a, b}, consider that we have a “rare”
class a and a “common” one b, and we assume the prior p0 (a) ≈ 10−1 and
p0 (b) = 1−p0 (a). Consider for instance the following test: always X1 = 1 when
the true class is the rare one, and X1 = 1 randomly when the true class is the
common one; that is P (X1 = 1|Y = a) = 1 and P (X1 = 1|Y = b) = 0.5. If the
rare class corresponds to unfrequent elements to detect, such test never yields
false negatives (always fires when such elements appear) but the rate of false
positives (firing when such elements do not appear) is 0.5. Suppose also that
we have a second test X2 complementary to the first one: X2 = 1 randomly
when the true class is the rare one, and never X2 = 1 when the true class is
the common one, that is P (X2 = 1|Y = a) = 0.5 and P (X2 = 1|Y = b) = 0.
Suppose that we have three versions of X2 , namely X2 , X2 , and X2 which is
not a rare case in visual detection where we have many similar tests (features)
with similar lack of importance.
Given the latter specifications (see Table 7.1), a local approach has to
decide between X1 and any version of X2 for the root r, based on Hr (Y |X1 )
and Hr (Y |X2 ). The initial entropy is

H(Y ) = − P (Y = a) log2 P (Y = a) − P (Y = b) log2 P (Y = b)


     
p0 (a) p0 (b)
   
5 5 9, 995 9, 995
=− log2 − log2 = 0.0007
104 104 104 104
  
≈0
276 7 Classifier Design

Table 7.1. Simple example.


  
Example X1 X2 X2 X2 Class
x1 1 0 0 0 b
x2 1 0 0 0 b
... ... ... ... ... ...
x5,000 1 0 0 0 b
x5,001 0 0 0 0 b
x5,002 0 0 0 0 b
... ... ... ... ... ...
x9,995 0 0 0 0 b
x9,996 1 0 0 1 a
x9,997 1 0 1 0 a
x9,998 1 1 0 0 a
x9,999 1 1 0 1 a
x10,000 1 1 1 0 a
  
Test X1 X2 X2 X2 Class

x1 1 0 0 0 a

x2 1 0 1 1 a

x3 1 1 1 1 a

When computing conditional entropies for selecting X1 or X2 for the root we


have

H(Y |X1 ) = P (X1 = 0) H(Y |X1 = 0) +P (X1 = 1)H(Y |X1 = 1)


  
0
⎧ ⎫

⎪     ⎪

(5, 000 + 5) ⎨ 5 5 5, 000 5, 000 ⎬
= − log2 − log2
104 ⎪
⎪ 5, 005 5, 005 5, 005 5, 005 ⎪⎪
⎩    ⎭
≈0
= 0.0007 = H(Y )

whereas
    
H(Y |X2 ) = P (X2 = 0)H(Y |X2 = 0) + P (X2 = 1) H(Y |X2 = 1)
  
0
   
(9, 995 + 2) 2 2 9, 995 9, 995
= − log 2 − log 2
104 9, 997 9, 997 9, 997 9, 997
H(Y )
= 0.0003 ≈
2
7.2 Model-Based Decision Trees 277

and
    
H(Y |X2 ) = P (X2 = 0)H(Y |X2 = 0) + P (X2 = 1) H(Y |X2 = 1)
  
0
   
(9, 995 + 3) 3 3 9, 995 9, 995
= − log 2 − log 2
104 9, 998 9, 998 9, 998 9, 998
= 0.0004.

and H(Y |X2 )
    
H(Y |X2 ) = P (X2 = 0)H(Y |X2 = 0) + P (X2 = 1) H(Y |X2 = 1)
  
0
   
(9, 995 + 3) 3 3 9, 995 9, 995
= − log2 − log2
104 9, 998 9, 998 9, 998 9, 998
= 0.0004
The latter results evidence that H(Y |X2 ), for all versions of the X2 test,
will be always lower than H(Y |X1 ) because H(Y |X1 ) is dominated by the
case X1 = 1 because in this case the fraction of examples of class b is reduced
to 1/2 and also all patterns with class a are considered. This increases the
entropy approaching H(Y ). However, when considering X2 , the dominating
option is X2 = 0 and in case almost all patterns considered are of class b
except few patterns of class a which reduces the entropy.

Therefore, in the classical local approach, X2 will be chosen as the test for

the root node. After such a selection we have in X ∼ X2 almost all of them of

class b but two of class a, whereas in X2 there are three examples of class a.

This means that the child of the root node for X2 = 0 should be analyzed

more in depth whereas child for X2 = 1 results in a zero entropy and does not
require further analysis (labeled with class a). In these conditions, what is the
best following test to refine the root? Let us reevaluate the chance of X1 :
0
  
  
H(Y |X2 = 0, X1 ) = P (X2 = 0, X1 = 0) H(Y |X2 = 0, X1 = 0)
 
+ P (X2 = 0, X1 = 1)H(Y |X2 = 0, X1 = 1)
(5, 000 + 2) 
= H(Y |X2 = 0, X1 = 1)
104
 
(5, 000 + 2) 2 2
= − log2
104 5, 002 5, 002
 
5, 000 5, 000
− log2 = 0.0003
5, 002 5, 002
278 7 Classifier Design

and consider also X2 :
     
H(Y |X2 = 0, X2 ) = P (X2 = 0, X2 = 0)H(Y |X2 = 0, X2 = 0)
   
+P (X2 = 0, X2 = 1) H(Y |X2 = 0, X2 = 1)
  
0
(9, 995 + 1)  
= H(Y |X2 = 0, X2 = 0)
104  
(9, 995 + 1) 1 1
= − log2
104 9, 996 9, 996
 
9, 995 9, 995
− log2 = 0.0002
9, 996 9, 996

and X2 :
     
H(Y |X2 = 0, X2 ) = P (X2 = 0, X2 = 0)H(Y |X2 = 0, X2 = 0)
   
+ P (X2 = 0, X2 = 1) H(Y |X2 = 0, X2 = 1)
  
0
(9, 995 + 2)  
= H(Y |X2 = 0, X2 = 0)
104  
(9, 995 + 2) 2 2
= − log2
104 9, 997 9, 997
 
9, 995 9, 995
− log2 = 0.0003
9, 997 9, 997

Again, X1 is discarded. What happens now is that the child for X2 = 1
has an example of class a and it is declared a leaf. On the other hand, the

branch for X2 = 0 has associated many examples of class b and only one of
class a, and further analysis is needed. Should this be the time for X1 ? Again
   
H(Y |X2 = 0, X2 = 0, X1 ) = P (X2 = 0, X2 = 0, X1 = 0)
0
  
 
H(Y |X2 = 0, X2 = 0, X1 = 0)
 
+P (X2 = 0, X2 = 0, X1 = 1)
 
H(Y |X2 = 0, X2 = 0, X1 = 1)
(9, 995 + 1)  
= H(Y |X2 = 0, X2 = 0, X1 = 1)
104  
(9, 995 + 1) 1 1
= − log2
104 9, 996 9, 996
 
9, 995 9, 995
− log2 = 0.0002
9, 996 9, 996
7.2 Model-Based Decision Trees 279

However, if we consider X2 what we obtain is
     
H(Y |X2 = 0, X2 = 0, X2 ) = P (X2 = 0, X2 = 0, X2 = 0)
0
  
  
H(Y |X2 = 0, X2 = 0, X2 = 0)
(9, 995 + 1)   
+ H(Y |X2 = 0, X2 = 0, X2 = 1)
104  
(9, 995 + 1) 1 1
= − log 2
104 9, 996 9, 996
 
9, 995 9, 995
− log2 = 0.0002
9, 996 9, 996

Then we may select X2 and X1 . This means that with a lower value of p0 (a),
X1 would not be selected with high probability. After selecting, for instance
 
X2 , we may have that the child for X2 = 0 is always of class b and the

child for X2 = 1 is a unique example of class a. Consequently, all leaves
are reached without selecting test X1 . Given the latter greedy tree, let us
  
call it Tlocal = (X2 , X2 , X2 ) attending to its levels, we have that the rate
of misclassification is 0 when the true class Y = b although to discover it we
should reach the deepest leaf. When we test the tree with a examples not
contemplated in the training we have that only one of them is misclassified
(the misclassification error when Y = a is 18 (12.5%)). In order to compute the
mean depth of the tree we must calculate the probabilities of reaching each
of the leaves. Following an inorder traversal we have the following indexes for
the four leaves l = 3, 5, 6, 7:

P (Q3 ) = P (X2 = 1) = 3 × 10−4
 
P (Q5 ) = P (X2 = 0, X2 = 1) = 10−4
  
P (Q6 ) = P (X2 = 0, X2 = 0, X2 = 0) = 9, 995 × 10−4
  
P (Q7 ) = P (X2 = 0, X2 = 0, X2 = 1) = 10−4

and then

Ed(Tlocal ) = P (Ql )d(l)
l∈∂Tlocal
= P (Q3 )d(3) + P (Q5 )d(5) + P (Q6 )d(6) + P (Q7 )d(7)
= 3 × 10−4 × 1 + 10−4 × 2 + 9, 995 × 10−4 × 3 + 10−4 × 3
= 2.9993

which is highly conditioned by P (Q6 )d(6), that is, it is very probable to reach
that leaf, the unique labeled with b. In order to reduce such average depth it
should be desirable to put b leaves at the lowest depth as possible, but this
280 7 Classifier Design

implies changing the relative order between the tests. The fundamental ques-
tion is that in Tlocal uneffective tests (features) are chosen so systematically
by the greedy method, and tests which work perfectly, but for a rare class are
relegated or even not selected.
On the other hand, the evaluation H(Y |Tlocal ) = 0 results from a typical
tree where all leaves have null entropy. Therefore the cost C(Tlocal , M) =
λEd(Tlocal ), and if we set λ = 10−4 we have a cost of 0.0003.

7.2.4 Rare Classes with Global Optimization

In order to perform global optimization following the latter example, we are


going to assume that we have roughly two classes of tests X = {X1 , X2 }
  
with similar probability distributions, say X1 = {X1 , X1 , X1 } and X2 =
  
{X2 , X2 , X2 }. In computer vision arena, each of these versions could be un-
derstood as filters φ() with similar statistical behavior. In order to make the
problem tractable it seems reasonable to consider the context of building trees
with instances of X1 and X2 , and the results of the test for 104 data are in-
dicated in Table 7.2. In this latter case, the probabilities of each type of filter
are:

P (X1 = 0) = 4, 995 × 10−4 = 0.4995


P (X1 = 1) = 5, 005 × 10−4 = 0.5005
P (X2 = 0) = 9, 997 × 10−4 = 0.9997
P (X2 = 1) = 3 × 10−4 = 0.0003

The latter probabilities are needed for computing C ∗ (M, D) in Eq. 7.9,
and, of course, its associated tree. Setting for instance D = 3, we should
compute C ∗ (M, D), being M = p0 because the model is mainly conditioned
by our knowledge of such prior. However, for doing that we need to compute
C ∗ (M1 , D − 1) where M1 = {M|Xt = k} for t = 1 . . . N (in this N = 2) and
k = 0, 1 is the set of distributions:

M1 = {p(·|Xt = k), 1 ≤ t ≤ N, k ∈ {0, 1}}

and in turn compute C ∗ (M2 , D − 2) being

M1 = {p(·|Xt = k, Xr = l), 1 ≤ t, r ≤ N, k, l ∈ {0, 1}}

and finally C ∗ (M3 , D − 3) = C ∗ (M3 , 0) from

M3 = {p(·|Xt = k, Xr = l, Xs = m), 1 ≤ t, r, s ≤ N, k, l, m ∈ {0, 1}}

The latter equations are consistent with the key idea that the evolution of
(X, Y ) depends only on the evolution of the posterior distribution as the tests
are performed. Assuming conditional independence between the tests (X1 and
7.2 Model-Based Decision Trees 281

Table 7.2. Complete set of tests.


     
Example X1 X1 X1 X2 X2 X2 Class
x1 1 1 1 0 0 0 b
... ... ... ... ... ... ... ...
x1,250 1 1 1 0 0 0 b
x1,251 1 1 0 0 0 0 b
... ... ... ... ... ... ... ...
x2,499 1 1 0 0 0 0 b
x2,500 1 0 0 0 0 0 b
... ... ... ... ... ... ... ...
x3,750 1 0 0 0 0 0 b
x3,751 1 0 1 0 0 0 b
... ... ... ... ... ... ... ...
x5,000 1 0 1 0 0 0 b
x5,001 0 1 1 0 0 0 b
... ... ... ... ... ... ... ...
x6,250 0 1 1 0 0 0 b
x6,251 0 1 0 0 0 0 b
... ... ... ... ... ... ... ...
x7,500 0 1 0 0 0 0 b
x7,501 0 0 0 0 0 0 b
... ... ... ... ... ... ... ...
x8,750 0 0 0 0 0 0 b
x8,751 0 0 1 0 0 0 b
... ... ... ... ... ... ... ...
x9,995 0 0 1 0 0 0 b
x9,996 1 1 1 0 0 1 a
x9,997 1 1 1 0 1 0 a
x9,998 1 1 1 1 0 0 a
x9,999 1 1 1 1 0 1 a
x10,000 1 1 1 1 1 0 a
     
Test X1 X1 X1 X2 X2 X2 Class

x1 1 1 1 0 0 0 a

x2 1 1 1 0 1 1 a

x3 1 1 1 1 1 1 a

X2 in this case) it is possible to select the best test at each level and this will
be done by a sequence of functions:

Ψd : Md → {1 . . . N }

In addition, from the latter equations, it is obvious that we have only two
tests in our example: X1 and X2 , and consequently we should allow to use
282 7 Classifier Design

them repeatedly, that is, allow p(·|X1 = 0, X1 = 1) and so on. This can be
achieved in practice by having versions or the tests with similar statistical
behavior as we argued above. Thus, the real thing for taking examples and
computing the posteriors (not really them by their entropy) will be the un-
 
derlying assumption of p(·|X1 = 0, X1 = 1), that is, the first time X1 is
named we use its first version, the second time we use the second one, and
so
Don. However, although we use this trick,
D the algorithm needs to compute
0=1 |M d | which means, in our case, 0=1 (2 × N )d
= 1 + 4 + 16 + 64 pos-
teriors. However, if we do not take into account the order in which the tests
are performed along a branch the complexity is
 
d + 2M − 1
|Md | =
2M − 1

which reduces the latter numbers to 1 + 4 + 10 + 20 (see Fig. 7.1, top) and
makes the problem more tractable.
Taking into account the latter considerations it is possible to elicit a dy-
namic programming like solution following a bottom-up path, that is, starting
from computing the entropies of all posteriors after D = 3. The top-down
collection of values states that the tests selected at the first and second levels
are the same X2 (Fig. 7.1, bottom). However, due to ambiguity, it is possible
to take X2 as optimal test for the third level which yields Tlocal ≡ Tglobal1

X1=0 X1=1 X2=0 X2=1

X1=0 X1=0 X1=1 X1=1 X1=0 X1=0 X1=1 X2=0 X2=0 X2=1
X2=0 X2=1 X2=0 X2=1 X1=0 X1=1 X1=1 X2=0 X2=1 X2=1

X1=0 X1=0 X1=0 X1=0 X1=0 X11=0 X1=0 X1=1 X1=1 X1=1 X1=1 X1=1 X1=0 X1=0 X1=0 X1=1 X2=0 X2=0 X2=0 X2=1
X2=0 X2=0 X2=0 X2=0 X2=1 X2=1 X2=1 X2=0 X2=0 X2=0 X2=1 X2=1 X1=0 X1=0 X1=1 X1=1 X2=0 X2=0 X2=1 X2=1
X1=0 X1=1 X2=0 X3=
=1 X1=0 X1=1 X2=1 X1=1 X2=0 X2=1 X1=1 X2=1 X1=0 X1=1 X1=1 X1=1 X2=0 X2=1 X2=1 X2=1

X2

.0003 H(p0 )=0.0007


=0.0001

X1 X11 X2 X1 X2

.0002 .0005 .0002 .0002

X1 X1 X2
X1 X2 X2 X2 X1 X1 X2 X2 X1
.0001 .0007 .0004 .0001 .0001 .0001 .0013 .0001 .0001 .0007

0.000 0.000 0.000 .0012 .0003 0.000 0.000 0.000 0.000 0.000 0.000 .0057 0.000 0.000

Fig. 7.1. Bottom-up method for computing the tree with minimal cost. Top: cells
needed to be filled during the dynamic programming process. Bottom: bottom-up
process indicating the cost cells and the provisional winner test at each level. Dashed
lines correspond to information coming from X1 and solid lines to information com-
ing from X2 .
7.2 Model-Based Decision Trees 283

having exactly the optimal cost. However, it seems that it is also possible to
  
take X1 for the third level and obtaining for instance Tglobal2 = (X2 , X2 , X1 ).
The probabilities of the four leaves are:

P (Q3 ) = P (X2 = 1) = 3 × 10−4 ,
 
P (Q5 ) = P (X2 = 0, X2 = 1) = 10−4 ,
  
P (Q6 ) = P (X2 = 0, X2 = 0, X1 = 0) = 4, 995 × 10−4 ,
  
P (Q7 ) = P (X2 = 0, X2 = 0, X1 = 1) = 5, 001 × 10−4 ,

and the mean depth is



Ed(Tglobal2 ) = P (Ql )d(l)
l∈∂Tglobal2

= P (Q3 )d(3) + P (Q5 )d(5) + P (Q6 )d(6) + P (Q7 )d(7)


= 3 × 10−4 × 1 + 10−4 × 2 + 4, 995 × 10−4 × 3
+ 5,001 × 10−4 × 3
= 2.9993

and

H(Y |Tglobal2 ) = P (Ql )d(l)
l∈∂Tglobal2

= P (Q3 )d(3) + P (Q5 )d(5) + P (Q6 )d(6) + P (Q7 )d(7)


= 0 + 0 + 0 + H(Y |Q7 )
 
5, 000 5, 000
= 0+0+0+ log2
5, 0001 5, 001
= 0.0003

Let us evaluate the global cost C(T , M) of the obtained tree

C(Tglobal2 , M) = H(Y |Tglobal2 ) + λEd(Tglobal2 )


= 0.0003 + 0.0001 × 2.9993
= 0.0005999 > C ∗ (M, D)
  
0.0003

which indicates that this is a suboptimal choice, having, at least, the same
misclassification error than Tlocal . However, in deeper trees what is typically
observed is that the global method improves the misclassification rate of the
local one and produces more balanced trees including X1 -type nodes (Fig. 7.2).
284 7 Classifier Design

H(Y)=0.0007 H(Y|X’2)=0.0003
X´2 X´2

H(Y|X’2=0,X’’2)=0.0002

X˝2 A X˝2 A
H(Y|...,X’’’2)=0.0002
A A
X˝´2 X´1

B A B AB
Fig. 7.2. Classification trees. Left: using the classical greedy approach. Ambiguity:
nonoptimal tree with greater cost than the optimal one.

7.3 Shape Quantization and Multiple Randomized Trees

7.3.1 Simple Tags and Their Arrangements

Classical decision trees, reviewed and optimized in the latter section, are de-
signed to classify vectorial data x = (x1 , x2 , . . . , xN ). Thus, when one wants to
classify images (for instance bitmaps of written characters) with these meth-
ods, it is important to extract significant features, and even reduce N as
possible (see Chapter 5). Alternatively it is possible to build trees where the
tests Xt operate over small windows and yield 0 or 1 depending on whether
the window corresponds to a special configuration, called tag. Consider for in-
stance the case of binary bitmaps. Let us, for instance, consider 16 examples
of the 4 simple arithmetic symbols: +, −, ÷, and ×. All of them are 7 × 7
binary bitmaps (see Fig. 7.3). Considering the low resolution of the examples
we may retain all the 16 tags of dimension 2 × 2 although the tag correspond-
ing to number 0 should be deleted because it is the least informative (we will
require that at least one of the pixels is 1 (black in the figure)).
In the process of building a decision tree exploiting the tags, we may assign
a test to each tag, that is X1 , . . . , X15 which answer 1 when the associated
tag is present in the bitmap and 0 otherwise. However, for the sake of invari-
ance and higher discriminative power, it is better to associate the tests to
the satisfaction of “binary” spatial relationships between tags, although the
complexity of the learning process increases significantly with the number of
tags. Consider for instance four types of binary relations: “north,” “south,”
“west” and “east.” Then, X5↑13 will test whether tag 5 is north of tag 13,
whereas X3→8 means that tag 5 is at west of tag 8. When analyzing the lat-
ter relationships, “north,” for instance, means that the second row of the tag
must be greater than the first row of the second tag.
Analyzing only the four first initial (ideal) bitmaps we have extracted
38 canonic binary relationships (see Prob. 7.2). Canonic means that many
relations are equivalent (e.g. X1↑3 is the same as X1↑12 ). In our example the
7.3 Shape Quantization and Multiple Randomized Trees 285

Fig. 7.3. Top: Sixteen example bitmaps. They are numbered 1 to 16 from top-
left to bottom-right. Each column has examples of a different class. Bitmaps 1 to
4 represent the ideal prototypes for each of the four classes. Bottom: The 16 tags
coding the binary numbers from 0000 to 1111. In all these images 0 is colored white
and 1 is black.

+1, +1,
initial entropy is H(Y ) = −4 4 log2 4 = 2 as there are four classes and
four examples per class.

7.3.2 Algorithm for the Simple Tree

Given B, the set of canonic binary relations, called binary arrangements, the
basic process for finding a decision tree consistent with the latter training set,
could be initiated by finding the relation minimizing the conditional entropy.
The tree is shown in Fig. 7.4 and X3↑8 is the best local choice because it
allows to discriminate between two superclasses: (−, ×) and (+, ÷) yielding
the minimal relative entropy H(Y |X3↑8 ) = 1.0. This choice is almost obvious
because when X3↑8 = 0 class − without “north” relations is grouped with
class × without the tag 3. In this latter case, it is then easy to discriminate
between classes − and × using X3→3 which answers “no” in ×. Thus, the
leftmost path of the tree (in general: “no”,“no”,. . .) is built obeying to the
286 7 Classifier Design

H(Y)=2 H(Y|X3 8 )=1.0


X3 8
2,6,10,14 4,8,12,16 1,5,9,13 3,7,11,15
H(Y|X3 8=0,X3 )=0.5
H(Y|X3 8=1,X5 )=0.2028
X3 X5+5
1,5,9 3,7,11,15

X4+4 H(Y|X3 8=1,X5 =1,X4 )=0.0


2,6,10,14 4,8,12,16 13

3,7,11,15 1,5,9

Fig. 7.4. Top: tree inferred exploiting binary arrangements B and the minimal
extensions At found by the classical greedy algorithm. Bottom: minimal extensions
selected by the algorithm for example 5. Gray masks indicate the tag and the number
of tag is in the anchoring pixel (upper-left).

rule to find the binary arrangement which is satisfied by almost one example
and best reduces the conditional entropy. So, the code of this branch will have
the prefix 00 . . .. But, what happens when the prefix 00 . . . 1 appears?
The rightmost branch of the tree illustrates the opposite case, the prefix is
11 . . . . Once a binary arrangement has been satisfied, it proceeds to complete
it by adding a minimal extension, that is, a new relation between existing tags
or by adding a new tag and a relation between this tag and one of the existing
ones. We denote by At the set of minimal extensions for node t in the tree.
In this case, tags are anchored to the coordinates of the upper-leftmost pixel
(the anchoring pixel) and overlapping between new tags (of course of different
types) is allowed. Once a 1 appears in the prefix, we must find the pending
arrangement, that is, the one minimizing the conditional entropy. For instance,
our first pending arrangement is X5+5→3 , that is, we add tag 5 and the relation
5 → 3 yielding a conditional entropy of H(Y |X3↑8 = 1, X5→3 ) = 0.2028. After
that, it is easy to discriminate between + and ÷ with the pending arrangement
X4+4→5 yielding a 0.0 conditional entropy. This result is close to the ideal
Twenty Questions (TQ): The mean number of queries EQ to determine the
true class is the expected length of the codes associated to the terminal leaves:
C1 = 00, C2 = 01, C3 = 10, C4 = 110, and C5 = 111:
 1 1 1 1 3
EQ = P (Cl )L(Cl ) = ×2+ ×2+ ×1+ ×3+ × 3 = 2.3750
4 4 16 4 16
l∈T
7.3 Shape Quantization and Multiple Randomized Trees 287

that is H(Y ) ≤ EQ < H(Y ) + 1, being H(Y ) = 2. This is derived from the
Huffman code which determines the optimal sequence of questions to classify
an object. This becomes clear from observing that in the tree, each test divides
(not always) the masses of examples in subsets with almost equal size.

7.3.3 More Complex Tags and Arrangements

In real experiments for handwritten character recognition, the example images


have larger resolutions (for instance 70 × 70) and the size of the, usually
squared, window should be increased in order to capture relevant details of the
shapes. Even for a N = 4 height, 2N ×N = 65, 536 masks should be considered.
This implies the need of discovering clusters of masks. We may take a large
sample U of 4 × 4 windows from the training bitmaps. Then, we build a
binary classification tree U, where the tags will be the nodes of the tree. If the
window size is 4 × 4, we may perform 16 binary questions of the type “Is pixel
(i,j)=1?”. Of course, the root node contains no questions (this is indicated by
a gray tone in Fig. 7.4 (top)). In order to select the following question (pixel)
at node t, it must divide the subpopulation of samples Ut into two groups with
similar mass as possible. This is the underlying principle of Twenty Questions
and it is addressed to reduce the entropy of the empirical distribution of
configurations: the Huffman code is the one with minimal expected length
and this means that following this approach, the upper bound of the entropy
will be minimal and also the entropy itself.
The first level has two tags and their children inherit the questions of the
parent before selecting a new one. The second level has four tags and so on.
The number of questions in a tag corresponds to the level (root is level zero).
And for five questions we have 2 + 4 + 8 + 16 + 32 = 62 tags. Furthermore,
taking into account also the fact that eight binary relations must be consid-
ered (including “north-west”, and so on). As we have seen above, exploiting a
sequence of binary arrangements (let us denote such sequence as A = {2 &,
2 ' 7}) has an interesting impact in reducing the entropy of the conditional
distribution P (Y = y|XA = 1) (see Fig. 7.5). Higher order relations may
be considered. All of these relations may be metric ones, provided that scale
translational invariances are preserved. For instance, an example of ternary
relations between three anchoring pixels x, y and z is ||x − y|| < ||x − z||, and
if we add a fourth point t we have the quaternary relation ||x − y|| < ||z − t||.
As the number of tags, relationships, and their order increase, it is more and
more probable to reach leaves of zero conditional entropy, but the shape class
is correctly determined. However, the complexity of the learning process is
significantly increased, and this is why in practice the number of tags and the
number of relations are bounded (for instance 20 tags and 20 relations). Fur-
thermore, it is more practical to complete minimal arrangements (for instance
when binary relationships are considered) without trying to yield a connected
graph, that is, with a set of binary graphs.
288 7 Classifier Design

0.4

0.3
P(Y=c|XA=1)

0.2

0.1

0
0 1 2 3 4 5 6 7 8 9
Digit

Fig. 7.5. Top: first three tag levels and the most common configuration below each
leaf. Center: instances of geometric arrangements of 0 and 6. Bottom: conditional
distribution for the 10 digits and the arrangement satisfied by the shapes in the
middle. Top and Center figures by Y. Amit and D. Geman (1997c MIT Press).
7.3 Shape Quantization and Multiple Randomized Trees 289

7.3.4 Randomizing and Multiple Trees

Let us then suppose that we have a set X of M possible tests and also that
these tests are not too complex to be viable in practice. The purpose of a
classification tree T is to minimize H(Y |X), which is equivalent to maximizing
P (Y |X). However, during the construction of T not all tests are selected, we
maximize instead P (Y |T ) which is only a good approximation of maximizing
P (Y |X) when M , and hence the depth of the tree, is large.
Instead of learning a single deep tree, suppose that we replace it by a
collection of K shallow trees T1 , T2 , . . . , TK . Then we will have to estimate

P (Y = c|Tk ) k = 1, . . . , K.

Considering that the number M of possible tests X is huge, and considering


also that a random selection from the set B of binary tests is likely to increase
the information about the Y random variable1 it seems reasonable to select
a random subset of B for each node for growing a particular tree instead of
reviewing “all” the possible arrangements. Besides reducing the complexity of
the process, randomization has three additional good effects. Firstly, it allows
to characterize the same shape from different points of views (one per tree)
as we show in Fig. 7.6 (top). Secondly, it reduces the statistical dependency
between different trees. And thirdly, it introduces more robustness during
classification.
The latter second and third advantages are closely related and emerge from
the way that the different trees are combined. These trees may be aggregated
as follows. Suppose that we have |Y| = C classes. Thus, the leaves of each tree
will have only values c = 1, . . . , C. However, as we bound the depth of each
tree, say to D, it is expected to have some high entropic leaves. Generally
speaking we are interested in estimating P (Y = c|Tk = l), where l ∈ ∂Tk
is the posterior probability that the correct class is c at leaf l. So we must
estimate C × 2D parameters per tree (K × C × 2D contemplating all trees).
This means that we will have |X |/(C2D ) examples available per parameter
and consequently large training sets are needed. However, this need may be
attenuated considering that most of the parameters will be zero and estimating
only, for instance, the five largest elements (bins) of a given posterior.
Once all posteriors are estimated, and following the notation

μTk (c) = P (Y = c|Tk = l)

we have that μTk (x) denotes the posterior for the leaf reached by a given
input x. Aggregation by averaging consists of computing:

1 
K
μ̄(x) = μTk (x) (7.10)
K
k=1
1
In the example in Fig. 7.4 we have first selected a binary arrangement.
290 7 Classifier Design

Fig. 7.6. Randomization and multiple trees. Top: inferred graphs in the leaves of
five differently grown trees. Center: examples of handwritten digit images, before
(up) and after (down) preprocessing. Bottom: conditional distribution for the ten
digits and the arrangement satisfied by the shapes in the middle. Figure by Y. Amit
and D. Geman (1997
c MIT Press).
7.4 Random Forests 291

Finally, the class assigned to a given input x is the mode of the averaged
distribution:
Ŷ = arg max μ̄c (7.11)
c

This approach has been successfully applied to the OCR domain . Considering
the NIST database with 223, 000 binary digits written by more than two
thousand writers, and using 100, 000 for training and 50, 000 for testing (see
Fig. 7.6, center) the classification rate significantly increases with the number
of trees (see Fig. 7.6, bottom). In these experiments the average depth of the
trees was 8.8 and the average number of terminal nodes was 600.

7.4 Random Forests

7.4.1 The Basic Concept

As we have seen in the latter section, the combination of multiple trees in-
creases significantly the classification rate. The formal model, developed years
later by Breiman [30], is termed “Random Forests” (RFs). Given a labeled
training set (X , Y), an RF is a collection of tree classifiers F = {hk (x), k =
1, . . . , K}. Probably the best way of understanding RFs is to sketch the way
they are usually built.
First of all, we must consider the dimension of the training set |X | and
the maximum number of variables (dimensions) of their elements (examples),
which is N . For each tree. Then, the kth tree will be built as follows: (i) Select
randomly and with replacement |X | samples from X , a procedure usually
denoted bootstrapping2 ; this is its training set Xk . (ii) Each tree is grown
by selecting randomly at each node n << N variables (tests) and finding the
best split with them. (iii) Let each tree grow as usual, that is, without pruning
(reduce the conditional entropy as much as possible). After building the K
forest, an x input is classified by determining the most voted class among all
the trees (the class representing the majority of individual choices), namely

hF (x) = M AJ{h1 (x), h2 (x), . . . , hK (x)}

a procedure usually denoted as bagging.

7.4.2 The Generalization Error of the RF Ensemble

The analysis of multiple randomized trees (Section 7.3.4) reveals that from
a potentially large set of features (and their relationships) building multi-
ple trees from different sets of features results in a significant increment of
the recognition rate. However, beyond the intuition that many trees improve
2
Strictly speaking, bootstrapping is the complete procedure of extracting several
training sets by sampling with replacement.
292 7 Classifier Design

the performance, one must measure or bound, if possible, the error rate of
the forest, and check in which conditions it is low enough. Such error may be
quantified by the margin: the difference between the probability of predict-
ing the correct class and the maximum probability of predicting the wrong
class. If (x, y) (an example of the training and its correct class) is a sample of
the random vector X , Y which represents the corresponding training set, the
margin is formally defined as

mar(X , Y) = P (hF (x) = y) − max P (hF (x) = z) (7.12)


z =y

The margin will be in the range [−1, 1] and it is obvious that the largest the
margin the more confidence in the classification. Therefore, the generalization
error of the forest is

GE = PX ,Y (mar(X , Y) < 0) . (7.13)

that is, the probability of having a negative margin over the distribution X , Y
of training sets. Therefore, for the sake of the classification performance, it is
interesting that the GE, the latter probability, has a higher bound as low as
possible. In the case of RFs such a bound is related to the correlation between
the trees in the forest. Let us for instance define the strength of the forest as

s = EX ,Y mar(X , Y) (7.14)

that is, the expectation of the margin over the distribution X , Y, and let

z ∗ = arg max P (hF (x) = z) (7.15)


z =y

Then, we may redefine the margin in the following terms:

mar(X , Y) = P (hF (x) = y) − P (hF (x) = z ∗ ) (7.16)


⎡ ⎤
⎢ ⎥
= EΘ ⎣I(hF (x) = y) − I(hF (x) = z ∗ )⎦
  
rmar(X ,Y,Θ)

= EΘ [rmar(X , Y, Θ)]

I(·) being an indicator (counting) function, which returns the number of times
the argument is satisfied, rmar(., ., .) the so-called raw margin, and Θ = {Θk }
a bag of parameters sets (one set per tree). For instance, in the case of bagging
we have Θk = Xk . Consequently, the margin is the expectation of the raw
margin with respect to Θ, that is, with respect to all the possible ways of
bootstraping the training set.
Let us now consider two different parameter sets Θ and Θ . Assuming i.i.d.,
the property (EΘ f (Θ))2 = EΘ f (Θ)×EΘ f (Θ ) is satisfied. Consequently, and
applying Eq. 7.16, it is verified that
7.4 Random Forests 293

mar(X , Y)2 = EΘ,Θ [rmar(X , Y, Θ) × rmar(X , Y, Θ )] (7.17)

and consequently, defining

v = V arX ,Y (mar(X , Y))


= EX ,Y [mar(X , Y)2 ] − s2
= EX ,Y [EΘ,Θ [rmar(X , Y, Θ) × rmar(X , Y, Θ )]]] − s2

then, as s2 = (EX ,Y [EΘ rmar(X , Y, Θ)])2 , interchanging EX ,Y [EΘ,Θ ][·] and


EΘ,Θ [EX ,Y ][·] and the same holds for EX ,Y [EΘ ][.], and applying the definition
of covariance Cov(A, B) = E(A × B) − E(A) × E(B) we have that

v = EΘ,Θ {CovX ,Y [rmar(X , Y, Θ), rmar(X , Y, Θ )]}


= EΘ,Θ {ρ(Θ, Θ ) × Std(Θ) × Std(Θ )} (7.18)

ρ(., .) being the correlation between rmar(X , Y, Θ) and rmar(X , Y, Θ ) hold-


ing both parameter sets fixed and Std(.) being the standard deviation of the
corresponding raw margin holding the argument fixed. The latter definition
establishes an interesting link between the training set and the bootstrapped
sets.
Let us now define ρ̄(Θ, Θ ) as the mean value of the correlation:
v
  
 
E  {ρ(Θ, Θ ) × Std(Θ) × Std(Θ )}
ρ̄(Θ, Θ ) =
Θ,Θ
(7.19)
EΘ,Θ [Std(Θ) × Std(Θ )]

As Θ and Θ are independent with the same distribution the following relation
holds:
v
ρ̄(Θ, Θ ) = (7.20)
(EΘ [Std(Θ)])2
which is equivalent to

v = ρ̄(Θ, Θ ) × (EΘ [Std(Θ)])2 (7.21)

And we have
v ≤ ρ̄(EΘ [Std(Θ)])2 ≡ ρ̄EΘ [V ar(Θ)] (7.22)
2 2
Furthermore, as V arΘ (Std(Θ)) = EΘ [Std(Θ) ]−(EΘ [Std(Θ)]) , we have that

v ≤ (EΘ [Std(Θ)2 ] ≡ EΘ [V ar(Θ)]) (7.23)

As V ar(Θ) = EX ,Y [rmar(X , Y, Θ)2 ]−(EX ,Y [rmar(X , Y, Θ)])2 , consequently:

EΘ [V ar(Θ)] = EΘ [EX ,Y [rmar(X , Y, Θ)2 ]] − EΘ [(EX ,Y [rmar(X , Y, Θ)])2 ]


≤ 1 − s2 (7.24)
294 7 Classifier Design

Thus, we have two interesting connections: (i) an upper bound for the
variance of the margin depending on the average correlation; and (ii) another
upper bound between the expectation of the variance with respect to Θ, which
depends on the strength of the forest. The coupling of the latter connections
is given by the Chebishev inequality. Such inequality gives an upper bound
for rare events in the following sense: P (|x − μ| ≥ α) ≤ σ 2 /α2 . In the case of
random forests we have the following setting:

(1 − s2 )
GE = PX ,Y (mar(X , Y) < 0) ≤ ρ̄ (7.25)
s2
which means that the generalization error is bounded by a function depending
on both the correlation and the strength of the set classifiers. High correla-
tion between trees in the forest results in poor generalization and vice versa.
Simultaneously, low strength of the set of classifiers results in poor general-
ization.

7.4.3 Out-of-the-Bag Estimates of the Error Bound

When building the tree Tk ∈ F and selecting Xk = Θk ⊂ X through boot-


strapping, we have seen that independent selection has a deep impact on
the bounds of the generalization error. Thus, it is interesting to increase the
strength and reduce the correlation as much as possible. A useful mechanism
for quantifying and controlling these factors is the so called out-of-the-bag
(oob) estimate of GE. This mechanism starts by discarding 1/3 of the |X |
samples for building each tree (oob samples). Let
K
k=1 I(hk (x) = z : (x, y) ∈ (Xk , Y))
Q(x, z) = K (7.26)
k=1 I((x, y) ∈ (Xk , Y))

be the oob proportions of votes for a wrong class z. Such votes come from the
test set, and this means that yQ(x, z) is formally connected with GE. First
of all, it is an estimate for p(hF (x) = z). From the definition of the strength
in Eq. 7.14, we have

s = EX ,Y (P (hF (x) = y) − max P (hF (x) = z)) (7.27)


z =y

Thus, ŝ, the approximated strength, is defined as


1 
ŝ = (Q(x, y) − max Q(x, z)) (7.28)
|X | z =y
x∈X

With respect to the correlation, it is defined in Eq. 7.20 as a function of the


variance and the expectation of the standard deviation:
7.4 Random Forests 295
v
ρ̄ =
(EΘ [Std(Θ)])2

x∈X (Q(x, y) − maxz =y Q(x, z)) − ŝ
1 2 2
|X |
≈ (7.29)
(EΘ [Std(Θ)])2

The standard deviation is defined as Std(Θ) = py + pz∗ + (py − pz∗ )2 , py =
EX ,Y (hF (x) = y), pz∗ = EX ,Y (hF (x) = z ∗ ). Then, after building the kth
classifier from Θk we use Q(x, z) in order to compute z ∗ for every example in
X . Consequently, we have
1 
p̂y (Θk ) = I(hk (x) = y) .
3 |X | x∈X ∼Xk
1

1 
p̂z∗ (Θk ) = 1 I(hk (x) = z ∗ ) (7.30)
3 |X | x∈X ∼Xk

Therefore, Std(Θk ) can be approximated from the latter expressions. Finally,


Std(Θ) is approximated by the average of Std(Θk ) over all the classifiers form-
ing the forest. Thus, having estimates both for the variance and the standard
deviation we have the estimate for the correlation.

7.4.4 Variable Selection: Forest RI vs. Forest-RC

Besides the bootstrapping and its consequences in the GE bound, another


key element of random forests is the way of selecting n << N features (tests)
for building each individual classifier. Random selection of features, with a
fixed size n, is argued to be a good mechanism. This method is known as
Forest-RI. A typical choice of n is n = log2 N + 1. It is interesting not to
include too much variables, because the higher n the higher the correlation.
However, the higher n, the higher the strength. Consequently, it is important
to find a value for n which represents a good trade-off between increment-
ing correlation and incrementing the strength. In order to do that, a good
mechanism is to exploit the oob estimates. In this regard, when there are few
input variables, Forest-RI may lead to a high generalization error because an
important percentage of the variables will be selected. In these cases, it may
be interesting to build new variables by performing linear combinations of
exisiting ones. At each node, l variables are selected and combined linearly
with coefficients belonging to [−1, 1]. Then, n linear combinations are gen-
erated. Anyway, the final choice of n will depend on the trade-off described
above. This latter mechanism is known as Forest-RC. This mechanism has
been recently applied to image classification under the approach of bag-of-
visual-words(BoW) [28, 53, 147, 178, 181]. Broadly speaking, images are repre-
sented by several invariant (scale and affine) features like improved versions of
the Kadir–Brady one described in Chapter 2. Such features are encoded with
296 7 Classifier Design

suitable descriptors like the popular SIFT one [107], a 128-feature vector with
good orientation invariance properties. The descriptors (once clustered) define
a visual vocabulary. The more extended classification mechanism is to obtain
a histogram with the frequencies of each visual word for each image. Such
histogram is compared with stored histograms in the database and the closest
one is chosen as output. The two main weaknesses of the BoW approach are:
(i) to deal with clutter; and (ii) to represent the geometric structure properly
and exploit it in classification. Spatial pyramids [102] seek a solution for both
problems. The underlying idea of this representation is to partition the image
into increasingly fine subregions and compute histograms of the features inside
each subregion. A refinement of this idea is given in [27]. Besides histograms
of descriptors, also orientation histograms are included (Fig. 7.7, top). With
respect to robustness against clutter, a method for the automatic selection of
the Region of Interest (ROI) for the object (discarding clutter) is also pro-
posed in the latter paper. A rectangular ROI is learnt by maximizing the
area of similarity between objects from the same class (see Fig. 7.7, bottom).
The idea is to compute histograms only in these ROIs. Thus, for each level in
the pyramid, constrained to the ROI, two types of histograms are computed:
appearance ones and shape ones. This is quite flexible, allowing to give more
weight to shape or to appearance depending on the class. When building a
tree in the forest, the type of descriptor is randomly selected, as well as the
pyramid level from which it is chosen. If x is the descriptor (histogram) of a
given learning example, each bin is a feature, considering that the number of
bins, say N , depends on the selected pyramid level. Anyway, the correspond-
ing test for such example is nT x + b ≤ 0 for the right child, and >0 for the
left one. Thus all trees are binary. The components of n are chosen randomly
from [−1, 1]. The variable b is chosen randomly between 0 and the distance of
x from the origin. The number of zeros nz imposed for that test is also chosen
randomly and yields a number n = N − nz of effective features used. For each
node, r tests, with r = 100D (increasing linearly with the node depth D),
are performed and the best one in terms of conditional entropy reduction is
selected. The purpose of this selection is to reduce the correlation between
different trees, accordingly with the trade-off described above. The number of
effective features is not critical for the performance, and the method is com-
parable with state-of-the-art approaches to BoW [178] (around 80% with 101
categories and only 45% with 256 ones).
Another interesting aspect to cover within random forest is not the num-
ber of features used but the possibility of selecting variables on behalf of their
importance (Chapter 6 is devoted to feature selection, and some techniques
presented there could be useful in this context). In the context of random
forests, important variables are the ones which, when noised, maximize the
increasing of misclassification error with respect to the oob error with all vari-
ables intact. This definition makes equal importance to sensitiveness. How-
ever, this criterion has been proved to be inadequate in some settings: for
instance in scenarios with heterogeneous variables [151], where subsampling
7.4 Random Forests 297

Fig. 7.7. Top: appearance and shape histograms computed at different levels of
the pyramid (the number of bins of each histogram depends on the level of the
pyramid where it is computed). Bottom: several ROIs learn for different categories
in the Caltech-256 Database (256 categories). Figure by A. Bosch, A. Zisserman and
X. Muñoz (2007
c IEEE). See Color Plates.
298 7 Classifier Design

without replacement is proposed as an alternative mechanism for driving ran-


dom forests. With respect to a definition of importance, in this context, in
terms of the information content about the class to which the example be-
longs, it seems an interesting open problem (see also [164]).

7.5 Infomax and Jensen–Shannon Boosting


Boosting algorithms are becoming popular among research community. The
underlying idea of Boosting is to combine a set of weak learners to improve
them and build a strong classifier. This idea is supported by the work of
Kearns and Valiant [95], who proved that, when enough training data is avail-
able, learners whose performance is slightly better than random guessers could
be joined to form a good classifier. Later, Schapire presented a polynomial
time Boosting algorithm [142]. Adaboost algorithm [61] is considered the first
practical approximation of Boosting to real-world problems.
Given a set of N labeled samples (x1 , y1 ), ..., (xN , yN ), xi ∈ X , yi ∈ {0, 1},
AdaBoost algorithm builds a strong classifier hf (x) from a linear combination
of T simple weak classifiers or hypothesis ht (x). Examples of weak learners
are mononeural perceptrons, decision trees, and in general, any classifier that
works at least slightly better than a random guesser, with error t < 1/2.
Simpler weak learners will yield better results than using multilayer percep-
trons, support vector machines, and other complexer methods. The main loop
of the algorithm, that is shown at Alg. 16, is repeated T times, in order to
learn T weak classifiers from weighted samples; after each iteration, wrongly

Algorithm 16: Freund and Schapire Adaboost algorithm


Input: set of N labeled examples (x1 , y1 ), ..., (xN , yN )
Initialize the samples weight vector D1 (i) = N1 , i = 1, .., N
for t=1 to T do
Train a new weak classifier ht providing it with the distribution Dt

N
Calculate the error of ht as t = D(i)|ht (xi ) − yi |
i=1
if t > 12 then
Stop algorithm
end
Set αt = (1−t t )
1−|h (x )−y |
Update weights vector Dt+1 (i) = Dt (i)αt t i i
Normalize weights vector so Dt+1 is a distribution
end
Output:⎧Strong classifier:
⎪ 

T
1 1
T
1
1, (log )ht (x) ≥ log
hf (x) = α 2 α

⎩ t=1
t
t=1
t
0, otherwise
7.5 Infomax and Jensen–Shannon Boosting 299

classified samples increase their weights, and as a consequence, the algorithm


will focus on them at next iteration. Classification error of each weak learner
is also stored in order to build the strong classifier.
An example can be seen in Fig. 7.8. As stated before, samples are weighted
during algorithm, allowing weak learners to focus on previously misclassified
data. Some learning algorithms can handle this reweighting in a natural way,
but in other cases it is not possible. Another possibility is to use weights in
order to perform a resampling of the training samples.
This section will show how information theory can be applied jointly with
Boosting through two algorithms. The first one, Infomax Boosting algorithm
[108], is based on the Infomax principle introduced by Linsker for neural
networks [105]. The main idea is that neurons are trained in order to maximize
the mutual information between their inputs and output. Translation of this
idea to Adaboost is simple: the algorithm will try to select in each iteration the
weak learner that maximizes the mutual information between input samples
and class labels.
On the other hand, we may find several algorithms based on divergence
measures like Kullback–Leibler to select at each iteration the best weak clas-
sifier. In the case of most of these methods, the only difference is the di-
vergence measure applied. At present, best results are achieved by JSBoost
learning [76], based on Jensen–Shannon divergence, which yields several ad-
vantages over other dissimilarity measures.
Both boosting algorithms described in this section were first applied to face
detection. The objective of these algorithms is to detect the exact positions
of all faces in an image. The image is splitted into image patches, at different
scales, and the classifier classifies each image patch as being face or not face.
In order to achieve this, during training process two classes are used: a set of
positive samples (image patches corresponding to faces) and a set of negative
samples (image patches that do not contain any face), having all these samples
the same size. Depending on the boosting algorithm, different features are
learnt as weak learners to build the strong classifier. Benefits of using features
are clear: computational cost is lower, due to the fact that image patches are
transformed into a lower level representation, and specific domain knowledge
may be incorporated to learning process. An example of features as basic
classifiers in a boosting algorithm may be found in Viola’s face detection
algorithm based on Adaboost [170]. In this case, rectangle features based on
Haar basis functions are applied to 24 × 24 image regions, as Fig. 7.10 shows.
The set of all possible features is large, therefore Adaboost is applied to extract
the most discriminative ones.

7.5.1 The Infomax Boosting Algorithm

The main idea of this algorithm is that information theory may be used dur-
ing Boosting in order to select the most informative weak learner in each
300 7 Classifier Design
1.5 1.5

ε= 0.120000
1 1 α= 0.136364

0.5 0.5

0 0

−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5 1.5

ε= 0.130682
1 1 α= 0.150327

0.5 0.5

0 0

−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5 1.5

ε= 0.109974
1 1 α= 0.123563

0.5 0.5

0 0

−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5

1.5

0.5

−0.5
−0.5 0 0.5 1 1.5

Fig. 7.8. Adaboost applied to samples extracted from two different Gaussian distri-
butions, using three iterations (three weak classifiers are trained). Different weights
are represented as different sample sizes. First row: initial samples and first weak
classifier. Second row: reweighted samples after first iteration and second classifier.
Third row: reweighted samples after second iteration and third classifier. Fourth row:
final classifier.
7.5 Infomax and Jensen–Shannon Boosting 301

iteration, thus discarding classification error as weak learner selection crite-


rion. The Infomax principle states that this most informative learner is the
one that maximizes the mutual information between input (training sample)
and output (class label). However, computing mutual information requires
numerical integration and knowing data densities; in the case of feature based
boosting, both requirements result in inefficient approaches. An efficient im-
plementation in the case of boosting based on high-dimensional samples and
features may be achieved by means of the method explained in this section,
based on kernel density estimation and quadratic mutual information.
Features may be described as low-dimensional representation of high di-
mensional data x ∈ Rd . Let us define a linear projection feature as the output

of a function φ : Rd → Rd with d ( d:

φ(x) = φT x (7.31)

with φ ∈ Rd and φT φ = 1.

Infomax feature selection

The most informative feature at each iteration is called Infomax feature, and
the objective of the algorithm is to find this Infomax feature at each iteration
of the boosting process. In order to measure how informative a feature is, mu-
tual dependence between input samples and class label may be computed. A
high dependence will mean that input samples will provide more information
about which class can it be labeled as. A natural measure of mutual depen-
dence between mapped feature φT x and class label c is mutual information:
C 
 p(φT x, c)
I(φT x; c) = p(φT x, c) log dφT x (7.32)
c=1 x p(φT x)p(c)

Therefore, Infomax feature φ∗ yields the maximum mutual information


with respect to all other features:

φ∗ = arg max I(φT x; c) (7.33)


φ

Figure 7.9 shows a simple example of mutual information as a measure


to detect the best of three classifiers, from the one that perfectly splits two
classes to one that cannot distinguish between classes. First classifier yields
the highest mutual dependence between samples and class labels; in the case
of the third classifier, input samples are not informative at all. As can be seen,
mutual information and classification error are related.
When trying to find the Infomax feature, two problems arise. The first
one is that the exact probability density function p(φT x, c) is not known.
302 7 Classifier Design

1.5 1.5

1 1

0.5 0.5

0 0

−0.5 −0.5
−0.5 0 0.5 1 1.5 −0.5 0 0.5 1 1.5
1.5

0.5

−0.5
−0.5 0 0.5 1 1.5

Fig. 7.9. Three different classifiers applied to the same data, obtained from two
different Gaussian distributions (represented at bottom right of the figure). Mutual
information between inputs and class labels, obtained by means of entropic graphs
(as explained in Chapters 3 and 4), are 6.2437, 3.5066, and 0, respectively. The
highest value is achieved in the first case; it is the Infomax classifier.

However, it can be estimated by means of a kernel density estimation method


like Parzen’s windows (explained in Chapter 5). Given a multivariate Gaussian
with mean μ and variance σ 2 :
 
1 (x − μ)T )(x − μ)
Gσ (x − μ) = exp − (7.34)
(2π)d/2πd 2σ 2
the probability density function can be approximated combining several of
these Gaussians kernels:

Nc
p(φT x|c) = wic Gσ (φT x − φT xci ) (7.35)
i=1

where Nc is the number of classes, xci represents input sample i from class c,
and wic is a non-negative weight applied to each training sample of class c,
Nc c
satistying that i=1 wi = 1.
Although solving this first problem, the derivation of the numerical inte-
gration of Eq. 7.32 is not simple. However, knowing that Mutual Information
7.5 Infomax and Jensen–Shannon Boosting 303

between projected data φT x and class label c may be expressed in terms of


Kullback–Leibler divergence:

I(φT x; c) = D(p(φT x, c)||p(φT x)p(c)) (7.36)

this second problem may be solved using a different divergence measure of


densities, like quadratic divergence:


Q(p||q) = (p(x) − q(x))2 dx (7.37)
x

Thus, initial mutual information expression may be reformulated as


quadratic mutual information between projected data and class label:

C 

IQ (φT x; c) = (p(φT x, c) − p(φT x)p(c))2 dφT x (7.38)
c=1 φT x

and after joining this expression with the probability density functions ap-
proximated by means of Eq. 1.5:

C 
Nc 
Nc
IQ (φT x; c) = Pc2 wic wjc G√2σ (yic − yjc )
c=1 i=1 j=1


C 
C 
Nc 
Nc
 
+ uci vjc G√2σ (yic − yjc ) (7.39)
c1 =1 c2 =1 i=1 j=1

where wic is the associated weight of xci in the approximated probability den-
sity function and

Pc = p(c) (7.40)
yic = φT xci (7.41)
 

C
uci = Pc Pc2 − 2Pc wic (7.42)
c =1
vic = Pc wic (7.43)

Thus, the Infomax feature is searched in feature space by means of a


quadratic mutual information gradient ascent. This gradient is computed from
previous equation:

∂IQ  Cc N
∂IQ ∂yic  c
∂IQ c
C N
= = x (7.44)
∂φ c=1 i=1
c
∂yi ∂φ c=1 i=1
∂yic i
304 7 Classifier Design

where:
 Nc  c 
∂IQ 2 c c
yi − yjc
= P c w w
i j 1 H
∂yic j=1

 
C  Nc c
 y c
i − y j
+ uci vjc H1 (7.45)
 j=1

c =1

H1 (x) = −2xe−x being the first degree Hermite function. Therefore, if an


2

initial estimate φ0 is chosen, by gradient ascent next values can be computed


as φk+1 = φk + ν ∂I Q
∂φ , with ν > 0, until convergence is reached.

Infomax Boosting vs. Adaboost

Infomax Boosting modifies the original Adaboost algorithm by using differ-


ent criteria to select the weak learner at each iteration: rather than selecting
the classifier having less error, the most informative one is chosen. The new
algorithm, as shown in Alg. 17, is straightforward. The main change is in

Algorithm 17: Infomax Boosting algorithm


Input: set of N+ positive and N− negative labeled examples
{x− −
1 , .., xN− , x1 , .., xN+ }, class priors P+ and P− and width
+ +

parameter of Gaussian kernels σ 2 .


+ −
Initialize wi,1 = N1+ , i = 1, .., N+ and wi,1 = N1− , i = 1, .., N−
for t=1 to T do
+ −
Choose the Infomax feature φt with wi,t ,wi,t and σ 2 .
Construct kernel density estimation of class-dependent densities using:
(t)

N+
+
p+ (φTt x) = wi,t Gσ (φTt x − φTt x+i )
i=1

N−

Gσ (φTt x − φTt x−
(t)
p− (φTt x) = wi,t i )
i=1
(t)
p+ (φT
t x)P+
Build weak classifier ft (x, c) = log (t)
p− (φT
t x)P−
Update weights:
+ +
wi,t+1 = wi,t exp(−ft (x+i ))
wi,t+1 = wi,t exp(−ft (x−
− −
i ))
+ −
Normalize weight vectors so wt+1 and wt+1 are distributions
end
T
Output: Strong classifier: f (x) = ft (x)
t=1
7.5 Infomax and Jensen–Shannon Boosting 305

the inner loop, where all concepts explained during this section are incorpo-
rated. A common modification of Adaboost is also included at initialization,
where weight values are calculated separately for samples belonging to dif-
ferent classes, in order to improve efficiency. It must be noted that strong
classifier output is not a discrete class label, but a real number, due to the
fact that Infomax Boosting is based on real Adaboost, and that class labels
are in {−1, 1}.
A question that may arise is if Infomax performance is really better than
in the case of Adaboost algorithm. Infomax boosting was originally applied
to face detection, like Viola’s face detection based on Adaboost. As can be
seen in Fig. 7.10, in this case features were obtained from oriented and scaled
Gaussians and Gaussian derivatives of first and second order, by means of con-
volutions. Figure 7.10 shows results of comparing both algorithms. Regarding
false positive rate, its fall as more features are added is similar; however,
at first iterations, error for Infomax is quite lower, needing a lower number
of features to decrease this error below 1%. In the case of ROC curves, it
can be clearly seen that Infomax detection rate is higher (96.3–85.1%) than
Adaboost value. The conclusion is that not only Infomax performance is bet-
ter, but that low error rates can also be achieved faster than in the case of
Adaboost, thanks to the Infomax principle, which allows algorithm to focus
on the most informative features.

7.5.2 Jensen–Shannon Boosting

Another way to apply information theory to boosting is using a symmetric


divergence measure in order to select the most discriminative weak learner at
each iteration, that is, the feature that optimally differentiates between the
two classes. One example is KLBoost [106], based on a symmetric Kullback–
Leibler divergence measure. However, Kullback–Leibler has limited numerical
stability. Jensen–Shannon improves this stability and introduces other advan-
tages. JSBoost algorithm, based on Jensen–Shannon divergence measure, is
quite similar to boosting algorithms explained above, thus we focus only on
how weak learners are selected and differences between it and other boosting
techniques.

Jensen–Shannon Feature Pursuit

The objective in each iteration is to find the JS Feature, the feature φ :


Rd → Rd that better discriminates between positive and negative classes.
Unlike Infomax algorithm, designed only for linear features, JSBoost algo-
rithm can rely also on not linear features. Once feature φi at iteration i is

learned, two histograms h+ i and hi are calculated from positive and neg-
ative samples mapped with this feature. Finally, from these two distribu-
tions, a weak classifier ϕi (): R → R is built in a way that ϕi (φi (x)) > 0 for
306 7 Classifier Design

Fig. 7.10. Comparison between Adaboost and Infomax. Top row left: example of
Haar features, relative to detection window, used in Viola’s face detection based
on Adaboost. Pixel values from white rectangles are subtracted from pixel values
in black rectangles. Top row right: example of application of two Haar features to
the same face image. Center row: feature bank used in Lyu’s face detection based
on Infomax. (Figure by S. Lyu (2005
c IEEE)). Bottom row: Lyu’s experiments
comparing Infomax and Adaboost false positive rate and ROC curves. (Figure by
S. Lyu (2005
c IEEE)).


positive samples (h+i (φi (x)) > hi (φi (x))) and ϕi (φi (x)) < 0 for negative ones
− +
(hi (φi (x)) > hi (φi (x))):

1 h+
i (φi (x))
ϕi (x) = log − (7.46)
2 hi (φi (x))

In order to select the most discriminative feature at each iteration, Jensen–


Shannon divergence is used, rather than Kullback–Leibler, due to the fact that

this last measure is undefined for h+ i (φi (x)) = 0 and hi (φi (x)) = 0:
7.5 Infomax and Jensen–Shannon Boosting 307
 
JS(φi ) = i (φi (x)) ·
h+

h+i (φi (x))


log dφi (x) (7.47)
1 +
[h
2 i (φ i (x)) + h−i (φi (x))]

Jensen–Shannon, as Kullback–Leibler, is not a symmetric measure. This


symmetry is achieved redefining it as
 %
h+i (φi (x))
SJS(φi ) = h+
i (φi (x)) log 1 +
[h
2 i (φ i (x)) + h− i (φi (x))]

hi (φi (x)) (
+ h−i (φ i (x)) log 1 + − dφi (x) (7.48)
2 [hi (φi (x)) + hi (φi (x))]

At iteration k, JS feature is calculated as

φ∗i = arg max JS(φi ) (7.49)


φi

Figure 7.11 shows an example of application of Jensen–Shannon diver-


gence, used in this case to discriminate between three different Haar based
features in the face detection problem. These Haar features are exactly the
same than the ones used in Viola’s Adaboost face detection algorithm. The
divergence between feature based classifier output between positive and neg-
ative samples is highest in the case of the first feature; it is the JS Feature.
As in the case of Infomax Boosting, Jensen–Shannon divergence and error
classification are related measures.

JSBoost vs. other boosting algorithms

Algorithm 18 shows JSBoosting. The main difference with previous methods


is how the final classifier is built. The weak learners are not weighted by
a coefficient, like for example in Infomax. Another small difference is how
weights are initialized; also, a coefficient βk is applied at iteration k to weight
updating, in order to control how fast the weight is updated. The value of βk
is defined as
1 − k
βk = log (7.50)
k
where k is the training error of weak learner at iteration k. However, as stated
before, SJS value is used to select a weak classifier at each iteration, rather
than this classification error.
Like in other cases, JSBoost was first applied to face detection problem;
the operator or feature from where weak learners were built is called Local
Binary Pattern (LBP), that is a not linear feature. Experiments compar-
ing face detection performance using this kind of features demonstrate that
308 7 Classifier Design

0.35 15

0.3 10
0.25
5
0.2
0
0.15
−5
0.1

0.05 −10

0 −15
−1.5 −1 −0.5 0 0.5 1 1.5 2 −1 −0.5 0 0.5 1 1.5
x 104 x 104
0.035 10
0.03
5
0.025

0.02 0

0.015 −5
0.01
−10
0.005

0 −15
−15000 −10000 −5000 0 5000 −10000 −8000−6000 −4000 −2000 0 2000
0.4 10
0.35
5
0.3
0.25 0
0.2
0.15 −5

0.1
−10
0.05
0 −15
−200 −100 0 100 200 −150 −100 −50 0 50 100 150

Fig. 7.11. Example of JS Feature pursuit. Three Haar based features and an exam-
ple of application to a positive sample are shown, from best to worst. At the center
of each row, the corresponding h+ (x) (dash line) and h− (x) (solid line) distributions
are represented, and at the right, φ(x) is shown. SJS values for each feature, from
top to bottom, are 6360.6, 2812.9, and 1837.7, respectively.

JSBoost (98.4% detection rate) improves KLBoost results (98.1%) and out-
performs Real Adaboost ones (97.9%). Furthermore, JSBoost achieves higher
detection rates with a lower number of iterations than other methods.

7.6 Maximum Entropy Principle for Classification


7.6.1 Improved Iterative Scaling

Maximum entropy is a general technique for estimating probability distribu-


tions from data. It has been successfully applied in pattern recognition tasks
of different fields, including natural language processing, computer vision, and
bioinformatics. The maximum entropy classifier is based on the idea that the
7.6 Maximum Entropy Principle for Classification 309

Algorithm 18: JSBoosting algorithm


Input: set of N+ positive and N− negative labeled examples
{x− −
1 , ..., xN− , x1 , ..., xN + }
+ +

Initialize wi = 2N1+ for positive samples and wi = 2N1− for negative samples
for k=1 to K do
Select JS feature φk by Jensen–Shannon divergence using weights wi
1 h+ (φk (x))
fk (x) = log k
2 h−
k
(φk (x))
Update weights wi = wi · exp(−β
 k ) · yi · fk (xi ), i = 1, . . . , N , and
normalize weights so that i wi = 1
end
Output:⎧Strong classifier:
⎪ 

K
1 h+k (φk (x))
1, log − ≥0
hf (x) = 2 h k (φk (x))

⎩ 1=1
0, otherwise

Table 7.3. A toy example of a classification problem with two classes C =


{motorbike, car} and an original feature space of D = 3 features. There are five
samples in this training set.
Original features
c 1: “has gears” 2: “# wheels” 3: “# seats”
Motorbike yes 3 3
Motorbike no 2 1
Motorbike yes 2 2
Car yes 4 5
Car yes 4 2

most suitable model for classification is the most uniform one, given some con-
straints which we call features. The maximum entropy classifier has to learn
a conditional distribution from labeled training data. Let x ∈ X be a sample
and c ∈ C be a label, then the distribution to be learnt is P (c|x), which is to
say, we want to know the class given a sample. The sample is characterized
by a set of D features (dimensions). For the maximum entropy classifier we
will formulate the features as fi (x, c), and 1 ≤ i ≤ NF , and NF = D|C|. Each
sample has D features for each one of the existing classes, as shown in the
following example.
Table 7.3 contains the original feature space of a classification problem
where two classes of vehicles are described by a set of features. In Table 7.4
the features are represented according to the formulation of the maximum
entropy classifier.
310 7 Classifier Design

Table 7.4. The values of the feature function fi (x, c) for the training data.
fi (x, c)
Class Class(i) = 1 = motorbike Class(i) = 2 = car
feature “has gears” “# wheels” “# seats” “has gears” “# wheels” “# seats”
x, c \ i i=1 i=2 i=3 i=4 i=5 i=6
x = 1, c = 1 1 3 3 0 0 0
x = 2, c = 1 0 2 1 0 0 0
x = 3, c = 1 1 2 2 0 0 0
x = 4, c = 2 0 0 0 1 4 5
x = 5, c = 2 0 0 0 1 4 2

The feature function is defined as


mod(i, c) if class(i) = c
fi (x, c) = (7.51)
0 otherwise

In the following equations f will form part of a product, multiplying a prob-


ability, so when the value of the feature is 0, the probability will also be null.
Originally the features f are defined as a binary function; however, a gener-
alization is made to allow real positive values. If a feature has a natural value
h, for the model this is the same as declaring the existence of the same fea-
ture h times. Generalizing in the same way real positive values are allowed for
the features. The maximum entropy classifier has been widely used in Natural
Language Processing [19] where many problems are easily formulated in terms
of binary features which may be present in a text zero times, once, or more
than once.
The maximum entropy principle is used to model a conditional distribu-
tion P (c|x) which is restricted to have as expected values for the features
fi (x, c), the values shown in 7.4. These expected values with respect to the
data distribution P (x, c) are denoted as

P (f ) ≡ P (x, c)f (x, c)
x∈X c∈C
 
= P (x) P (c|x)f (x, c) (7.52)
x∈X c∈C

Here P (x, c) and P (x) are the expected values in the training sample. The
equation is a constraint of the model, it forces the expected value of f to
be the same as the expected value of f in the training data. In other words,
the model has to agree with the training set on how often to output a feature.
The empirically consistent classifier that maximizes entropy is known as
the conditional exponential model. It can be expressed as
N 
1  F

P (c|x) = exp λi fi (x, c) (7.53)


Z(x) i=1
7.6 Maximum Entropy Principle for Classification 311

where λi are the weights to be estimated. In the model there are NF weights,
one for each feature function fi . If a weight is zero, the corresponding fea-
ture has no effect on classification decisions. If a weight is positive then the
corresponding feature will increase the probability estimates for labels where
this feature is present, and decrease them if the weight is negative. Z(x) is a
normalizing factor for ensuring a correct probability and it does not depend
on c ∈ C: N 
 F

Z(x) = exp λi fi (x, c) (7.54)


c∈C i=1

A detailed derivation is presented in [18]. In his work it is proved that this


model is the one that is closest to the expected probability of the features in
the training data in the sense of Kullback–Leibler divergence. The estimation
of the parameters (weights) is a very important point and is not trivial.
In the following equation the log-likelihood function of the empirical dis-
tribution of the training data is expressed:
 
L(p) ≡ log P (c|x)P (x,c)
x∈X c∈C

= P (x, c) log P (c|x) (7.55)
x∈X c∈C

This is a dual maximum-likelihood problem. It could be solved with the


Expectation Maximization (EM) algorithm. However, when the optimization
is subjected to the feature constraints, the solution to the primal maximum
entropy problem is also the solution to the dual maximum-likelihood problem
of Eq. 7.55, provided that the model has the same exponential form. In [18] it
is also explained that due to the convex likelihood surface of the model, and
due to the fact that there are no local maxima, it is possible to find the solu-
tion using a hill climbing algorithm. A widely used hill climbing method is the
Generalized Iterative Scaling. The Improved Iterative Scaling performs better
and will be outlined in this subsection. Also, there are other optimization tech-
niques that can be used, such as gradient descent, variable metric, and so on.
The Improved Iterative Scaling algorithm (see Alg. 19) performs a hill-
climbing search among the log-likelihood space formed by the parameters λi .
It is guaranteed to converge on the parameters that define the maximum
entropy classifier for the features of a given data set.
Equation 7.56 can be solved in closed-form or with other root-finding pro-
cedure, such as Newton’s numerical method.
Once the parameters λi are calculated, the conditional probability distri-
bution P (c|x) is modeled and classification of new examples can be performed
using Eq. 7.53.
Let us see a classification toy-example with the data from Table 7.4. The
Improved Iterative Scaling algorithm converges to the weights:

(λ1 , . . . , λ6 ) = (0.588, −0.098, −0.090, −0.588, 0.098, 0.090)


312 7 Classifier Design

Algorithm 19: Improved Iterative Scaling


Input: A distribution of classified samples p(x, c) and a set of NF features
fi .
Initialize λi = 0, ∀i ∈ {1, 2, . . . , NF }
repeat
foreach i ∈ {1, 2, . . . , NF } do
Solve for Δλi the equation:

  
NF
P (x) P (c|x)fi (x, c) exp(Δλi fj (x, c))
 c∈C
x∈X j=1 (7.56)
= P (x, c)fi (x, c)
x∈X c∈C

Update λi ← λi + Δλi
end
until convergence of all λi ;
Output: The λi parameters of the conditional exponential model (Eq. 7.53).

Table 7.5. The values of the feature function fi (x, c) for the new sample supposing
that its class is motorbike (x = 6, c = 1) and another entry for the supposition that
the class is car (x = 6, c = 2).
fi (x, c)
Class Class(i) = 1 = motorbike Class(i) = 2 = car
feature “has gears” “# wheels” “# seats” “has gears” “# wheels” “# seats”
x, c \ i i=1 i=2 i=3 i=4 i=5 i=6
x = 6, c = 1 1 4 4 0 0 0
x = 6, c = 2 0 0 0 1 4 4

We want to classify a new unlabeled sample which has the following features:
has gears, 4 wheels, 4 seats
We have to calculate the probabilities for each one of the existing classes, in
this case C = motorbike, car. Let us put the features using the notation of
Table 7.4. We have two new samples, one of them is supposed to belong to
the first class, and the other one to the second class (see Table 7.5).
According to the estimated model the conditional probability for the sam-
ple (x = 6, c = 1) is
N 
1  F

P (c = 1|x = 6) = exp λi fi (6, 1) (7.57)


Z(6) i=1
7.6 Maximum Entropy Principle for Classification 313

where the exponent is equal to



NF
λi fi (6, 1)
i=1
= 0.588 · 1 − 0.098 · 4 − 0.090 · 4 − 0.588 · 0 + 0.098 · 0 + 0.090 · 0
= −0.164 (7.58)
and the normalizing factor is
N  N 
 F F

Z(6) = exp λi fi (6, 1) + exp λi fi (6, 2)


i=1 i=1
= exp(0.588 · 1 − 0.098 · 4 − 0.090 · 4 − 0.588 · 0 + 0.098 · 0 + 0.090 · 0)
+ exp(0.588 · 0 − 0.098 · 0 − 0.090 · 0 − 0.588 · 1 + 0.098 · 4 + 0.090 · 4)
= 0.8487 + 1.1782 = 2.027 (7.59)
so the probability is
1
P (c = 1|x = 6) = exp (−0.164) = 0.4187 (7.60)
2.027
Performing the same calculations for the other class (x = 6, c = 2) we have
N 
1  F

P (c = 2|x = 6) = exp λi fi (6, 2) (7.61)


Z(6) i=1

where the exponent is equal to



NF
λi fi (6, 2)
i=1
= 0.588 · 0 − 0.098 · 0 − 0.090 · 0 − 0.588 · 1 + 0.098 · 4 + 0.090 · 4
= 0.164 (7.62)
and the probability is
1
exp (0.164) = 0.5813
P (c = 2|x = 6) = (7.63)
2.027
Therefore the maximum entropy classifier would label the sample “has
gears, 4 wheels, 4 seats” as a car, with a 0.5813 probability.

7.6.2 Maximum Entropy and Information Projection


Besides classifier design, along this book the ME principle has been applied to
several problems (it appears in almost all chapters). In all cases, but when pre-
senting iterative scaling, we have outlined solutions based on the primal formu-
lation of the problem (that is, go ahead to estimate the Lagrange multipliers
related to the expectation constraints). In the previous subsection we have re-
ferred to iterative scaling as an algorithm solving the dual maximum-likelihood
314 7 Classifier Design

problem. Actually, iterative scaling and its improvements are good methods
for working in the dual space, but, what is the origin of this dual framework?
Let us start by the classical discrete formulation of the ME problem:

 1

p (x) = arg max p(x)
p(x)
x
log p(x)

s.t p(x)Fj (x) = aj , j = 1, . . . , m
x

p(x) = 1
x
p(x) ≥ 0 ∀x , (7.64)
where aj are empirical estimations of E(Gj (x)). It is particularly interesting
here to remind the fact that the objective function is concave and, also, that
all constraints are linear. These properties are also conserved in the following
generalization of the ME problem:

 p(x)

p (x) = arg min D(p||π) = p(x) log
p(x)
x
π(x)

s.t p(x)Fj (x) = aj , j = 1, . . . , m
x

p(x) = 1
x
p(x) ≥ 0 ∀x (7.65)
where π(x) is a prior distribution, and thus, when the prior is uniform we
have the ME problem. This generalization is known as the Kullback’s mini-
mum cross-entropy principle [100]. The primal solution is obtained through
Lagrange multipliers, and has the form of the following exponential function
(see Section 3.2.2):
1 m
p∗ (x) = e j=1 λj Fj (x) π(x) (7.66)
Z(Λ)
where Λ = (λ1 , . . . , λm ), and the main difference with respect to the solution
to the ME problem is the factor corresponding to the prior π(x). Thinking
of the definition of p∗ (x) in Eq. 7.66 as a family of exponential functions
(discrete or continuous) characterized by: an input space S ⊆ Rk , where the x
are defined, Λ ∈ Rm (parameters), F (x) = (F1 (x), . . . , Fm ) (feature set), and
π(x): Rk → R (base measure, not necessarily a probability measure). Thus,
the compact form of Eq. 7.66 is
1 Λ·F (x)
pΛ (x) = e π(x) (7.67)
Z(Λ)
 Λ·F (x)
and considering that  Z(Λ) = xe π(x), the log-partition function is
denoted G(Λ) = ln x eΛ·F (x) π(x). Thus it is almost obvious that
7.6 Maximum Entropy Principle for Classification 315

1 Λ·F (x)
pΛ (x) = e π(x) = eΛ·F (x)−G(Λ) π(x) (7.68)
eG(Λ)
where the right-hand side of the latter equation is the usual definition of an
exponential family:
pΛ (x) = eΛ·F (x)−G(Λ) π(x) (7.69)
The simplest member of this family is the Bernoulli distribution defined over
S = {0, 1} (discrete distribution) by the well-known parameter θ ∈ [0, 1] (the
success probability) so that p(x) = θx × (1 − θ)1−x , x ∈ S. It turns out that
E(x) = θ. Beyond the input space S, this distribution is posed in exponential
form by setting: T (x) = x, π(x) = 1 and choosing
pλ (x) ∝ eλx
pλ (x) = eλx−G(λ)
 + ,
G(λ) = ln eλx = ln eλ·0 + eλ·1 = ln(1 + eλ )
x
λx
e
pλ (x) =
1 + eλ

pλ (1) = =θ.
1 + eλ
eλ 1
pλ (0) = 1 − pλ (1) = 1 − λ
= =1−θ (7.70)
1+e 1 + eλ
The latter equations reveal a correspondence between the natural space λ and
the usual parameter θ = eλ /(1 + eλ ). It is also interesting to note here that
the usual parameter is related to the expectation of the distribution. Strictly
speaking, in the general case the natural space has as many dimensions as the
number of features, and it is defined as N = {Λ ∈ Rm : −1 < G(Λ) < 1}.
Furthermore, many properties of the exponential distributions, including the
correspondence between parameter spaces,  emerge from the analysis of the
log-partition. First of all, G(Λ) = ln x eΛ·F (x) π(x) is strictly convex with
respect to Λ. The partial derivatives of this function are:
∂G 1 ∂  Λ·F (x)
= e π(x)
∂λi G(Λ) ∂λi x
 
1  ∂ m
= e j=1 λj Fj (x) π(x)
G(Λ) x ∂λi
1   m
= e j=1 λj Fj (x) π(x)Fi (x)
G(Λ) x
1  Λ·F (x)
= e π(x)Fi (x)
G(Λ) x

= pΛ (x)Fi (x)
x
= EΛ (Fi (x)) = ai (7.71)
316 7 Classifier Design

This result is not surprising if one considers that G (Λ) = EΛ (F (x)) = a,


which is obvious to derive from above. Furthermore, the convexity of the log-
partition function makes this derivative unique and thus it is possible to make
a one-to-one correspondence between the natural parameters Λ ∈ Rm and the
expectation parameters a lying also in Rm .
Beyond the latter correspondence, the deepest impact of the connection
between the derivative of the log-partition and the expectations in learning is
the finding of the distribution maximizing the log-likelihood of the data. Let
X = {x1 , . . . , xN } be a set of i.i.d. samples of a distribution pΛ (x) belonging
to the exponential family. Then, the log-likelihood is


N 
N
(X|Λ) = log pΛ (xi ) = log(pΛ (xi ))
i=1 i=1

N
= (Λ · F (xi ) − G(Λ) + log π(xi ))
i=1
 

N 
N
=Λ ·F (xi ) − N G(Λ) + log π(xi ) (7.72)
i=1 i=1

Then, for finding the distribution maximizing the log-likelihood we must set
to zero its derivative with respect to Λ:
N 


 (X|Λ) = 0 ⇒ ·F (xi ) − N G (Λ) = 0
i=1

1 
N
⇒ G (Λ) ≡ a = ·F (xi ) (7.73)
N i=1

which implies that Λ corresponds to the unique possible distribution fitting the
average vector a. Thus, if we consider the one-to-one correspondence, between
the natural space and the expectation space established through G (Λ), the
maximum-likelihood distribution is determined up to the prior π(x). If we
know the prior beforehand, there is a unique member of the exponential dis-
tribution family, p∗ (x) satisfying EΛ (F (x)) = a which is the closest to π(x),
that is, which minimizes D(p||π). That is, any other distribution p(x) satisfy-
ing the constraints is farther from the prior:

D(p||π) − D(p∗ ||π)


 p(x)  ∗ p∗ (x)
= p(x) log − p (x) log
x
π(x) x
π(x)
7.6 Maximum Entropy Principle for Classification 317
 p(x)  ∗
= p(x) − p (x)(Λ∗ · F (x) − G(Λ∗ ))
x
π(x) x
 p(x)  
= p(x) − p(x)(Λ∗ · F (x)) − p∗ (x)G(Λ∗ )
x
π(x) x x
 p(x)  
= p(x) − p(x)(Λ∗ · F (x)) − p∗ (x)G(Λ∗ )
x
π(x) x x
 p(x)  
= p(x) − p(x)(Λ∗ · F (x) p(x)G(Λ∗ )
x
π(x) x x
 p(x) 
= p(x) − p(x)(Λ∗ · F (x) − G(Λ∗ ))
x
π(x) x
 p(x)  p∗ (x)
= p(x) − p(x) log
x
π(x) x
π(x)
 p(x)
= p(x) log = D(p||p∗ ) (7.74)
x
p∗ (x)

where the change of variable from p∗ (x) to p(x) in the right term of the fourth
line is justified
 by the fact that
both pdfs satisfy the expectation
 constraints.
Thus: x p∗ (x)(Λ

· F (x)) = x p(x)(Λ ∗
· F (x)) because ∗
x p(x) F (x) =
EΛ∗ (F (x)) = x p(x)F (x) = EΛ (F (x)) = a. This is necessary so that the
Kullback–Leibler divergence satisfies a triangular equality (see Fig. 7.12, top):

D(p||π) = D(p||p∗ ) + D(p∗ ||π) (7.75)

Otherwise (in the general case that p∗ (x) to p(x) satisfy, for instance, the same
set of inequality constraints) the constrained minimization of cross-entropy
(Kullback–Leibler divergence with respect to a prior) satisfies [45, 144]:

D(p||π) ≥ D(p||p∗ ) + D(p∗ ||π) (7.76)

In order to prove the latter inequality, we firstly remind that D(p∗ ||π) =
minp∈Ω D(p||π), Ω being a convex space of probability distributions (for in-
stance, the space of members of the exponential family). Then, let p∗α (x) =
(1 − α)p∗ (x) + αp(x), with α ∈ [0, 1], be pdfs derived from a convex combina-
tion of p∗ (x) and p(x). It is obvious that

D(p∗α ||p) ≥ D(p∗ ||p) ≡ D(p∗0 ||p) ≥ 0 (7.77)


318 7 Classifier Design

D(p||π) π
D(p*||π)

D(p||p*) p*= I-projection(π)

p Ξ={p:E(F(x))=a}

Ω space

Ξ3={p:E(F3(x))=a3} Ξ2={p:E(F2(x))=a2}

p0=π

p'2

p1
p5 p3
p7
p* p
Ξ1={p:E(F1(x))=a1}

Fig. 7.12. Top: triangular equality. Bottom: solving by alternating projections in


different subspaces until convergence.

and, thus, D (p∗0 ||p) ≥ 0. Now, taking the derivative of D(p∗α ||p) with respect
to α:
0 ≤ D (p∗α ||p)
 8
d  ∗ (1 − α)p∗ (x) + αp(x) 88
= ((1 − α)p (x) + αp(x)) log 8 α=0
dα x
π(x)
 
p∗ (x)
= (p(x) − p∗ (x)) 1 + log
x
π(x)
 ∗
  
p (x) ∗ ∗ p∗ (x)
= p(x) + p(x) log − p (x) + p (x) log
x
π(x) x
π(x)
7.6 Maximum Entropy Principle for Classification 319
   
p∗ (x) p∗ (x)
= p(x) log − p∗ (x) log
x
π(x) x
π(x)
   
p(x) p∗ (x) p∗ (x)

= p(x) log − p (x) log
x
π(x) p(x) x
π(x)
   
p(x) p∗ (x) ∗ p∗ (x)
= p(x) log + p(x) log − p (x) log
x
π(x) p(x) x
π(x)
= D(p||π) − D(p∗ ||p) − D(p∗ ||π) (7.78)
Therefore,
D(p||π) − D(p∗ ||p) − D(p∗ ||π) ≥ 0 ≡ D(p||π) ≥ D(p∗ ||p) + D(p∗ ||π) (7.79)
Thus, for the equality case (the case where the equality constraints are satis-
fied) it is timely to highlight the geometric interpretation of such triangular
equality. In geometric terms, all the pdfs of the exponential family which sat-
isfy the same set of equality constraints lie in an affine subspace Ξ (here is the
connection with information geometry, introduced in Chapter 4). Therefore,
p∗ can be seen as the projection of π onto such subspace. That is the defini-
tion of I-projection or information projection. Consequently, as p is also the
I-projection of π in the same subspace, we have a sort of Pythagorean theorem
from the latter equality. Thus, the Kullback–Leibler divergences play the role
of the Euclidean distance in the Pythagorean theorem. In addition, pi and p∗
belong to Ω, the convex space. Actually, the facts that Ω is a convex space,
and also the convexity of D(p||π), guarantee that p∗ is unique as we have seen
above. Anyway, the p∗ distribution has the three following equivalent prop-
erties: (i) it is the I-projection of the prior; (ii) it is the maximum likelihood
distribution in Ω; and (iii) p∗ ∈ Ξ ∩ Ω.
In algorithmic terms, the latter geometric interpretation suggests an iter-
ative approach to find p∗ from an initial p, for instance π. A very intuitive
approach was suggested by Burr 20 years ago [32]. For the sake of simplic-
ity, consider that we have only two expectation constraints to satisfy, that
is F (x) = (F1 (x), F2 (x)), and E(F1 (x)) = a1 , E(F2 (x)) = a2 , thus being
a = (a1 , a2 ), and Λ = (λ1 , λ2 ) (here, for simplicity, we do not include λ0 , the
multiplier ensuring up-to-one sum). Each of the constraints has associated a
different convex space of probability distributions satisfying them: Ξ1 and Ξ2 ,
respectively. Obviously, p∗ lies in Ξ1 ∩ Ξ2 . However, starting by pt=0 = π it is
possible to alternatively generate pt ∈ Ξ1 (t even) and pt ∈ Ξ2 (t odd) until
convergence to p∗ . More precisely, p1 has the form
p1 (x) = eλ1 F1 (x)−G(λ1 ) π(x) (7.80)

as G (λ1 ) = Eλ1 (F1 (x)) = a1 , then λ1 is uniquely determined. Let us then
project p1 onto Ξ2 using p1 as prior. Then we have
p2 (x) = eλ2 F2 (x)−G(λ2 ) p1 (x)
= eλ2 F2 (x)−G(λ2 ) eλ1 F1 (x)−G(λ1 ) π(x)
= eλ1 F1 (x)+λ2 F2 (x)−(G(λ1 )+G(λ2 )) π(x) (7.81)
320 7 Classifier Design

where we must find λ2 , given λ1 , which, in general, does not satisfy the con-
straint generating Ξ2 . However, a key element in this approach is that the
triangular equality (Eq. 7.75) ensures that: (i) p1 is closer to p∗ than p0 ; and
(ii) p2 is closer to p∗ than p1 . In order to prove that, the cited triangular
equality is rewritten as

D(p||π) = D(p∗ ||π) − D(p∗ ||p) ≡ D(p∗ ||p) = D(p∗ ||π) − D(p||π) (7.82)

because D(p||p∗ ) = −D(p∗ ||p). This new form of the triangular equality is
more suitable for giving the intuition of the convergence of alternate projec-
tions:

D(p∗ ||p1 ) = D(p∗ ||p0 ) − D(p1 ||p0 )


D(p∗ ||p2 ) = D(p∗ ||p1 ) − D(p2 ||p1 )
...
D(p∗ ||pt ) = D(p∗ ||pt−1 ) − D(pt ||pt−1 ) (7.83)

The latter iteration alternates projecting onto subspace Ξ1 and Ξ2 , using


the previous probability as a prior, and finding the multipliers satisfying the
corresponding constraint, until a p∗ ∈ Ξ1 ∩ Ξ2 is found. As we have seen
above, such minimizer is unique. The convergence speed of this process was
experimentally compared with a Newton–Raphson iterative scheme and the
speed of the alternated method is faster or equal, but never slower, than the
one of the iterative schemes.
A generalization of the latter idea, when having m constraints, is to project
the current prior (result of iteration t) onto the subspace Ξt( mod m) (itera-
tive scaling). This scheme assumes an arbitrary order in the selection of the
next subspace to project onto. As we illustrate in Fig. 7.12 (bottom), alter-
nating between near parallel subspaces slows down the search, with respect to
alternating between near orthogonal subspaces. However, the search for the
constraint farthest from the current point could add an inadmissible overload
to the search process. It seems more convenient to take into account all the
constraints at each step. This is the motivation of generalized iterative scal-
ing [47], whose geometric interpretation is due to Csziszár [44]. Let Ξ be the
space of probability distributions satisfying m expectation constraints:


Ξ= p: p(x)Fj (x) = aj j = 1, . . . , m (7.84)
x

The latter space is called a linear


m family. It is assumed, without loss of gen-
erality, that Fj (x) ≥ 0 and j=1 Fj (x) = 1. If the latter conditions do not
apply for our problem, the original features Fjo are affine scaled properly
m
(Fj = aFjo + b) so that j=1 Fj (x) ≤ 1, and if the latter inequality is strict,
m
an additional function (Fm+1 = 1 − j=1 Fj ) and, thus, a new constraint,
must be added. Let us also define the families
7.6 Maximum Entropy Principle for Classification 321

Ψ = {π̃(x, j) = π(x)Fj (x) j = 1, . . . , m}


⎧ ⎫
⎨ m ⎬
Ψ̃1 = π̃(x, j) = π(x)Fj (x) : π̃(x, j) = aj j = 1, . . . , m
⎩ ⎭
j=1
⎧ ⎫
⎨ 
m ⎬
Ψ̃2 = p̃(x, j) = p(x)Fj (x) : p̃(x, j) = aj j = 1, . . . , m (7.85)
⎩ ⎭
j=1

The latter definitions ensure that p̃∗ (x, j) = p∗ (x)π̃(x, j). Furthermore, as Ψ̃1
and Ψ̃2 are linear families whose marginals with respect to j yield a, then
Ψ = Ψ̃1 ∩ Ψ̃2 . This implies that we may iterate alternating projections onto
the latter spaces until convergence to the optimal p∗ lying in the intersection.
We define p̃2t (x, j) = pt (x)Fj , t = 0, 1, . . ., with p0 = π. Then, let p̃2t+1 be
the I-projection of p̃2t on Ψ̃1 , and, p̃2t+2 the I-projection of p̃2t+1 on Ψ̃2 . The
projection of p̃2t on Ψ̃1 can be obtained by exploiting the fact that we are
projecting on a family of distribution defined by a marginal. Consequently
the projection can be obtained by scaling the projecting pdf with respect to the
value of the corresponding marginal:

j p̃2t (x, j)
p̃2t+1 (x, j) = p̃2t (x, j) 
x p̃2t (x, j)
aj 
= pt (x)Fj (x) , aj,t = pt (x)Fj (x) (7.86)
aj,t x

where 0/0 = 0. Then, the projection of p̃2t+2 is obtained by minimizing the


cross entropy of pdfs in Ψ̃2 from p̃2t+1 . That is, we must minimize:


m 
p̃(x, j)
D(p̃(x)||p̃2t+1 (x)) = p̃(x, j) log
x j=1
p̃2t+1 (x, j)


m
p(x)Fj (x)
= p(x)Fj (x) log aj
x j=1
pt (x)Fj (x) aj,t

 
m
p(x)
= p(x) Fj (x) log aj
x j=1
pt (x) aj,t
⎧ ⎫
 ⎨
p(x) m
aj,n ⎬
= p(x) log + Fj (x) log
⎩ pt (x) j=1 aj ⎭
x
⎧ ⎫
⎨ m  F (x)
 p(x)  aj,n j ⎬
= p(x) log + log (7.87)
⎩ pt (x) aj ⎭
x j=1
322 7 Classifier Design

In order to find a minimizer for the latter equation, the definition of the
following quantity:

m  Fj (x)
aj Rt+1 (x)
Rt+1 (x) = pt (x)
a
that is pt (x) =
 & 'Fj (x) (7.88)
j,n m aj
j=1
j=1 aj,n

and the inclusion of a normalizing factor Zt+1 so that Rt+1 (x)/Zt+1 is a


probability distribution, imply
⎧ ⎫
 ⎨ m  Fj (x) ⎬
p(x) aj,t
D(p̃(x)||p̃2t+1 (x)) = p(x) log + log
x
⎩ p t (x) aj ⎭
j=1
⎧ ⎫

⎨ ⎪

 p(x)
m & aj 'Fj (x) ⎭
= p(x) log

⎩ ⎪
x pt (x) j=1 aj,t
 
p(x) 1 1
= p(x) log + log − log
x
Rt+1 (x) Zt+1 Zt+1

 p(x) 1
= p(x) log 1 + log
x Zt+1 Rt+1 (x)
Zt+1
 
1 1
= D p(x)|| Rt+1 (x) + log (7.89)
Zt+1 Zt+1

Thus, the minimizing distribution is pt+1 (x) = Rt+1 (x)/Zt+1 , the minimum
being 1/Zt+1 . Consequently we obtain

m  Fj (x)
1 aj
p̃2t+1 (x, j) = pt+1 (x)Fj (x), pt+1 (x) = pt (x) (7.90)
Zt+1 j=1
aj,t

and the following recurrence relation, defining the generalized iterative scal-
ing [47], is satisfied:

m  Fj (x) 
aj
Rt+1 (x) = Rt (x) , bj,t = Rt (x)Fj (x) (7.91)
j=1
bj,t x

with R0 = p0 = π. Furthermore, the so-called improved generalized iterative


scaling, described in the previous section, has the latter recurrence relation
as starting point. However, this algorithm is designed from the maximization
of the log-likelihood. More precisely, such maximization is performed by de-
riving the log-likelihood with respect to the multipliers. As we have shown in
Eq. 7.72, the log-likelihood for a pdf belonging to the exponential family is
given by
7.6 Maximum Entropy Principle for Classification 323


N 
N
(X|Λ) = log(pΛ (xi )) = (Λ · F (xi ) − G(Λ) + log π(xi )) (7.92)
i=1 i=1

If we are seeking for an iterative gradient-ascent approach Λ ← Λ + Δ, we


need to choose, at each step Δ satisfying the following monoticity inequality
(X|Λ + Δ) − (X|Λ) ≥ 0. Then, obtaining the following equivalence:


N
(X|Λ + Δ) − (X|Λ) = (Δ · F (xi )) − G(Λ + Δ) + G(Λ) (7.93)
i=1

is straightforward. Then, from the definitions of G(Λ) and G(Λ + Δ) in terms


of the logarithm of the partition function we have


N
(Δ · F (xi )) − G(Λ + Δ) + G(Λ)
i=1
   

N  
= (Δ · F (xi )) − log (Λ+Δ)·F (xi )
e π(xi ) + log Λ·F (xi )
e π(xi )
i=1 xi xi
 

N
e(Λ+Δ)·F (xi ) π(xi )
= (Δ · F (xi )) − log  Λ·F (x ) xi

xi e
i π(x )
i=1 i
  
N
xi p(xi )Z(Λ)e
Δ·F (xi )
= (Δ · F (xi )) − log 
i=1 xi p(xi )Z(Λ)
  
N
xi p(xi )e
Δ·F (xi )
= (Δ · F (xi )) − log 
i=1 xi p(xi )
 
N 
= (Δ · F (xi )) − log p(xi )eΔ·F (xi )
= (X|Λ + Δ)− (X|Λ).
i=1 xi
(7.94)

Then, as − log α ≥ 1 − α for α > 0, we have


N & '
(X|Λ+Δ)−(X|Λ) ≥ (Δ · F (xi )) + 1 − p(xi )eΔ·F (xi ) ≥ 0 (7.95)
i=1 xi
  
A(Δ|Λ)

Then, the right side of the latter equation is a lower bound of the log-likelihood
increment. Next formal task consists of posing that lower bound A(Δ|Λ) in
terms of the δj : Δ = (δ1 , . . . , δj , . . . , δm ) so that setting to zero the partial
derivatives with respect to each δj allows to obtain it. To that end, it is key
to realize that the derivatives of A(Δ|Λ) will leave each λj as a function
of all the other ones due to the exponential (coupled variables). In order to
324 7 Classifier Design

decouple each λj so that each partial derivative depends only on it, a useful
trick (see [17]) is to perform the following transformation on the exponential:
m m m δj Fj (xi )
δj Fj (xi ) ( Fj (xi )) m
eΔ·F (xi ) = e
j=1 j=1 F (x )
j=1 =e (7.96)
j=1 j i


Then, let us define
 pj (x) = Fj (x)/( j Fj (x)) the pdf resulting from the
introduction of j Fj (x) in the exponentiation. The real trick comes when
Jensen’s inequality for pj (x) is exploited. As we have seen in other chapters in
the book (see for instance Chapter 3 on segmentation), given a convex function
 ϕ(E(x)) ≤ 
ϕ(x) we have that E(ϕ(x)). Therefore, as the exponential
 is convex
we have that e x pj (x)q(x) ≤ x p(x)eq(x) . Then setting q(x) = j Fj (x) we
have that
⎛ ⎞ ⎛ ⎞
N 
m  
m m
A(Δ|Λ) ≥ ⎝ δj Fj (xi )⎠ + 1 − p(xi ) ⎝ pj (xi )eδj j=1 Fj (xi ) ⎠
i=1 j=1 xi j=1
  
B(Δ|Λ)
⎛ ⎞

N 
m  & m '
= ⎝ δj Fj (xi )⎠ + 1 − p(xi ) eδj j=1 Fj (xi ) . (7.97)
i=1 j=1 xi

Then we have
∂B(Δ|Λ) 
N  & m '
= Fj (xi ) − p(xi ) Fj (xi )eδj k=1 Fk (xi ) (7.98)
∂δj i=1 x i

and setting each partial derivative to zero we obtain the updating equations.
The difference between the expression above and the ones used in Alg. 19 is
the inclusion of the conditional model used for formulating the classifier. The
verification of these equations is left as an exercise (see Prob. 7.12). A faster
version of the Improved Iterative Scaling algorithm is proposed in [86]. The
underlying idea of this latter algorithm is to exploit tighter bounds by decou-
pling only part of the variables.

7.7 Bregman Divergences and Classification


7.7.1 Concept and Properties
If exponential families, described in the previous section, are a general formal
way of describing different probability distributions, Bregman divergences [29]
generalize the concept of divergence associated to each member of the family.
We will reach this point later in this section. Let us first define formally a
Bregman divergence. Such divergence Dφ (x, y) between the arguments, which
may be real-valued vectors or probability distributions, is characterized by a
convex differentiable function φ(·) called the generator so that
Dφ (x, y) = φ(x) − φ(y) − (x − y)T ∇φ(y) (7.99)
7.7 Bregman Divergences and Classification 325

When considering one-dimensional variables x and y, the first-order Taylor


expansion of φ(x) at y, with x ≥ y, is φ(x) = φ(y) + φ (y)(x − y). Therefore,
the Bregman divergence is usually seen as the tail of the Taylor expansion
(the terms remaining after subtracting the first-order approximation). Thus,
Bregman divergence is defined as the difference between the true value of φ(x)
and the value of the tangent to φ(y) at x. Of course, the magnitude of this
difference, the tail, for fixed x and y, depends on the generator. Some yet
classical generators and their divergences are:

φ(x) = ||x||2 ⇒ Dφ (x, y) = ||x − y||2 (Euclidean),


  xi
φ(x) = xi log xi ⇒ Dφ (x, y) = xi log (Kullback–Leibler),
i i
yi
   xi xi

φ(x) = − log xi ⇒ Dφ (x, y) = − log − 1 (Itakura–Saito)
i i
yi yi
(7.100)

In general, Bregman divergences satisfy Dφ (x, y) ≥ 0 with equality to 0 if and


only if x = y. However, they are not metrics because neither the symmetry
nor the triangle equality hold in general. However, we have the following
pythagorean inequality:

Dφ (x, y) ≥ Dφ (z, y) + Dφ (x, z) (7.101)

where the equality holds when z = min Dφ (w, y) with w ∈ Ω the so-called
Bregman projection of y onto the convex set Ω (this is the generalization of
information projection illustrated in Fig. 7.12). Thus, given the connection
between Bregman divergences and information projection, it is not surprising
that when considering distributions belonging to the exponential family, each
distribution has associated a natural divergence. Actually, there is an inter-
esting bijection between exponential pdfs and Bregman divergences [11]. Such
bijection is established through the concept of Legendre duality. As we have
seen in the previous section, the expressions

pΛ (x) = eΛ·F (x)−G(Λ) π(x) and G (Λ) = EΛ (F (x)) = a (7.102)

establish an interesting bijection between natural (Λ) and expectation a pa-


rameters, due to the strict convexity of G(·). More precisely, given a real-
valued function G(·) in Rm . Then, its conjugate function G∗ (·) is defined as

G∗ (a) = sup {a · Λ − G(Λ)} (7.103)


Λ∈Ω

being Ω = dom(G). The latter equation is also known as the Legendre


transformation of G(·). When G(·) is strictly convex and differentiable over
Θ = int(Ω), there is a unique Λ∗ corresponding to the sup(·) in the latter
equation. Then, by setting the gradient at this point to zero we obtain
326 7 Classifier Design

∇(a · Λ − G(Λ))|Λ=Λ∗ = 0 ⇒ a = ∇G(Λ∗ ) (7.104)

The inverse ∇G−1 : Θ∗ → Θ, with Θ∗ = int(dom(G∗ )), exists because ∇G is


monotonic given the convexity of G. As G∗ is a Legendre transform, we have
that the first derivatives of G and G∗ are inverse of each other: ∇G = (∇G∗ )−1
and (∇G)−1 = ∇G∗ . Furthermore, the convexity of G implies

a(Λ) = ∇G(Λ) and Λ(a) = ∇G∗ (a)


G∗ (a) = Λ(a) · a − G(Λ(a)) ∀a ∈ int(dom(G∗ )) (7.105)

which is key to formulate the exponential family in terms of Bregman diver-


gences:

p(G,Λ) (x) = eΛ·F (x)−G(Λ) π(x)


= e{(a·Λ−G(Λ))+(F (x)−a)·Λ} π(x)

= e{G (a)+(F (x)−a)·Λ)} π(x)
∗ ∗
= e{G (a)+(F (x)−a)·∇G (a)} π(x)
∗ ∗ ∗
(a)}+G∗ (F (x))
= e−{G (a)+G (F (x))−(F (x)−a)·∇G π(x)
= e−DG∗ (F (x),a)bG∗ π(x) (7.106)

DG∗ (F (x), a) being the Bregman divergence generated by G∗ and bG∗ =



eG (F (x)) . If the maximum entropy is not the underlying principle to build
the model, the more standard generalization of a exponential pdf is given by
setting F (x) = x. Anyway, given this new parameterization, the generalization
of the maximum log-likelihood model inference, and even iterative scaling
approaches, is straightforward. Furthermore, as we have noted above, there
is a natural choice of G∗ for defining a different class of distribution in the
exponential family: Euclidean distance yields the Gaussian, Kullback–Leibler
divergence achieves the multinomial, and the Itakura–Sahito distance provides
the geometric distribution (see Prob. 7.13).

7.7.2 Bregman Balls and Core Vector Machines

Focusing on classification, let us consider a simple one-class classification prob-


lem: Given a set of observed vectorized patterns S = {x1 , . . . , xN }, with
xi ∈ Rk , compute a simplified description of S that fits properly that set.
Such description is the center c ∈ Rk minimizing the maximal distortion
D(x, c) ≤ r with respect to S, r (the radius) being a parameter known as
variability threshold which must also be estimated. Thus, both the center and
the radius must be estimated. In other words, the one-class classification prob-
lem can be formulated in terms of finding the Minimum Enclosing Ball (MEB)
of S. In this regard, the classic distortion (the Euclidean one) yields balls like
7.7 Bregman Divergences and Classification 327

||c−x||2 ≤ r2 . Actually the MEB problem under the latter distortion is highly
connected with a variant of Support Vector Machines (SVMs) classifiers [165],
known as Core Vector Machines [158]. More precisely, the approach is known
as Ball Vector Machines [157] which is faster, but only applicable in contexts
where it can be assumed that the radius r is known beforehand.
Core Vector Machines (CVMs) inherit from SVMs the so-called kernel
trick. The main idea behind the kernel trick in SVMs is the fact that when the
original training data are not linearly separable, it may be separable when we
project such data in a space of higher dimensions. In this regard, kernels k(., .),
seen as dissimilarity functions, play a central role. Let kij = k(xi , xj ) = ϕ(xi )·
ϕ(xj ) be the dot product of the projections through function ϕ(·) of vectors
xi , xj ∈ S. Typical examples of kernels are the polynomial and the Gaussian
ones. Given vectors xi√= (u1√ , u2 ) and x √j = (v1 , v2 ) it is straightforward to see
that using ϕ(x) = (1, 2u1 , 2u2 , u21 , 2u1 u2 , u22 ) yields the quadratic kernel
k(xi , xj ) = ϕ(xi ) · ϕ(xj ) = (xi · xj + 1)d with d = 2. Given the definition
of a kernel and a set of vectors S, if the kernel is symmetric, the resulting
|S| × |S| = N × N matrix Kij = k(x i , x
j ) is the Gramm matrix if it is positive
semi-definite, that is, it satisfies i j Kij ci cj ≥ 0 for all choices of real
numbers for all finite set of vectors and choices of real numbers ci (Mercer’s
theorem). Gram matrices are composed of inner products of elements of a set
of vectors which are linearly independent if and only if the determinant of K
is nonzero.
In CVMs, the MEB problem is formulated as finding

min r2
c,r

s.t. ||c − ϕ(xi )||2 ≤ r2 i = 1, . . . , N (7.107)

which is called the primal problem. The primal problem has as many con-
straints as elements in S because all of them must be inside the MEB. Thus,
for any constraint in the primal problem there is a Lagrange multiplier in the
Lagrangian, which is formulated as follows:


N
L(S, Λ) = r2 − λi (r2 − ||c − ϕ(xi )||2 ) (7.108)
i=1

Λ = (λ1 , . . . , λN )T being the multiplier vector. Consequently, the multipliers


will be the unknowns of the dual problem:

max ΛT diag(K) − ΛT KΛ
Λ
s.t ΛT 1 = 1, Λ ≥ 0 (7.109)

being 1 = (1, . . . , 1)T and 0 = (0, . . . , 0)T . The solution to the dual problem
Λ∗ = (λ∗1 , . . . , λ∗1 )T comes from solving a constrained quadratic programming
(QP) problem. Then, the primal problem is solved by setting
328 7 Classifier Design


N 
c∗ = λ∗i ϕ(xi ), r = Λ∗ T diag(K) − Λ∗ T KΛ∗ (7.110)
i=1

In addition, when using the Lagrangian and the dual problem, the Karush–
Kuhn–Tucker (KKT) condition referred to as complementary slackness en-
sures that λi (r2 −||c−ϕ(xi )||2 ) = 0 ∀i. This means that when the ith equality
constraint is not satisfied (vectors inside the hypersphere) then λi > 0 and
otherwise (vectors defining the border, the so-called support vectors) we have
λi = 0. In addition, using the kernel trick and c we have that the distances
between projected points and the center can be determined without using
explicitly the projection function


N 
N 
N
||c∗ − ϕ(xl )||2 = λ∗i λ∗j Kij − 2 λ∗i Kil + Kll (7.111)
i=1 j=1 i=1

This approach is consistent with the Support Vector Data Description


(SVDD) [153] where one-class classification is proposed as a method for
detecting outliers. What is interesting is that the use of kernels allows to
generalize from hyperspherical regions in the Euclidean (original) domain to
more-general regions in the transformed domain. This is important because
in the general case the points belonging to a given class are not uniformly
distributed inside a tight hypersphere. Thus, generalization is desirable, al-
though the degree of generalization depends on the kernel chosen. In general,
Gaussian kernels k(xi , xj ) = e−(xi −xj ) /s coming from an infinite dimension
2 2

projection function yield better results than the polynomial ones, as we show
in Fig. 7.13. In the one-class classification problem, support vectors may be
considered outliers. In the polynomial case, the higher the degree the tighter
the representation (nonoutliers are converted into outliers). Although in this
case many input examples are accepted the hypershere is too sparse. On the
contrary, when using a Gaussian kernel with an intermediate variance there
is a trade-off between overfitting the data (for small variance) and too much
generalization (higher variance).
Core Vector Machines of the type SVDD are highly comparable with
the use of a different distance/divergence (not necessarily Euclidean) in-
stead of using a different kernel. This allows to define Bregman balls and
to face the MEB from a different perspective: the Smallest Enclosing Breg-
man Ball (SEBB) [119]. Then, given a Bregman divergence Dφ , a Bregman
(c,r)
ball Bφ = {x ∈ S : Dφ (c, x) ≤ r}. The definition of Bregman divergence
applied to this problem is

Dφ (c, x) = φ(c) − φ(x) − (c − x)T ∇φ(x) (7.112)

As any Bregman divergence is convex in the first argument, then Dφ (c, x)


is convex in c. Such convexity implies the unicity of the Bregman ball. In
addition, with respect to the redefinition of the Lagrangian, we have
7.7 Bregman Divergences and Classification 329

d=1 d=3 d=6

5 5 5

0 0 0

−5 −5 −5

0 5 10 0 5 10 0 5 10
sigma=1 sigma=5 sigma=15
10 10 10

5 5 5
C=25.0

0 0 0

−5 −5 −5

−10 −10 −10


−5 0 5 −5 0 5 −5 0 5

10 10 10

5 5 5
C= 0.1

0 0 0

−5 −5 −5

−10 −10 −10


−5 0 5 −5 0 5 −5 0 5

Fig. 7.13. Top: SVDD using different polynomial kernels (degrees). Bottom: using
different Gaussian kernels (variances). In both cases, the circles denote the support
vectors. Figure by D. M. J. Tax and R. P. W. Duin [153], ( c Elsevier 2004). See
Color Plates.


N
L(S, Λ) = r − λi (r − Dφ (c, xi )) (7.113)
i=1

with Λ > 0 according to the dual feasibility KKT condition. Considering the
definition of Bregman divergence in Eq. 7.112, the partial derivatives of L
with respect to c and r are

∂L(S, Λ) N N
= ∇φ(c) λi − λi ∇φ(xi ) ,
∂c i=1 i=1

∂L(S, Λ) N
= 1− λi (7.114)
∂r i=1
330 7 Classifier Design

Then, setting the latter derivatives to zero we have


N
λi = 1
i=1
N 

N 
∗ −1
∇φ(c) = λi ∇φ(xi ) ⇒ c = ∇ φ λi ∇φ(xi ) (7.115)
i=1 i=1

Consequently, the dual problem to solve is


 N  
N 
max λi Dφ ∇−1 φ λi ∇φ(xi ) , xi
Λ
i=1 i=1

N
s.t. λi = 1, Λ ≥ 0 (7.116)
i=1

which is a generalization of the SVDD formulation. φ(z) = ||z||2 = z · z =


k
j=1 zj with z ∈ R . More precisely, we have that ∇φ(z) = 2z ⇒
2 k

(∇φ(z))−1 = 1/2z. Therefore, Dφ is defined as follows:


N  =N =2 N T
 = = 
= =
Dφ λ i x i , xi == λi xi = − ||xi || +
2
λ i xi − x i 2xi
= =
i=1 i=1 i=1

N
= || λi xi − xi ||2 (7.117)
i=1

and the dual problem is posed as follows:


= =2
 = =
N
=N =
max λi =
= λ x
j j − x =
i=
Λ
i=1 = j=1 =

N
s.t. λi = 1, Λ ≥ 0 (7.118)
i=1

when the Bregman generator is the Euclidean distance. The key question
here is that there is a bijection between generators and values of the inverse
of the generator derivative which yields c∗ (the so-called functional averages).
As we have seen, for the Euclidean distance, the functional average is c∗j =
N
λi xj,i (arithmetic mean). For the Kullback–Leibler divergence we have
i=1
∗ N
cj = i=1 xλj,ii (geometric mean), and for the Itakura–Saito distance we obtain
N
c∗j = 1/ i=1 xλj,i i
(harmonic mean). Then, the solution to the SEBB problem
depends on the choice of the Bregman divergence. Anyway the resulting dual
problem is complex and it is more practical to compute g∗ , an estimation of
c∗ , by solving the following problem:
7.7 Bregman Divergences and Classification 331

min r
g,r

s.t. ||g − ∇φ(xi )||2 ≤ r , i = 1, . . . , N (7.119)

which exploits the concept


Nof functional averages and its connection with
the definition ∇φ(c∗ ) = i=1 λi ∇φ(xi ). It can be proved that g∗ is a good
approximation of the optimal center. More precisely

Dφ (x, ∇−1 φ(g)) + Dφ (∇−1 φ(g), x) ≤ (1 + 2 )r∗ /f (7.120)

f being the minimal nonzero value of the Hessian norm ||Hφ|| inside the
convex closure of S and  the error assumed for finding the center of the
MEB within an approximation algorithm using the Euclidean distance: it is
assumed that r ≤ (1+)r∗ , see for instance [33], and points inside this ball are
called core sets. Such algorithm, named BC, can be summarized as follows:
(i) choose at random c ∈ S; (ii) for a given number of iterations T − 1, for
t = 1, . . . , T −1: set x ← arg maxx ∈S ||c−x ||2 and then set c ← t+1
t 1
c+ t+1 x.
The underlying idea of the latter algorithm is to move along the line between
the current estimation of the center and the new one. This approach is quite
efficient for a large number of dimensions. The adaptation to find the center
when using Bregman Divergerces (BBC algorithm) is straigthforward: simply

change the content& ' x ← arg maxx ∈S Dφ (c, x ) and then
in the main loop: set
set c ← ∇−1 φ t+1 t
∇φ(c) + t+11
∇φ(x) . This method (see results in Fig. 7.14)
converges quite faster than BC in terms of the error (Dφ (c, c∗ )+Dφ (c∗ , c))/2.
Another method, less accurate than BBC but better than it in terms of rate
of convergence, and better than BC, is MBC. The MBC algorithm takes into
account the finding of g ∗ : (i) define a new set of transformed vectors: S  ←
{∇φ(x) : x ∈ S}; (ii) call BC g∗ ← BC(S, T ); (ii) obtain c ← ∇−1 φ(c).
Anyway, whatever the method used, once the c∗ (or an approximation) is
obtained, the computation of r∗ is straightforward. Similar methods have
been used to simplify the QP problem in CVMs or SVDDs.
Bregman balls (and ellipsiods) have been recently tested in contexts of
one-class classification [118]. Suppose, for example, that we have a Gaussian
distribution. As the Gaussian pdf belongs to the exponential family, the op-
timal Bregman divergence to choose is the Kullback–Leibler divergence, in
order to find the corresponding ball and to detect supports for classifying test
vectors (see Prob. 7.14).

7.7.3 Unifying Classification: Bregman Divergences and Surrogates

Bregman divergences are recently recognized as a unifying formal tool for


understanding the generalization error of classifiers. The proper theoretical
framework is that of the surrogates (upper bounds of the empirical risk) [12].
Here we consider, for the sake of simplicity, the binary classification prob-
lem, that is, we have a sample S = (X , Y) where X = {x1 , . . . , xN } and
332 7 Classifier Design

Fig. 7.14. BBC results with k = 2 for different Bregman divergences: Itakura–Saito
vs. Kullback–Leibler. (Figure by courtesy of Richard Nock).

Y = {+1, −1}, and a classifier h : X → Y. Then h(x) = y indicates that y


is the correct class of x. Therefore, sign(yh(x)) indicates whether the clas-
sification is correct (positive) or incorrect (negative). As we have seen along
this chapter (see the formulation of random forests), and also partially in
Chapter 5 (RIC), the generalization error is defined as GE = PX ,Y (y =
sign(h(x)) = EX ,Y (y, h(x)), (., .) being a loss function. The GE is typically
estimated by the average of the loss given both the classifier and the sample:
N
R̂(h, S) = N1 i=1 (yi , h(xi )), the empirical risk ER. Surrogates emerge as a
consequence of the hardness of minimizing the ER. Of course, such hardness
depends on the properties of the loss function chosen. The typical one is the 0/1-loss function, $\ell^{0/1}(y, h(x)) = I(h(x) \neq y) = I(\mathrm{sign}(y\,h(x)) \neq +1)$, $I(\cdot)$ being an indicator function. In that case, $\hat{R}$ is renamed $\varepsilon^{0/1}$ and scaled by $N$ to remove the average. The ER under this loss function, however, is hard to minimize. For the sake of clarity, in this section it is more convenient to (i) assume that either $h: \mathcal{X} \rightarrow \mathbb{R}$ or $h: \mathcal{X} \rightarrow [0,1]$, and (ii) use both $\mathcal{Y} = \{0, 1\}$ and $\mathcal{Y}^* = \{-1, 1\}$. Mapping $h$ to the reals allows us to consider $|h|$, the confidence of the classifier, and having both $\mathcal{Y}$ and $\mathcal{Y}^*$, which are equivalent in terms of labeling an example, gives more flexibility. In this regard, the 0/1 loss may be defined either as $\ell^{0/1}_{\mathbb{R}}(y^*, h(x)) = I(\sigma(h(x)) \neq y^*)$, with $\sigma(z) = +1$ if $z \geq 0$ and $-1$ otherwise, if the image of $h$ is $\mathbb{R}$, or as $\ell^{0/1}_{[0,1]}(y, h(x)) = I(\tau(h(x)) \neq y)$, with $\tau(z) = 1$ if $z \geq 1/2$ and $0$ otherwise, if the image of $h$ is $[0,1]$. The empirical risks $\varepsilon^{0/1}_{\mathbb{R}}$ and $\varepsilon^{0/1}_{[0,1]}$ are defined accordingly. Surrogates relax the problem to that of estimating a tight upper bound of $\varepsilon^{0/1}$ (whatever the notation used) by means of different, typically convex, loss functions. It turns out that some of these functions are a subset of the Bregman divergences. The most typical are (see Fig. 7.15)

$$\begin{aligned}
\ell^{\exp}_{\mathbb{R}}(y^*, h(x)) &= e^{-y^* h(x)} &&\text{(exponential loss)}\\
\ell^{\log}_{\mathbb{R}}(y^*, h(x)) &= \log\left(1 + e^{-y^* h(x)}\right) &&\text{(logistic loss)}\\
\ell^{\mathrm{sqr}}_{\mathbb{R}}(y^*, h(x)) &= (1 - y^* h(x))^2 &&\text{(squared loss)}
\end{aligned} \qquad (7.121)$$
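As a quick numerical illustration (a plain NumPy sketch, not part of the original text), the losses of Eq. (7.121) can be evaluated on the margin $y^*h(x)$; the exponential and squared losses dominate the 0/1 loss pointwise, while the logistic loss does so once it is divided by $\log 2$ (this is precisely the $b_\phi$ normalization that appears below when surrogates are built from permissible generators).

```python
import numpy as np

# margins y* h(x); a negative margin means the example is misclassified
m = np.linspace(-3.0, 3.0, 601)

zero_one = (m < 0).astype(float)            # 0/1 loss (margin 0 counted as correct)
exp_loss = np.exp(-m)                       # exponential loss
log_loss = np.log(1.0 + np.exp(-m))         # logistic loss (natural log)
sqr_loss = (1.0 - m) ** 2                   # squared loss

assert np.all(exp_loss >= zero_one)              # e^{-m} >= 1 whenever m <= 0
assert np.all(sqr_loss >= zero_one)              # (1 - m)^2 >= 1 whenever m <= 0
assert np.all(log_loss / np.log(2) >= zero_one)  # log_2(1 + e^{-m}) >= 1 whenever m <= 0
```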

Fig. 7.15. The 0/1 loss functions and other losses used for building surrogates.

A surrogate $\varepsilon$ must satisfy $\varepsilon^{0/1}(h, S) \leq \varepsilon(h, S)$. First of all, in order to define permissible surrogates based on a given loss function $\ell(\cdot,\cdot)$, the loss must satisfy three properties: (i) $\ell(\cdot,\cdot)$ is lower bounded by zero; (ii) $\arg\min_x \varepsilon_{[0,1]}(x, S) = q$, $x$ being the output of the classifier mapped to the interval $[0,1]$, $\varepsilon_{[0,1]}$ an empirical risk based on a loss $\ell_{[0,1]}$ when the output of the classifier is mapped to $[0,1]$, and $q = \hat{p}(y|x)$ the approximation of the conditional pdf $p$ given by the proportion of positive examples with observation $x$ (this is the proper scoring rule); and (iii) $\ell(y, h(x)) = \ell(1-y, 1-h(x))$ for $y \in \{0, 1\}$ and $h(x) \in [0,1]$ (symmetry property). Given the latter properties over loss functions, a function $\phi: [0,1] \rightarrow \mathbb{R}^-$ is permissible if and only if $-\phi$ is differentiable on $(0,1)$, strictly concave, symmetric about $x = 1/2$, and $a_\phi = -\phi(0) = -\phi(1) \geq 0$ (then $b_\phi = -\phi(1/2) - a_\phi > 0$). Some examples of permissible functions, which must be scaled so that $\phi(1/2) = -1$, are

$$\begin{aligned}
\phi_M(x) &= -\sqrt{x(1-x)} &&\text{(Matsushita's error)}\\
\phi_Q(x) &= x\log x + (1-x)\log(1-x) &&\text{(bit entropy)}\\
\phi_B(x) &= -x(1-x) &&\text{(Gini index)}
\end{aligned} \qquad (7.122)$$
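A tiny numerical check (illustrative only, with assumed clipping for the $0\log 0$ convention) that the three generators of Eq. (7.122) meet the permissibility conditions stated above: $-\phi$ is concave on $(0,1)$ and symmetric about $x = 1/2$, and $a_\phi = -\phi(0) = -\phi(1) = 0$.

```python
import numpy as np

def phi_M(x):  # Matsushita's error
    return -np.sqrt(x * (1.0 - x))

def phi_Q(x):  # bit entropy, with the convention 0 log 0 = 0
    xc = np.clip(x, 1e-12, 1.0 - 1e-12)
    return xc * np.log(xc) + (1.0 - xc) * np.log(1.0 - xc)

def phi_B(x):  # Gini index
    return -x * (1.0 - x)

x = np.linspace(0.01, 0.99, 99)
for phi in (phi_M, phi_Q, phi_B):
    assert np.allclose(phi(x), phi(1.0 - x))               # symmetry about x = 1/2
    assert np.all(np.diff(-phi(x), 2) <= 1e-12)            # -phi concave: 2nd differences <= 0
    assert abs(phi(0.0)) < 1e-9 and abs(phi(1.0)) < 1e-9   # a_phi = -phi(0) = -phi(1) = 0
```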

The link between permissible functions and surrogates is the following: a loss $\ell(y, h)$ is properly defined and satisfies the three properties referred to above if and only if $\ell(y, h) = D_\phi(y, h)$, $D_\phi(\cdot,\cdot)$ being a Bregman divergence with a permissible generator $\phi(\cdot)$ [120]. Using again the Legendre transformation, the Legendre conjugate of $\phi$ is given by
$$\phi^*(x) = \sup_{u \in \mathrm{dom}(\phi)} \{ux - \phi(u)\} \qquad (7.123)$$

As we have seen in Section 7.7.1, ∇φ = ∇−1 φ∗ and ∇φ∗ = ∇−1 φ. In addition:

φ∗ (x) = x∇−1 φ(x) − φ(∇−1 φ(x)) (7.124)

Therefore, for y ∈ {0, 1} we have

Dφ (y, ∇−1 φ(h)) = φ(y) − φ(∇−1 φ(h)) − (y − ∇−1 φ(h))∇φ(∇−1 φ(h))


= φ(y) − φ(∇−1 φ(h)) − (y − ∇−1 φ(h))h
= φ(y) − φ(∇−1 φ(h)) − yh + ∇−1 φ(h)h
= φ(y) − yh + ∇−1 φ(h)h − φ(∇−1 φ(h))
= φ(y) − yh + φ∗ (h) (7.125)
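A quick numerical check of (7.125) (an illustrative sketch, not from the text): take the bit-entropy generator $\phi_Q$, for which $\nabla\phi(x) = \log(x/(1-x))$, $\nabla^{-1}\phi(h) = 1/(1+e^{-h})$ and $\phi^*(h) = \log(1+e^h)$, and verify that $D_\phi(y, \nabla^{-1}\phi(h)) = \phi(y) - yh + \phi^*(h)$; for $y \in \{0,1\}$ this is exactly the logistic loss $\log(1+e^{-y^*h})$ of Eq. (7.121).

```python
import numpy as np

def phi(x):
    """Bit-entropy generator phi_Q(x) = x log x + (1-x) log(1-x), with 0 log 0 := 0."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return x * np.log(x) + (1 - x) * np.log(1 - x)

grad     = lambda x: np.log(x / (1 - x))        # nabla phi (logit)
grad_inv = lambda h: 1.0 / (1.0 + np.exp(-h))   # inverse gradient (sigmoid)
conj     = lambda h: np.log(1.0 + np.exp(h))    # Legendre conjugate phi*(h)

def bregman(p, q):
    return phi(p) - phi(q) - (p - q) * grad(q)

rng = np.random.default_rng(0)
h = rng.normal(size=1000)                       # real-valued classifier outputs
for y in (0.0, 1.0):
    lhs = bregman(y, grad_inv(h))                          # D_phi(y, grad^{-1}phi(h))
    rhs = phi(np.full_like(h, y)) - y * h + conj(h)        # right-hand side of Eq. (7.125)
    y_star = 2 * y - 1                                     # map {0,1} -> {-1,+1}
    logistic = np.log(1.0 + np.exp(-y_star * h))           # logistic loss
    assert np.allclose(lhs, rhs) and np.allclose(lhs, logistic)
```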

As Dφ (y, ∇−1 φ(h)) = Dφ (1−y, 1−∇−1 φ(h)) because the Bregman divergence
is a loss function, and loss functions satisfy symmetry, we have

Dφ (1 − y, 1 − ∇−1 φ(h)) = φ(1 − y) − (1 − y)h + φ∗ (−h) (7.126)

by applying ∇−1φ(−x) = 1 − ∇−1φ(x). Then, considering that φ∗ (x) =


φ∗ (−x) + x, aφ = −φ(0) = −φ(1) ≥ 0, and combining both definitions of
divergence, we obtain

Dφ (y, ∇−1 φ(h)) = φ∗ (−y ∗ h) + aφ =Dφ (0, ∇−1 φ(−y ∗ h)) = Dφ (1, ∇−1 φ(y ∗ h))
(7.127)

where proving the latter two equivalences is straightforward. Thus, $\phi^*(\cdot)$ seems to be a Bregman divergence (and it is, because $a_\phi = 0$ for the well-known permissible $\phi$). We then define $F_\phi(x) = (\phi^*(-x) - a_\phi)/b_\phi$, which, for $a_\phi = 0$ and $b_\phi = 1$, yields $F_\phi(x) = \phi^*(-x)$. This new function is strictly convex and satisfies $\ell^{0/1}_{\mathbb{R}}(y^*, h) \leq F_\phi(y^* h)$, which ensures that $F_\phi$ defines a surrogate because
$$\varepsilon^{0/1}(h, S) \leq \varepsilon_\phi(h, S) = \sum_{i=1}^{N} F_\phi(y_i^* h(x_i)) = \sum_{i=1}^{N} D_\phi\left(0, \nabla^{-1}\phi(-y_i^* h(x_i))\right) \qquad (7.128)$$
When using the latter definitions of surrogates it is useful to recall that
$$\begin{aligned}
F_\phi(y^* h) &= -y^* h + \sqrt{1 + (y^* h)^2} &&\text{for } \phi_M\\
F_\phi(y^* h) &= \log\left(1 + e^{-y^* h}\right) &&\text{for } \phi_Q\\
F_\phi(y^* h) &= (1 - y^* h)^2 &&\text{for } \phi_B
\end{aligned} \qquad (7.129)$$
which correspond, in the two latter cases, to the losses defined above (logistic and squared). Given this basic theory, how do we apply it to a linear-separator (LS) classifier like the one coming from boosting? Such a classifier, $H(x_i) = \sum_{t=1}^{T}\alpha_t h_t(x_i)$, is composed of weak classifiers $h_t$ and leveraging coefficients $\alpha_t$. First of all, we identify $\nabla^{-1}\phi(-y_i^* h(x_i))$ with the weights $w_i$ used in the sampling strategy of boosting. This identification comes naturally from the latter definitions of $F_\phi(y^* h)$: the worse an example is classified, the higher the loss and the higher its weight, and vice versa. It is convenient to group the disparities between all classifiers and all the examples in a single matrix $M$ of dimensions $N \times T$:
$$M = -\begin{pmatrix}
y_1^* h_1(x_1) & \cdots & y_1^* h_t(x_1) & \cdots & y_1^* h_T(x_1)\\
\vdots & & \vdots & & \vdots\\
y_i^* h_1(x_i) & \cdots & y_i^* h_t(x_i) & \cdots & y_i^* h_T(x_i)\\
\vdots & & \vdots & & \vdots\\
y_N^* h_1(x_N) & \cdots & y_N^* h_t(x_N) & \cdots & y_N^* h_T(x_N)
\end{pmatrix} \qquad (7.130)$$

Then, a disparity (an edge, in this context) between the boosted classifier $H$ and the class $y_i^*$ corresponding to example $x_i$ is given by the product of the $i$th row of $M$ and $\alpha = (\alpha_1, \ldots, \alpha_T)^T$: $-M_i\alpha = y_i^*\left(\sum_{t=1}^{T}\alpha_t h_t(x_i)\right)$. On the other hand, the disparity between a weak classifier and $S$ is given by the product of the transpose of the $t$th column of $-M$ and $w = (w_1, \ldots, w_N)^T$: $-M_t^T w = \sum_{i=1}^{N} w_i y_i^* h_t(x_i)$. Here, the weights are of fundamental importance because they allow us to compute the optimal leveraging coefficients. Recall that the weights are identified with the inverse gradient of $\phi$:
$$w_i = \nabla^{-1}\phi(-y_i^* h(x_i)) \qquad (7.131)$$
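The following NumPy fragment (an illustrative sketch with made-up decision-stump weak learners, not from the text) builds the matrix $M$ of Eq. (7.130) and evaluates the two kinds of edges and the weights of Eq. (7.131) for the bit-entropy generator, whose inverse gradient is the logistic sigmoid.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 200, 5
X = rng.normal(size=(N, 2))                      # toy examples
y_star = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)  # labels in {-1, +1}

# toy weak learners: decision stumps on random coordinates/thresholds (h_t in {-1, +1})
coords, thr = rng.integers(0, 2, size=T), rng.normal(size=T)
H_weak = np.stack([np.where(X[:, coords[t]] > thr[t], 1, -1) for t in range(T)], axis=1)

M = -(y_star[:, None] * H_weak)                  # Eq. (7.130): M_{it} = -y_i^* h_t(x_i)

alpha = np.full(T, 0.1)                          # some leveraging coefficients
edges_H = -M @ alpha                             # -M_i alpha = y_i^* sum_t alpha_t h_t(x_i)

grad_inv = lambda z: 1.0 / (1.0 + np.exp(-z))    # inverse gradient of the bit entropy
w = grad_inv(M @ alpha)                          # Eq. (7.131): w_i = grad^{-1}phi(-y_i^* H(x_i))
edges_weak = -M.T @ w                            # -M_t^T w = sum_i w_i y_i^* h_t(x_i)
```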

A key element for finding the optimal classifier through this approach is
the Bregman–Pythagoras theorem applied to this context:

Dφ (0, w) = Dφ (0, w∞ ) + Dφ (w∞ , w) (7.132)

where $w_\infty$ are the optimal weights, and the vectorial notation of $D_\phi$ stands for $D_\phi(u, v) = \sum_i D_\phi(u_i, v_i)$ (component-wise sum). The idea is then to design an iterative algorithm for minimizing $D_\phi(0, w)$ over $\Omega \ni w$. More precisely,
$$\min_{H} \varepsilon_\phi(H, S) = \min_{w \in \Omega} D_\phi(0, w) \qquad (7.133)$$

Such an iterative algorithm starts by setting $w_1 \leftarrow \nabla^{-1}\phi(0)\mathbf{1}$, that is, $w_{1,i} = 1/2,\ \forall i$, because $\nabla^{-1}\phi(0) = 1/2$. The initial leveraging coefficients are set to zero: $\alpha_1 \leftarrow 0$. The most important part of the algorithm is how to update $w_{j+1}$ given $w_j$:
$$w_{j+1,i} \leftarrow \nabla^{-1}\phi\left(M_i(\alpha_j + \delta_j)\right) \qquad (7.134)$$

This is consistent with the idea that $M_i\alpha$ is associated with an edge of $H$. In this formulation, $\alpha_j + \delta_j$ is the solution to the following nonlinear system:
$$\sum_{i=1}^{N} M_{it}\,\nabla^{-1}\phi\left(M_i(\alpha_j + \delta_j)\right) = 0 \qquad (7.135)$$

where $t \in T_j$ and $T_j \subseteq \{1, 2, \ldots, T\}$, typically with $|T_j| = 1$, which means that only a single parameter is updated at each iteration. Then we set $\alpha_{j+1} \leftarrow \alpha_j + \delta_j$. This setting ensures that $D_\phi(0, w_{j+1})$ decreases with respect to $D_\phi(0, w_j)$ by following the rule
$$D_\phi(0, w_{j+1}) = D_\phi(0, w_j) - D_\phi(w_{j+1}, w_j) \qquad (7.136)$$
which may be proved by adapting the definition of $D_\phi(\cdot,\cdot)$ to the vectorial case. In addition, the computation of $w_{j+1}$ ensures that $w_{j+1} \in \mathrm{Ker}\,M^T|_{T_j}$, where the notation $|_{T_j}$ indicates that only the columns of $M$ (rows of $M^T$) indexed by $T_j$ are kept, and $\mathrm{Ker}$ is the usual null space, that is, $\mathrm{Ker}\,A = \{u \in \mathbb{R}^N : Au = 0\}$. This link between null spaces and weights (edges) can be summarized by the following equivalence:
$$w_j \in \mathrm{Ker}\,M^T|_{\{t\}} \;\Leftrightarrow\; -\mathbf{1}_{\{t\}} M^T w_j = 0 \qquad (7.137)$$
where we assume Tj = {t}, and the right hand of the equivalence defines an
edge of ht . The efficient convergence of the algorithm is ensured when the
set of features considered in each iteration works slightly better than random
(weak learners assumption), that is, when assuming ht ∈ {−1, 1} we have
$$\mathbf{1}_{\{t\}} M^T w_j > Z_j \gamma_j \;\Leftrightarrow\; \left|\frac{1}{Z_j}\,\varepsilon^{0/1}_{\mathbb{R}}(h_t, S) - \frac{1}{2}\right| \geq \gamma_j > 0 \qquad (7.138)$$

that is, the classifier $h_t$ is better than random by an amount $\gamma_j$. In the latter formulation, $Z_j = \|w_j\|_1 = \sum_i |w_{j,i}|$ (the $L_1$ norm). Furthermore, the convergence conditions (no new update needed) are
$$\delta = 0\;\;\forall T_j \;\Leftrightarrow\; w_j \in \mathrm{Ker}\,M^T|_{T_j}\;\;\forall T_j \;\Leftrightarrow\; w_j \in \mathrm{Ker}\,M^T \;\Leftrightarrow\; w_j = w_\infty \qquad (7.139)$$

The progress of the algorithm can be illustrated graphically by representing the weight space $\Omega$ as a non-Riemannian manifold (see Fig. 7.16). The algorithm is dubbed ULS (Universal Linear Separator) and is described in Alg. 20. It turns out that when we choose $\phi(x) = e^{-x}$ we obtain the Adaboost algorithm with unnormalized weights. Thus, ULS is a general (universal) method for learning LSs. Furthermore, the ULS approach may be extended to other classifiers like the ones presented throughout this chapter. For instance, the basic insight for adapting ULS to decision trees is to consider each node of the tree as a weak classifier. In this regard, the output of each internal weak classifier (test) is $h_j \in \{0, 1\}$ (the output of a Boolean test, so we have a binary tree $H$). On the other hand, the output of a leaf node $l \in \partial H$ will be $h_l \in \{-1, 1\}$.

Fig. 7.16. ULS iterative minimization. Top: Bregman–Pythagoras theorem. Bottom: geometric representation of the iterative algorithm for minimizing $D_\phi(0, w_j)$.

Algorithm 20: ULS algorithm

Input: $M$, $\phi(\cdot)$
Initialize $w_0 = \nabla^{-1}\phi(0)\mathbf{1}$, $\alpha_0 = 0$
for $j = 1$ to $T$ do
    Pick $T_j \subseteq \{1, 2, \ldots, T\}$, with $|T_j|$ small for computational reasons (typically $|T_j| = 1$)
    If $|T_j| = 1$ then set $T_j = \{t\}$
    Set $\delta_j = 0$
    $\forall t \in T_j$ select $\delta_{j,t}$ so that
    $$\sum_{i=1}^{N} M_{it}\,\underbrace{\nabla^{-1}\phi\left(M_i(\alpha_j + \delta_j)\right)}_{w_{j+1,i}} = 0 \qquad (7.140)$$
    Set $\alpha_{j+1} = \alpha_j + \delta_j$
end
Output: Strong classifier $H(x) = \sum_{t=1}^{T}\alpha_t h_t(x)$
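Below is a compact Python sketch of ULS under stated assumptions: it uses the bit-entropy generator (so $\nabla^{-1}\phi$ is the sigmoid), takes $|T_j| = 1$ with the weak learner picked in round-robin order, and solves the one-dimensional equation (7.140) with a bracketed root finder from SciPy. The weak-learner construction, bracketing and stopping choices are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import brentq

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))  # inverse gradient of phi_Q

def uls(M, n_rounds=50):
    """Sketch of ULS (Alg. 20) with |T_j| = 1 and a round-robin choice of t.
    M is the N x T matrix of Eq. (7.130); returns the leveraging coefficients."""
    N, T = M.shape
    alpha = np.zeros(T)
    for j in range(n_rounds):
        t = j % T                                  # T_j = {t}, chosen cyclically
        def f(delta):
            # left-hand side of Eq. (7.140) when only coordinate t is updated
            w = sigmoid(M @ (alpha + delta * np.eye(T)[t]))
            return np.dot(M[:, t], w)
        b = 1.0                                    # expand a bracket until f changes sign
        while f(-b) * f(b) > 0 and b < 64:
            b *= 2.0
        if f(-b) * f(b) < 0:
            alpha[t] += brentq(f, -b, b)           # delta_{j,t} solving (7.140)
    return alpha

# toy data: +/-1 labels and decision-stump weak learners
rng = np.random.default_rng(1)
N, T = 200, 5
X = rng.normal(size=(N, 2))
y_star = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
coords, thr = rng.integers(0, 2, size=T), rng.normal(size=T)
stumps = np.stack([np.where(X[:, coords[t]] > thr[t], 1, -1) for t in range(T)], axis=1)
M = -(y_star[:, None] * stumps)

alpha = uls(M)
H = stumps @ alpha                                 # strong classifier H(x_i)
print("training error:", np.mean(np.sign(H) != y_star))
```

Tracking $D_\phi(0, w_j)$, with $w_j = \nabla^{-1}\phi(M\alpha_j)$, across the rounds of this loop should exhibit the monotone decrease stated in Eq. (7.136).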

Therefore, each sequence of decisions (test results) yields a given label. As the tree induces a partition of $S$ into subsets $S_l$ (the examples arriving at each leaf $l$), the probability that the output class of the tree is $+1$ is the proportion between the samples of $S_l$ labeled $+1$ (denoted $S_l^+$) and $|S_l|$, and the same rationale applies to the probability of class $-1$. This is the basis of the weak classifiers associated with the leaves of the tree, and this rationale leads to the LDS or Linear Decision Tree. To apply the concept of ULS here, we must estimate the leveraging coefficients for each node. In this case, these coefficients can be obtained from
$$\alpha_l = \frac{1}{h_l}\left(\nabla\phi\left(\frac{|S_l^+|}{|S_l|}\right) - \sum_{t\in P_l}\alpha_t h_t\right) \qquad (7.141)$$
$P_l$ being the classifiers belonging to the path between the root and leaf $l$. Then, for an observation $x$ reaching leaf $l$, we have that $H(x) = \nabla\phi(|S_l^+|/|S_l|)$. Therefore $\nabla^{-1}\phi(H(x)) = |S_l^+|/|S_l|$, and the surrogate is defined as follows:
$$\varepsilon_\phi(H, S) = \sum_{i=1}^{N} F_\phi(y_i^* H(x_i)) = \sum_{l\in\partial H} |S_l|\left(-\phi\left(\frac{|S_l^+|}{|S_l|}\right)\right) \qquad (7.142)$$
if $a_\phi = 0$ and $b_\phi = 1$ (see Prob. 7.15). The extension of ULS to trees is then straightforward: start with a tree consisting of a single leaf and, at each step, expand the leaf (corresponding to a given variable) satisfying
$$\sum_{l\in\partial H'} |S_l|\left(-\phi\left(\frac{|S_l^+|}{|S_l|}\right)\right) - \sum_{l\in\partial H} |S_l|\left(-\phi\left(\frac{|S_l^+|}{|S_l|}\right)\right) < 0 \qquad (7.143)$$
where $H$ is the current tree and $H'$ is the new one after the expansion of a given leaf. The leveraging coefficients of the new leaves may be computed at each iteration from those of the interior nodes.
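As a small illustration (a sketch with a hypothetical leaf partition, not from the text), the tree surrogate of Eq. (7.142) and the expansion test of Eq. (7.143) only need, for each leaf, the pair $(|S_l|, |S_l^+|)$:

```python
import numpy as np

def phi_matsushita(p):
    """Permissible generator phi_M(p) = -sqrt(p(1-p))."""
    return -np.sqrt(p * (1.0 - p))

def tree_surrogate(leaves, phi=phi_matsushita):
    """Eq. (7.142): sum over leaves of |S_l| * (-phi(|S_l^+| / |S_l|)).
    `leaves` is a list of (n_l, n_l_plus) pairs describing the partition of S."""
    return sum(n * -phi(n_plus / n) for n, n_plus in leaves)

# hypothetical current tree H: one leaf containing all 100 examples, 60 of them positive
H = [(100, 60)]
# hypothetical tree H' obtained by splitting that leaf on some test
H_prime = [(55, 50), (45, 10)]

# Eq. (7.143): expand only if the surrogate decreases
if tree_surrogate(H_prime) - tree_surrogate(H) < 0:
    print("expand the leaf")
```

Note that with the bit-entropy generator, $-\phi(|S_l^+|/|S_l|)$ is the binary entropy of the leaf, so the test in Eq. (7.143) reduces to the familiar entropy-based splitting gain, and with the Gini generator it reduces to the Gini-impurity criterion (compare with Prob. 7.4).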

Problems
7.1 Rare classes and Twenty Questions
The purpose of “Twenty Questions” (TQ) is to find a testing strategy which
performs the minimum number of tests for finding the true class (value of
Y ). For doing so, the basic idea is to find the test dividing the population in
masses as equal as possible.
1. Given the sequence of questions Qt , check that this strategy is consistent
with finding the test Xt+1 maximizing H(Y |Qt , Xt+1 ).
2. What is the result of applying the latter strategy to the data in
Table 7.2?
7.2 Working with simple tags
1. Given the image examples and tags of Fig. 7.3, extract all the binary
relationships (a tag number may be repeated in the relationship, for
instance 5 ↑ 5) and obtain the classification tree.
2. Compare the results, in terms of complexity of the tree and classification
performance with the case of associating single tags to each test.
7.3 Randomization and multiple trees
1. Given the image examples and tags of Fig. 7.3, and having extracted B,
use randomization and the multiple-tree methodology to learn shallow trees.
2. Is the number of examples enough to learn all posteriors? If not, expand
the training set conveniently.
7.4 Gini index vs. entropy
An alternative measure to entropy for quantifying node impurity is the Gini
index defined as
$$G(Y) = \frac{1}{2}\left[1 - \sum_{y\in\mathcal{Y}} P^2(Y = y)\right]$$

This measure is used, for instance, during the growing of RFs. Considering the
case of two categories (classes), compare both analytically and experimentally
this measure with entropy.
7.5 Breiman’s conjecture
When defining RFs, Breiman conjectures that Adaboost emulates RFs at
the later stages. Relate this conjecture with the probability of the kth set of
weights.
7.6 Weak dependence of multiple trees
In the text it is claimed that randomization yields weak statistical depen-
dency among the trees. Given the trees grown in the previous problem, check

the claim by computing both the conditional variances νc and the sum of
conditional covariances for all classes γc which are defined as follows:

$$\nu_c = \frac{1}{K}\sum_{k=1}^{K}\sum_{d=1}^{C} \mathrm{Var}\left(\mu_{T_k}(d)\,|\,Y = c\right)$$
$$\gamma_c = \frac{1}{K^2}\sum_{k\neq p}\sum_{d=1}^{C} \mathrm{Cov}\left(\mu_{T_k}(d), \mu_{T_p}(d)\,|\,Y = c\right)$$

7.7 Derivation of mutual information


In Infomax Boosting, Quadratic Divergence (Eq. 7.37) is used to derive
I(φIx ; c). Derive I(φIx ; c) from a different similarity measure between his-
tograms, like the Chi-Square $X^2$:
$$X^2_{ij} = \sum_{k=1}^{n} \frac{(H_i(k) - \hat{H}(k))^2}{\hat{H}(k)} \qquad (7.144)$$
$$\hat{H}(k) = \frac{H_i(k) + H_j(k)}{2} \qquad (7.145)$$
7.8 Discrete infomax classifier
The output of Alg. 17 is a strong classifier which returns a real value in the range [−1, 1]. Transform this algorithm so that it returns a discrete classifier with two possible class labels (0 and 1).
7.9 How does Information Theory improve Adaboost?
Infomax and Jensen–Shannon boosting rely on information-theoretic measures, rather than on the classification error, to select a weak learner at each iteration. Explain why this makes these algorithms yield better results than the original Adaboost.
7.10 Maximum entropy classifiers
In the example presented in Table 7.3 there are five vehicles with three features
for each one of them. Suppose we are also informed that the first vehicle is
equipped with a sidecar, and the rest of them are not.
1. Update the data of Tables 7.4 and 7.5.
2. For the new data the Improved Iterative Scaling (Alg. 19) converges to
the λi values:

(λ1 , . . . , λ8 ) = (0.163, −0.012, −0.163, 14.971, −0.163, 0.012, 0.163, 0)

Calculate again the probability that the vehicle which has gears, 4
wheels and 4 seats is a car, and the probability that it is a motorbike.
3. Has the new feature helped the model to improve the classification?
4. Why is λ8 = 0?
5. Is the Maximum Entropy classifier suitable for real-valued features?
How would you add the maximum speed of the vehicle as a feature?

7.11 Parameterizing the Gaussian


Prove that the well-known Gaussian distribution belongs to the exponential family. Identify S, T(x), π(x), and G(Λ). Show also the one-to-one correspondence, through the first moment of G(Λ), between the natural and usual spaces.

7.12 Improved iterative scaling equations


Given the derivation of the maximum-likelihood updates δj in Eq. 7.98, gen-
eralize them for the expressions contained in Alg. 19, that is, considering the
conditional model for the maximum entropy classifier.

7.13 Exponential distributions and bijections with Bregman


divergences
Assuming $F(x) = x$, check that the following settings of $G^*$ yield the following distributions: $G^*(x) = \|x\|^2/2$ gives a unit-variance Gaussian through $D_{G^*}(x, \mu)$; $G^*(x) = \sum_i x_i \log x_i$ gives a multinomial distribution through $D_{G^*}(x, q)$; and $G^*(x) = -\sum_i \log x_i$ gives a geometric distribution through $D_{G^*}(x, \mu)$, with $\lambda_i = 1/\mu_i$.

7.14 Bregman balls for Gaussian distributions


Reformulate the BBC algorithm for the case of assuming a Gaussian distri-
bution as a model for the training data. What is the proper generator and
Bregman divergence in this case? Design a test for detecting the Gaussianity
of test vectors.

7.15 Bregman surrogates for trees and random forests


1. Using the definition of $F_\phi(\cdot)$ in terms of the Bregman divergence between 0 and the inverse gradient, prove that
$$\varepsilon_\phi(H, S) = \sum_{i=1}^{N} F_\phi(y_i^* H(x_i)) = \sum_{l\in\partial H} |S_l|\left(-\phi\left(\frac{|S_l^+|}{|S_l|}\right)\right) \qquad (7.146)$$

2. Speculate on the impact of replacing the majority (voting) rule used for random forests by a ULS approach. This is related to Breiman's conjecture (see Prob. 7.5). For this rationale it is also useful to consider that the performance of Adaboost decreases when the class labels of several training examples are corrupted.

7.8 Key References

• L. Breiman. “Random Forests”. Machine Learning 45: 5–32 (2001)
• M. Robnik-Sikonja. “Improving Random Forests”, European Conference
on Machine Learning, Pisa (Italy) (2004)

• D. Geman and B. Jedynak. “Model-Based Classification Trees”. IEEE


Transactions on Information Theory 47(3) (2001)
• S. Lyu. “Infomax Boosting”. IEEE Conference on Computer Vision and
Pattern Recognition Vol. I 533–538, San Diego (USA) (2005)
• X. Huang, S.Z. Li, and Y. Wang. “Jensen–Shannon Boosting Learning for Object Recognition”. IEEE Conference on Computer Vision and Pattern Recognition Vol. II 144–149, San Diego (USA) (2005)
• C. Liu and H. Y. Shum. “Kullback–Leibler Boosting”. IEEE Conference
on Computer Vision and Pattern Recognition 407–411, Madison (USA)
(2003)
• A. Fred and A.K. Jain. “Combining Multiple Clustering Using Evidence
Accumulation”. IEEE Transactions on Pattern Analysis and Machine In-
telligence 27(6): 835–850 (2005)
• A. Topchy, A.K. Jain, and W. Punch. “Clustering Ensembles: Models of
Consensus and Weak Partitions”. IEEE Transactions on Pattern Analysis
and Machine Intelligence 27(12): 1866–1881 (2005)
References

1. W. Aguilar, Y. Frauel, F. Escolano, M.E. Martı́nez-Pérez, A. Espinosa-Romero,


and M.A. Lozano, A robust graph transformation matching for non-rigid reg-
istration, Image and Vision Computing 27 (2009), 7, 897–910.
2. H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (1974), 6, 716–723.
3. S.-I. Amari, Differential-geometrical methods in statistics, Lecture Notes in
Statistics, Springer-Verlag, Berlin, 28, (1985).
4. S.-I. Amari, Information geometry and its applications, Emerging Trends in
Visual Computing (PASCAL videolectures), 2008.
5. S.-I. Amari and H. Nagaoka, Methods of information geometry, American
Mathematical Society, 2001.
6. A.A. Amini, R.W. Curwen, and J.C. Gore, Snakes and splines for tracking
non-rigid heart motion, Proceedings of the European Conference on Computer
Vision, Cambridge, UK, 1996, pp. 251–261.
7. Y. Amit and D. Geman, Shape quantization and recognition with randomized
trees, Neural Computation (1997), 9, 1545–1588.
8. Y. Amit, D. Geman, and B. Jedynak, Efficient focusing and face detection,
Face recognition: from theory to applications, eds. H. Wechsler et al., NATO
ASI Series F, Springer-Verlag, Berlin, pp. 157–173, 1998.
9. A.M. Peter and A. Rangarajan, Information geometry for landmark shape
analysis: unifying shape representation and deformation, IEEE Transactions
on Pattern Analysis and Machine Intelligence 31 (2008), 2, 337–350.
10. J.H. Friedman and J.W. Tukey, A projection pursuit algorithm for exploratory
data analysis, IEEE Transactions on Computers 23 (1974), 9, 881–890.
11. A. Banerjee, S. Merugu, I.S. Dillon, and J. Ghosh, Clustering with Bregman
divergences, Journal of Machine Learning Research 6 (2005), 1705–1749.
12. P.L. Bartlett, M.I. Jordan, and J.D. Mcauliffe, Convexity, classification, and
risk bounds, Journal of the American Statistical Association 101 (2006),
138–156.
13. E. Beirlant, E. Dudewicz, L. Gyorfi, and E. Van der Meulen, Nonparamet-
ric entropy estimation, International Journal on Mathematical and Statistical
Sciences 6 (1996), 1, 17–39.
14. A.J. Bell and T.J. Sejnowski, An information-maximization approach to blind
separation and blind deconvolution, Neural Computation 7 (1995), 1129–1159.

15. A.J. Bell and T.J. Sejnowski, Edges are the independent components of natural
scenes, Advances on Neural Information Processing Systems (NIPS), 8, (1996),
831–837.
16. A.J. Bell, The co-information lattice, Proceedings of the International Work-
shop on Independent Component Analysis and Blind Signal Separation, Nara,
Japan, 2003, pp. 921–926.
17. A. Berger, The improved iterative scaling algorithm: a gentle introduction,
Technical report, Carnegie Melon University, 1997.
18. A. Berger, Convexity, maximum likelihood and all that, Carnegie Mellon
University, Pittsburgh, PA, 1998.
19. A. Berger, S. Della Pietra, and V. Della Pietra, A Maximum entropy approach
to natural language processing, Computational Linguistics, 22 (1996).
20. D. Bertsekas, Convex analysis and optimization, Athena Scientific, Nashua,
NH, 2003.
21. D.J. Bertsimas and G. Van Ryzin, An asymptotic determination of the min-
imum spanning tree and minimum matching constants in geometrical proba-
bility, Operations Research Letters 9 (1990), 1, 223–231.
22. P.J. Besl and N.D. McKay, A method for registration of 3D shapes, IEEE
Transactions on Pattern Analysis and Machine Intelligence 14 (1992), 2,
239–256.
23. A. Blake and M. Isard, Active contours, Springer, New York, 1998.
24. A. Blum and P. Langley, Selection of relevant features and examples in machine
learning, Artificial Intelligence 97 (1997), 1–2, 245–271.
25. B. Bonev, F. Escolano, and M. Cazorla, Feature selection, mutual informa-
tion, and the classification of high-dimensional patterns, Pattern Analysis and
Applications, 11 (2008) 309–319.
26. F.L. Bookstein, Principal warps: thin plate splines and the decomposition of
deformations, IEEE Transactions on Pattern Analysis and Machine Intelligence
11 (1989), 6, 567–585.
27. A. Bosch, A. Zisserman, and X. Muñoz, Image classification using random
forests and ferns, Proceedings of the International Conference on Computer
Vision, Rio de Janeiro, Brazil, 2007, pp. 1–8.
28. A. Bosch, A. Zisserman, and X. Muñoz, Scene classification using a hybrid
generative/discriminative approach, IEEE Transactions on Pattern Analysis
and Machine Intelliegnce 30 (2008), 4, 1–16.
29. L.M. Bregman, The relaxation method of finding the common point of convex
sets and its application to the solution of problems in convex programming,
USSR Computational Mathematics and Physics 7 (1967), 200–217.
30. L. Breiman, Random forests, Machine Learning 1 (2001), 45, 5–32.
31. L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and regression
trees, Wadsworth, Belmont, CA, 1984.
32. J.E. Burr, Properties of cross-entropy minimization, IEEE Transactions on
Information Theory 35 (1989), 3, 695–698.
33. M. Bǎdoiu and K.-L. Clarkson, Optimal core-sets for balls, Computational
Geometry: Theory and Applications 40 (2008), 1, 14–22.
34. X. Calmet and J. Calmet, Dynamics of the Fisher information metric, Physical
Review E 71 (2005), 056109.
35. J.-F. Cardoso and A. Souloumiac, Blind beamforming for non Gaussian signals,
IEE Proceedings-F 140 (1993), 6, 362–370.

36. M.A. Cazorla, F. Escolano, D. Gallardo, and R. Rizo, Junction detection and
grouping with probabilistic edge models and Bayesian A∗ , Pattern Recognition
9 (2002), 35, 1869–1881.
37. T.F. Chan and L. Vese, An active contour model without edges, Proceedings
of International Conference Scale-Space Theories in Computer Vision, Corfu,
Greece, 1999, pp. 141–151.
38. X. Chen and A.L. Yuille, Adaboost learning for detecting and reading text in
city scenes, Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, Washington DC, USA, 2004.
39. X. Chen and A.L. Yuille, Time-efficient cascade for real time object detection,
1st International Workshop on Computer Vision Applications for the Visually
Impaired. Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition, Washington DC, USA, 2004.
40. H. Chui and A. Rangarajan, A new point matching algorithm for nonrigid
registration, Computer Vision and Image Understanding 89 (2003), 114–141.
41. P. Comon, Independent component analysis, a new concept? Signal Processing
36 (1994), 287–314.
42. J.M. Coughlan and A. L. Yuille, Bayesian A∗ tree search with expected o(n)
node expansions: applications to road tracking, Neural Computation 14 (2002),
1929–1958.
43. T. Cover and J. Thomas, Elements of information theory, Wiley, New York
1991.
44. I. Csiszár, A geometric interpretation of Darroch and Ratcliff’s generalized
iterative scaling, Annals of Probability 17 (1975), 3, 1409–1413.
45. I. Csiszár, I-divergence geometry of probability distributions and minimization
problems, Annals of Probability 3 (1975), 1, 146–158.
46. S. Dalal and W. Hall, Approximating priors by mixtures of natural conjugate
priors, Journal of the Royal Statistical Society(B) 45 (1983), 1.
47. J.N. Darroch and D. Ratcliff, Generalized iterative scaling for log-linear models,
Annals of Mathematical Statistics 43 (1983), 1470–1480.
48. P. Dellaportas and I. Papageorgiou, Statistics and Computing 16 (2006), 1, 57–68.
49. A. Dempster, N. Laird, and D. Rubin, Maximum likelihood estimation from
incomplete data via the EM algorithm, Journal of the Royal Statistical Society
39 (1977), 1, 1–38.
50. G.L. Donato and S. Belongie, Approximate thin plate spline mappings, Proceed-
ings of the European Conference on Computer Vision, Copenhagen, Denmark,
vol. 2, 2002, pp. 531–542.
51. R. O. Duda and P. E. Hart, Pattern classification and scene analysis, Wiley,
New York, 1973.
52. D. Erdogmus, K.E. Hild II, Y.N. Rao, and J.C. Prı́ncipe, Minimax mutual in-
formation approach for independent component analysis, Neural Computation
16 (2004), 6, 1235–1252.
53. L. Fei-Fei and P. Perona, A Bayesian hierarchical model for learning natural
scene categories, Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, San Diego, USA, vol. 2, 2005, pp. 524–531.
54. D.J. Field, What is the goal of sensory coding? Neural Computation 6 (1994),
559–601.

55. M.A.T. Figueiredo and A.K. Jain, Unsupervised selection and estimation
of finite mixture models, International Conference on Pattern Recognition
(ICPR2000) (Barcelona, Spain), IEEE, 2000.
56. M.A.T. Figueiredo and A.K. Jain, Unsupervised learning of finite mixture
models, IEEE Transactions on Pattern Analysis and Machine Intelligence 24
(2002), 3, 381–399.
57. M.A.T. Figueiredo, J.M.N. Leitao, and A.K. Jain, Adaptive parametrically de-
formable contours, Proceedings of Energy Minimization Methods and Pattern
Recognition (EMMCVPR’97), Venice, Italy, 1997, pp. 35–50.
58. M.A.T. Figueiredo, J.M.N. Leitao, and A.K. Jain, Unsupervised contour rep-
resentation and estimation using B-splines and a minimum description length
criterion, IEEE Transactions on Image Processing 6 (2000), 9, 1075–1087.
59. M.A.T Figueiredo, J.M.N Leitao, and A.K. Jain, On fitting mixture models,
Energy Minimization Methods in Computer Vision and Pattern Recognition.
Lecture Notes in Computer Science 1654 (1999), 1, 54–69.
60. D.H. Fisher, Knowledge acquisition via incremental conceptual clustering,
Machine Learning (1987), 2, 139–172.
61. Y. Freund and R.E. Schapire, A decision-theoretic generalization of on-line
learning and an application to boosting, Journal of Computer and System
Sciences 1 (1997), 55, 119–139.
62. D. Geman and B. Jedynak, Model-based classification trees, IEEE Transactions
on Information Theory 3 (2001), 47, 1075–1082.
63. J. Goldberger, S. Gordon, and H. Greenspan, Unsupervised image-set cluster-
ing using an information theoretic framework, IEEE Transactions on Image
Processing 2 (2006), 449–458.
64. P.J. Green, Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination, Biometrika 4 (1995), 82, 711–732.
65. U. Grenander and M.I. Miller, Representation of knowledge in complex sys-
tems, Journal of the Royal Statistical Society Series B 4 (1994), 56, 569–603.
66. R. Gribonval, From projection pursuit and cart to adaptive discriminant anal-
ysis? IEEE Transactions on Neural Networks 16 (2005), 3, 522–532.
67. P.D. Grünwald, The minimum description length principle, MIT Press,
Cambridge, MA, 2007.
68. S. Guilles, Robust description and matching of images, Ph.D. thesis, University
of Oxford, 1998.
69. I. Guyon and A. Elisseeff, An introduction to variable and feature selection,
Journal of Machine Learning Research (2003), 3, 1157–1182.
70. A. Ben Hamza and H. Krim, Image registration and segmentation by maximiz-
ing the Jensen-Rényi divergence, Lecture Notes in Computer Science, EMM-
CVPR 2003, 2003, pp. 147–163.
71. J. Harris, Algebraic geometry, a first course, Springer-Verlag, New York, 1992.
72. T. Hastie and R. Tibshirani, Discriminant analysis by Gaussian mixtures, Jour-
nal of the Royal Statistical Society(B) 58 (1996), 1, 155–176.
73. A.O. Hero and O. Michel, Asymptotic theory of greedy aproximations to min-
nimal k-point random graphs, IEEE Transactions on Information Theory 45
(1999), 6, 1921–1939.
74. A.O. Hero and O. Michel, Applications of spanning entropic graphs, IEEE
Signal Processing Magazine 19 (2002), 5, 85–95.

75. K. Huang, Y. Ma, and R. Vidal, Minimum effective dimension for mixtures
of subspaces: A robust GPCA algorithm and its applications, Computer Vision
and Pattern Recognition Conference (CVPR04), vol. 2, 2004, pp. 631–638.
76. X. Huang, S.Z. Li, and Y. Wang, Jensen-Shannon boosting learning for object
recognition, IEEE Conference on Computer Vision and Pattern Recognition 2
(2005), 144–149.
77. A. Hyvärinen, New approximations of differential entropy for independent com-
ponent analysis and projection pursuit, Technical report, Helsinki University of
Technology, 1997.
78. A. Hyvarinen, J. Karhunen, and E. Oja, Independent component analysis,
Wiley, New York, 2001.
79. A. Hyvarinen and E. Oja, Independent component analysis: algorithms and
applications, Neural Networks 13 (2000), 4–5, 411–430.
80. A. Hyvärinen and E. Oja, A fast fixed-point algorithm for independent compo-
nent analysis, Neural Computation 9 (1997), 7, 1483–1492.
81. A.K. Jain and R. Dubes, Algorithms for clustering data, Prentice Hall, Engle-
wood Cliffs, NJ, 1988.
82. A.K. Jain, R. Dubes, and J. Mao, Statistical pattern recognition: a review,
IEEE Transactions on Pattern Analysis Machine Intelligence 22 (2000), 1,
4–38.
83. E.T. Jaynes, Information theory and statistical mechanics, Physical Review
106 (1957), 4, 620–630.
84. B. Jedynak, H. Zheng, and M. Daoudi, Skin detection using pairwise models,
Image and Vision Computing 23 (2005), 13, 1122–1130.
85. B. Jedynak, H. Zheng, and Daoudi M., Statistical models for skin detection,
Proceedings of IEEE International Conference on Computer Vision and Pat-
tern Recognition (CVPRV’03), Madison, USA, vol. 8, 2003, pp. 92–92.
86. R. Jin, R. Yan, J. Zhang, and A.G. Hauptmann, A faster iterative scaling algo-
rithm for conditional exponential model, Proceedings of the 20th International
Conference on Machine Learning (ICML 2003), Washington, USA, 2003.
87. G.H. John, R. Kohavi, and K. Pfleger, Irrelevant features and the sub-
set selection problem, International Conference on Machine Learning (1994),
pp. 121–129.
88. M.C. Jones and R. Sibson, What is projection pursuit?, Journal of the Royal
Statistical Society. Series A (General) 150 (1987), 1, 1–37.
89. M.J. Jones and J.M. Rehg, Statistical color models with applications to skin
detection, Proceedings of IEEE International Conference on Computer Vision
and Pattern Recognition, Ft. Collins, USA, 1999, pp. 1–8.
90. T. Kadir and M. Brady, Estimating statistics in arbitrary regions of interest,
Proceedings of the 16th British Machine Vision Conference, Oxford, UK, Vol. 2,
2005, pp. 589–598.
91. T. Kadir and M. Brady, Scale, saliency and image description, International
Journal on Computer Vision 2 (2001), 45, 83–105.
92. K. Kanatani, Motion segmentation by subspace separation and model selection,
International Conference on Computer Vision (ICCV01), vol. 2, 2001, pp. 586–
591.
93. M. Kass, A. Witkin, and D. Terzopoulos, Snakes: Active contour models, In-
ternational Journal on Computer Vision (1987), 1, 259–268.

94. Robert E. Kass and Larry Wasserman, A reference Bayesian test for nested hy-
potheses and its relationship to the Schwarz criterion, Journal of the American
Statistical Association 90 (1995), 928–934.
95. M. Kearns and L. Valiant, Cryptographic limitations on learning Boolean for-
mulae and finite automata, Journal of the ACM 1 (1994), 41, 67–95.
96. R. Kohavi and G.H. John, Wrappers for feature subset selection, Artificial
Intelligence 97 (1997), 1–2, 273–324.
97. D. Koller and M. Sahami, Toward optimal feature selection, Proceedings of
International Conference in Machine Learning, 1996, pp. 284–292.
98. S. Konishi, A.L. Yuille, and J.M. Coughlan, A statistical approach to multi-
scale edge detection, Image and Vision Computing 21 (2003), 1, 37–48.
99. S. Konishi, A.L. Yuille, J.M. Coughlan, and S.C. Zhu, Statistical edge detec-
tion: learning and evaluating edge cues, IEEE Transactions on Pattern Analysis
and Machine Intelligence 1 (2003), 25, 57–74.
100. S. Kullback, Information theory and statistics, Wiley, New York, 1959.
101. M.H.C. Law, M.A.T. Figueiredo, and A.K. Jain, Simultaneous feature selection
and clustering using mixture models, IEEE Transactions on Pattern Analysis
Machine Intelligence 26 (2004), 9, 1154–1166.
102. S. Lazebnik, C. Schmid, and Ponce P., Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories, Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, New York, USA,
vol. 2, 2006, pp. 2169–2178.
103. N. Leonenko, L. Pronzato, and V. Savani, A class of Rényi information esti-
mators for multidimensional densities, The Annals of Statistics 36 (2008), 5,
2153–2182.
104. J. Lin, Divergence measures based on the Shannon entropy, IEEE Transactions
on Information Theory 1 (1991), 37, 145–151.
105. R. Linsker, Self-organization in a perceptual network, Computer 3 (1988), 21,
105–117.
106. C. Liu and H.Y. Shum, Kullback–Leibler boosting, IEEE Conference on Com-
puter Vision and Pattern Recognition (2003), 407–411.
107. D. Lowe, Distinctive image features form scale-invariant keypoints, Interna-
tional Journal of Computer Vision 60 (2004), 2, 91–110.
108. S. Lyu, Infomax boosting, IEEE Conference on Computer Vision and Pattern
Recognition 1 (2005), 533–538.
109. Y. Ma, A.Y. Yang, H. Derksen, and R. Fossum, Estimation of subspace ar-
rangements with applications in modelling and segmenting mixed data, SIAM
Review 50 (2008), 3, 413–458.
110. J. Matas, O. Chum, U. Martin, and T. Pajdla, Robust wide baseline stereo from
maximally stable extremal regions, Proceedings of the British Machine Vision
Conference, Cardiff, Wales, UK, vol. 1, 2002, pp. 384–393.
111. G. McLachlan, Discriminant analysis and statistical pattern recognition, Wiley,
New York, 1992.
112. G. McLachlan and D. Peel, Finite mixture models, Wiley, 2000.
113. K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas, F. Schaf-
falitzky, T. Kadir, and L. Van Gool, A comparison of affine region detectors,
International Journal of Computer Vision 65 (2005), 1–2, 43–72.
114. W. Mio and X. Liu, Proceedings of the IEEE International Conference on Image Processing, Atlanta, USA, 2006, pp. 2113–2116.

115. A. Mokkadem, Estimation of the entropy and information of absolutely contin-


uous random variables, IEEE Transactions on Information Theory 35 (1989), 1,
193–196.
116. A. Peñalver, F. Escolano, and J.M. Sáez, Learning gaussian mixture mod-
els with entropy based criteria, Submitted to IEEE Transactions on Neural
Networks (2009).
117. H. Neemuchwala, A. Hero, S. Zabuawala, and P. Carson, Image registration
methods in high dimensional space, International Journal of Imaging Systems
and Technology 16 (2007), 5, 130–145.
118. F. Nielsen and R. Nock, On the smallest enclosing information disk, Informa-
tion Processing Letters 105 (2008), 93–97.
119. R. Nock and F. Nielsen, Fitting the smallest enclosing bregman ball, European
Conference on Machine Learning, ECML 2005, Lecture Notes in Computer
Science 3720, (2005), 649–656.
120. R. Nock and F. Nielsen, Bregman divergences and surrogates for learning,
IEEE Transactions on Pattern Analysis and Machine Intelligence (2009).
121. B.A. Olshausen and D.J. Field, Natural images statistics and efficient coding,
Network: Communication in Neural Systems 7 (1996), 2, 333–339.
122. E. Parzen, On estimation of a probability density function and mode, Annals
of Mathematical Statistics 33 (1962), 1, 1065–1076.
123. J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible
inference, Morgan Kaufmann, San Mateo, CA, 1998.
124. D. Pelleg and A. Moore, X-means: extending k-means with efficient estimation
of the number of clusters, Proceedings of the 17th International Conference on
Machine Learning, San Francisco, Morgan Kaufmann, 2000, pp. 727–734.
125. H. Peng, F. Long, and C. Ding, Feature selection based on mutual information:
criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 27 (2005), 8, 1226–1238.
126. A. Pentland, R.W. Picard, and S. Sclaroff, Photobook: content-based manipu-
lation of image databases, International Journal of Computer Vision 18 (1996),
3, 233–254.
127. S. Perkins, K. Lacker, and J. Theiler, Grafting: fast, incremental feature se-
lection by gradient descent in function space, Journal of Machine Learning
Research 3 (2003), 1333–1356.
128. A. Peter and A. Rangarajan, A new closed-form information metric for shape
analysis, Proceedings of MICCAI, Copenhagen, Denmark, 2006, pp. 249–256.
129. A. Peter and A. Rangarajan, Shape analysis using the Fisher-Rao Rieman-
nian metric: unifying shape representation and deformation, IEEE Interna-
tional Symposium on Biomedical Imaging: Nano to Macro, 2006, pp. 531–542.
130. D. Ponsa and A. López, Feature selection based on a new formulation
of the minimal-redundancy-maximal-relevance criterion, IbPRIA (1), 2007,
pp. 47–54.
131. W. Punch, A. Topchy, and A. Jain, Clustering ensembles: models of consensus
and weak partitions, IEEE Transactions on Pattern Analysis and Machine
Intelligence 27 (2005), 12, 1866–1881.
132. J.R. Quinlan, C4.5: Programs for machine learning, Morgan Kaufmann, San
Mateo, CA, 1993.
133. A. Rajwade, A. Banerjee, and A. Rangarajan, Continuous image representa-
tions avoid the histogram binning problem in mutual information based image

registration, Biomedical Imaging: Nano to Macro, 2006. Proceedings of the


Thrid IEEE International Symposium on Biomedical Imaging (ISBI) 2006,
pp. 840–843.
134. A. Rajwade, A. Banerjee, and A. Rangarajan, Probability density estimation
using isocontours and isosurfaces: Application to information theoretic image
registration, IEEE Transactions on Pattern Analysis and Machine Intelligence,
Miami, USA 31 (2009), 3, 475–491.
135. S. Rao, A. M. Martins, and J. C. Principe, Mean shift: an information theoretic
perspective, Pattern Recognition Letters 30(3) (2009), 222–230.
136. R.A. Redner and H.F. Walker, Mixture densities, maximum likelihood, and
the EM algorithm, SIAM Review 26 (1984), 2, 195–239.
137. A. Rényi, On measures of information and entropy, Proceedings of the
4th Berkeley Symposium on Mathematics, Statistics and Probability (1960),
Berkeley, USA, 547–561.
138. J. Rissanen, Modelling by the shortest data description, Automatica (1978), 14,
465–471.
139. J. Rissanen, A universal prior for integers and estimation by minimum descrip-
tion length, Annals of Statistics (1983), 11, 416–431.
140. J. Rissanen, Stochastic complexity in statistical inquiry, 1989, World Scientific,
River Edge, NJ.
141. C.P. Robert and G. Castella, Monte Carlo statistical methods, Series: Springer
Texts in Statistics, Springer-Verlag (1999).
142. R.E. Schapire, The strength of weak learnability, Machine Learning 2 (1990), 5,
197–227.
143. G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6
(1978), 2, 461–464.
144. J.E. Shore and R.W. Johnson, Properties of cross-entropy minimization, IEEE
Transactions on Information Theory 27 (1981), 4, 472–482.
145. K. Siddiqi, A. Shokoufandeh, S.J. Dickinson, and S.W. Zucker, Shock graphs
and shape matching, International Journal of Computer Vision 35 (1999), 1,
13–32.
146. C. Sima and E.R. Dougherty, What should be expected from feature selection
in small-sample settings, Bioinformatics 22 (2006), 19, 2430–2436.
147. J. Sivic and A. Zisserman, Video google: a text retrieval approach to object
matching in videos, Proceedings of the International Conference on Computer
Vision, Nice, France, vol. 2, 2003, pp. 1470–1477.
148. Q. Song, A robust information clustering algorithm, Neural Computation 17
(2005), 12 2672–2698.
149. L. Staib and J. Duncan, Boundary finding with parametrically deformable
models, IEEE Transactions on Pattern Analysis and Machine Intelligence
(1992), 14, 1061–1075.
150. A. Strehl and J. Ghosh, Cluster ensembles – a knowledge reuse framework for
combining partitionings, Proceedings of Conference on Artificial Intelligence
(AAAI 2002), Edmonton, pp. 93–98.
151. C. Strobl, A.-L. Boulesteix, A. Zeileis, and T. Hothorn, Bias in random for-
est variable importance measures: illustrations, sources and a solution, BMC
Bioinformatics 25 (2007), 8, 67–95.
152. M.J. Tarr and H.H. Bülthoff (Eds), Object recognition in man, monkey and
machine, MIT Press, Cambridge, MA, 1998.

153. D.M.J. Tax and R.P.W Duin, Support vector data description, Machine Learn-
ing 54 (2004), 45–66.
154. Z. Tu, X. Chen, A.L. Yuille, and S. Zhu, Image parsing: unifying segmenta-
tion, detection, and recognition, International Journal of Computer Vision 63
(2005), 2, 113–140.
155. A. Torsello and D.L. Dowe, Learning a generative model for structural represen-
tations, Proceedings of the Australasian Conference on Artificial Intelligence,
Auckland, New Zealand, 2008, pp. 573–583.
156. A. Torsello and E.R. Hancock, Learning shape-classes using a mixture of tree-
unions, IEEE Transactions on Pattern Analysis and Machine Intelligence 28
(2006), 6, 954–967.
157. I.W. Tsang, A. Kocsor, and J.T. Kwok, Simpler core vector machines with
enclosing balls, International Conference on Machine Learning, ICML 2007,
2007, pp. 911–918.
158. I.W. Tsang, J.T. Kwok, and P.-M. Cheung, Core vector machines: fast svm
training on very large datasets, Journal of Machine Learning Research 6 (2005),
363–392.
159. Z. Tu and S.-C. Zhu, Image segmentation by data-driven markov chain Monte
Carlo, IEEE Transactions on Pattern Analysis and Machine Intelligence 24
(2002), 5, 657–673.
160. M. Turk and A. Pentland, Eigenfaces for recognition, Journal of Cognitive
Neuroscience 3 (1991), 1.
161. N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton, SMEM algorithm for
mixture models, Neural Computation 12 (2000), 1, 2109–2128.
162. G. Unal, H. Krim, and A. Yezzi, Fast incorporation of optical flow into active
polygons, IEEE Transactions on Image Processing 6 (2005), 14, 745–759.
163. G. Unal, A. Yezzi, and H. Krim, Information-theoretic active polygons for
unsupervised texture segmentation, International Journal on Computer Vision
3 (2005), 62, 199–220.
164. M.J. van der Laan, Statistical inference for variable importance, International
Journal of Biostatistics 2 (2006), 1, 1008–1008.
165. V. N. Vapnik, Statistical learning theory, Wiley, New York, 1998.
166. N. Vasconcelos and M. Vasconcelos, Scalable discriminant feature selection for
image retrieval and recognition, Computer Vision and Pattern Recognition
Conference (CVPR04), 2004, pp. 770–775.
167. M.A. Vicente, P.O. Hoyer, and A. Hyvarinen, Equivalence of some common
linear feature extraction techniques for appearance-based object recognition
tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence 29
(2007), 5, 233–254.
168. R. Vidal, Y. Ma, and J. Piazzi, A new GPCA algorithm for clustering sub-
spaces by fitting, differentiating and dividing polynomials, Computer Vision
and Pattern Recognition Conference (CVPR04), 2004.
169. R. Vidal, Y. Ma, and S. Sastry, Generalized principal component analysis
(gpca), Computer Vision and Pattern Recognition Conference (CVPR04),
vol. 1, 2003, pp. 621–628.
170. P. Viola and M.J. Jones, Robust real-time face detection, International Journal
of Computer Vision 2 (2004), 57, 137–154.
171. P. Viola and W.M. Wells-III, Alignment by maximization of mutual informa-
tion, 5th International Conference on Computer Vision, vol. 2, IEEE, 1997,
pp. 137–154.

172. N. Vlassis, A. Likas, and B. Krose, A multivariate kurtosis-based dynamic


approach to Gaussian mixture modeling, 2000, Intelligent Autonomous Sys-
tems Technical Report.
173. C.S. Wallace and D.M. Boulton, An information measure for classification,
Computer Journal 11 (1968), 2, 185–194.
174. F. Wang, B.C. Vemuri, A. Rangarajan, and S.J. Eisenschenk, Simultaneous
nonrigid registration of multiple point sets and atlas construction, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 30 (2008), 11, 2011–2022.
175. G. Winkler, Image analysis, random fields and Markov chain Monte Carlo
methods: a mathematical introduction, Springer, Berlin, 2003.
176. L. Xu, BYY harmony learning, structural RPCL, and topological self-
organizing on mixture models, Neural Networks 15 (2002), 1, 1125–1151.
177. J.S. Yedidia, W.T. Freeman, and Y. Weiss., Understanding belief propagation
and its generalisations, Tech. Report TR-2001-22, Mitsubishi Research Labo-
ratories, January 2002.
178. J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, Local features and ker-
nels for classification of texture and object categories: a comprehensive study,
International Journal of Computer Vision (2007), no. DOI: 10.1007/s11263-
006-9794-4.
179. Jie Zhang and A. Rangarajan, Affine image registration using a new informa-
tion metric, Proceedings of the 2004 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR’04), Washington DC, USA,
Vol. 1, 2004, pp. 848–855.
180. H. Zheng, M. Daoudi, and B. Jedynak, Blocking adult images based on statistical
skin detection, Electronic Letters on Computer Vision and Image Analysis 4
(2004), 2, 1–14.
181. Y-T. Zheng, S-Y. Neo, T-S. Chua, and Q. Tian, Visual synset: towards a
higher level representations, Proceedings of IEEE International Conference on
Computer Vision and Pattern Recognition, Anchorage, USA, 2008, pp. 1–8.
182. S.C. Zhu, Y.N. Wu, and D. Mumford, Filters, Random Fields And Maximum
Entropy (FRAME): towards a unified theory for texture modeling, Interna-
tional Journal of Computer Vision 27 (1997), 2, 107–126.
183. S.C. Zhu, Y.N. Wu, and D. Mumford, Minimax entropy principle and its ap-
plications to texture modeling, Neural Computation 9 (1997), 8, 1627–1660.
184. S.C. Zhu and A.L. Yuille, Region competition: unifying snakes, region grow-
ing, and bayes/mdl for multiband image segmentation, IEEE Transactions on
Pattern Analysis and Machine Intelligence 9 (1996), 18, 884–900.
185. K. Zyczkowski, Renyi extrapolation of Shannon entropy, Open Systems and
Information Dynamics 10 (2003), 3, 297–310.
Index

α-mutual information, 131 classification, 211


clustering ensembles, 197
active contours, 44 co-information, 121
active polygon, 46 conditional entropy, 22, 25, 107, 120,
active polygons, 44 228
ADA, 254 conditional independence, 234
Adaboost, 89, 298 conditional mutual information, 121
affine aligment, 123 consensus function, 199
agglomerative clustering, 150 contour, 53, 60
agglomerative information bottleneck, cross validation, 214
177
AIC, 258, 259 data-driven Markov chain, 88
aligment, 106 detection, 86
ambient space, 254, 259, 261 deterministic annealing, 184
arrangement of subspaces, 254 differential geometry, 140
asymptotic equipartition property, 137 discriminative, 99
average residual, 260 distributional shape model, 132

B-Splines, 53, 58 ED: effective dimension, 258, 259, 261


Bayesian error, 177, 222 edge detection, 21
Belief propagation, 83 edge localization, 23
belief propagation, 84 effective dimension, 254
bending energy, 135 embedded data matrix, 256
Bhattacharya coefficient, 131 entropic graphs, 162
Bhattacharyya distance, 17 entropic spanning graphs, 163
BIC criterion, 208, 258 entropy, 12, 109, 228
Blahut–Arimoto algorithm, 172, 175 entropy estimation, 126, 162
boosting, 298 entropy of degree, 204
bypass entropy estimation, 126, 162 ergodicity, 93
error tolerance, 261
categorical data, 228 expectation maximization, 159, 202
chain rule of mutual information, 231
channel capacity, 184 FastICA, 244
Chernoff Information, 16, 23 feature selection, 211


filter feature selection, 212, 220 K-adventurers, 74


filter pursuit, 243 K-NN, 219
Fisher information matrix, 141 kernel, 157
Fisher–Rao metric tensor, 141 kernel splitting, 167
FRAME algorithm, 241 knee point, 260
Kullback–Leibler divergence, 18, 48, 75,
GAIC criterion, 260 181, 222
Gaussian blurring mean shift, 190
Gaussian mean shift, 191 LDA, 211, 254
Gaussian mixture, 132, 157, 181 log-likelihood, 90, 159, 202
Gaussianity, 162
gene selection, 236 Markov blanket, 234
generative model, 91 Markov chain, 240
Gibbs sampler, 240 Markov chain Monte Carlo (MCMC),
Gibbs theorem, 162 240
gPCA, 254 markov random fields, 79
gPCA segmentation, 264 maximum a posteriori, 159
gradient flow, 45, 46 maximum entropy principle, 79, 238
Grassman dimension, 259 maximum likelihood, 158, 159
Grassmanian manifold, 259 maximum residual, 260
greedy algorithm, 228 MDL criterion, 258
mean shift, 189
Hellinger dissimilarity, 131 MED gPCA algorithm, 258
hidden variables, 159 MED: minimum effective dimension,
258, 260, 261
ICA, 211, 244 Metropolis-Hastings dynamics, 91
image parsing, 86 microarray, 236
Infomax, 244 minimal spanning tree, 126
infomax boosting, 299, 304 minimax entropy principle, 243
infomax feature, 301 Minimax ICA, 249
infomax principle, 301 minimum description length, 53, 58, 60,
information bottleneck, 175 166
information bottleneck principle, 170 minimum description length advantage,
information geometry, 140 151
information particles, 194 minimum description length criterion,
information plane, 180 148
information potential, 194 minimum message length, 266
interaction information, 121 minimum message length principle, 153
interest points, 11 mixture model, 200
model order selection, 74, 161
Jacobian, 257 model selection, 157
Jensen’s inequality, 47 model-order selection, 258
Jensen–Rényi divergence, 128 monotonicity, 93
Jensen–Shannon divergence, 46, 136, Monte Carlo Markov chain, 75
178, 307 Monte Carlo simulation, 181
Jensen–Shannon feature, 305 mutual information, 97, 111, 203, 222,
joint entropy, 111, 228 227
JSBoosting, 307
jump-diffusion, 63 null space, 256

object recognition, 86 shock trees, 146


simulated annealing, 241
parsing tree, 86 singular value, 256
Parzen window, 73, 189, 193 singular vectors, left and right, 256
Parzen windows, 163 speed function, 46
Parzen’s window, 108 structural risk minimization, 186
PCA, 211, 244 sub-pixel interpolation, 113
plug-in entropy estimation, 162 subspace angle, 262
plug-in estimation, 126 support vector machines, 186
projection pursuit, 254 SVD and PCA, 260
SVD, singular value decomposition, 256
quadratic divergence, 303 symmetric uncertainty, 120
quadratic mutual information, 303

Rényi entropy, 125, 126, 162 thin-plate splines, 134


Rényi α-divergence, 129 total correlation, 122
Rényi cross entropy, 194 trees mixture, 147
Rényi quadratic entropy, 193
rate distortion theory, 170 uncertainty coefficients, 119
residual error, 260 unsupervised classification, 197
reversibility, 93 unsupervised learning, 147, 258
robust information clustering, 184
robustness, 258 Vapnik and Chervonenkis bound, 186
Vapnik and Chervonenkis dimension,
saliency, 12 186
Sampson distance, 258 Veronese map, embedding, 256
Sanov’s theorem, 26
scale space, 13 weak learner, 90
segmentation, 44, 169 wrapper feature selection, 212
semi-supervised learning, 257
Shannon entropy, 162 X-means, 208
Color Plates

Fig. 1.1. The ITinCVPR tube/underground (lines) communicating several problems (quarters) and stopping at several stations.
Fig. 3.12. Jump-diffusion energy evolution (energy vs. step for the pure generative and the data-driven variants: edges, Hough transform, and Hough and edges).

Fig. 3.13. The solution space in which the K-adventurers algorithm has selected K = 6 representative solutions, which are marked with rectangles.
Fig. 3.16. Skin detection results. Comparison between the baseline model (top-right), the tree approximation of MRFs with BP (bottom-left), and the tree approximation of the first-order model with BP instead of Alg. 5. Figure by B. Jedynak, H. Zheng and M. Daoudi (© 2003 IEEE).
Fig. 3.17. A complex image parsing graph with many levels and types of patterns. (Figure by Tu et al., © 2005 Springer.)

Fig. 5.10. Color image segmentation results. Original images (first column) and
color image segmentation with different Gaussianity deficiency levels (second and
third columns). (Courtesy of A. Peñalver.)
Fig. 5.17. From left to right and from top to bottom. (a) Image representation. Each ellipsoid represents a Gaussian in the Gaussian Mixture Model of the image, with its support region, mean color and spatial layout in the image plane. (b) Loss of mutual information during the AIB clustering. The last steps are labeled with the number of clusters in each step. (c) Part of the cluster tree formed during AIB, starting from 19 clusters. Each cluster is represented with a representative image. The labeled nodes indicate the order of cluster merging, following the plot in (b). (d) I(T;X) vs. I(T;Y) plot for four different clustering methods. (e) Mutual Information between images and image representations. (Figure by Goldberger et al., © 2006 IEEE.)
Fig. 6.1. A 3D reconstruction of the route followed during the acquisition of the data set, and examples of each one of the six classes. (Image obtained with 6-DOF SLAM. Figure by Juan Manuel Sáez, © 2007 IEEE.)

Fig. 6.12. Feature selection on the NCI DNA microarray data. The MD (left) and mRMR (right) criteria were used. Features (genes) selected by both criteria are marked with an arrow.
Fig. 6.22. Unsupervised segmentation with gPCA. (Figures by Huang et al., © 2004 IEEE.)

Fig. 7.7. Left: appearance and shape histograms computed at different levels of the pyramid (the number of bins of each histogram depends on the level of the pyramid where it is computed). Right: several ROIs learned for different categories in the Caltech-256 database (256 categories). (Figure by A. Bosch, A. Zisserman and X. Muñoz [27], © 2007 IEEE.)
Fig. 7.13. Top: SVDD using different polynomial kernels (degrees d = 1, 3, 6). Bottom: SVDD using different Gaussian kernels (σ = 1, 5, 15), for C = 25.0 and C = 0.1. In both cases, the circles denote the support vectors. Figure by D. M. J. Tax and R. P. W. Duin [153] (© 2004 Elsevier).
