PROGRESS IN COMPUTER VISION AND IMAGE ANALYSIS
Editors
Horst Bunke
University of Bern, Switzerland
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TA I P E I • CHENNAI
Published by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.
ISBN-13 978-981-283-445-4
ISBN-10 981-283-445-1
Printed in Singapore.
PREFACE
An image is worth more than ten thousand words - and for that reason Computer
Vision has received an enormous amount of attention from several scientific and
technological communities in recent decades. Computer Vision is defined as
the process of extracting useful information from images in order to be able to
perform other tasks.
An image usually contains a huge amount of information that can be utilized
in various contexts. Depending on the particular application, one may be inter-
ested, for example, in salient features for object classification, texture properties,
color information, or motion. The automated procedure of extracting meaning-
ful information from an input image and deriving an abstract representation of its
contents is the goal of Computer Vision and Image Analysis, which appears to be
an essential processing stage for a number of applications such as medical image
interpretation, video analysis, text understanding, security screening and surveil-
lance, three-dimensional modelling, robot vision, as well as automatic vehicle or
robot guidance.
This book provides a representative collection of papers describing advances
in research and development in the fields of Computer Vision and Image Analysis,
and their applications to different problems. It shows advanced techniques related
to PDEs, wavelet analysis, deformable models, multiple classifiers, neural net-
works, fuzzy sets, optimization techniques, and genetic programming, among others.
It also includes valuable material on watermarking, image compression, image
segmentation, handwritten text recognition, machine learning, motion tracking
and segmentation, gesture recognition, biometrics, shadow detection, video pro-
cessing, and others.
All contributions have been selected from the peer-reviewed international sci-
entific journal ELCVIA (http://elcvia.cvc.uab.es). The contributing authors (as
well as the reviewers) are all established researchers in the field and they pro-
vide a representative overview of the available techniques and applications of this
broad and quickly emerging field.
H. Bunke
J.J. Villanueva
G. Sanchez
X. Otazu
CHAPTER 1
1.1. Introduction
The addition of temporal information in visual processing is a strong cue for un-
derstanding structure and 3D motion. Two main sub-problems appear when dealing
with motion analysis: correspondence and reconstruction. The first issue (correspondence)
concerns determining which elements of a frame correspond to which elements in the
following images of a sequence. From el-
contain information about its translations and rotations, the quadratic terms will
explain the projective behavior, and so forth. Each step is utilized as the input
to the next one; that is, once the eigen-subspace is computed, we show how
the transformations are estimated; the images are then registered according to
these estimates, and the eigen-subspace is built again from the images registered in
the previous step. These two steps are iterated until the error function converges
within a certain tolerance.
The outline of the paper is as follows: section 2 frames the idea of using the
eigenfeatures approach and its relation to the parametric model of transforma-
tions. More specifically, we analyze how such an appearance subspace is built
according to a previously selected frame of reference. Then, a polynomial
model is introduced in order to link the appearance constraints to the transforma-
tions that occurred across the sequence. In the experimental results, section 3, we
show a new manner of encoding temporal information. We point out that when
parallax is involved in the problem of video registration, the temporal represen-
tation gives a visual notion of the depth in the scene, and therefore it offers the
possibility of extracting the affine 3D structure from multiple views. The relation
between the degree of the surface polynomial and the 3D affine structure is also illustrated.
Section 4 presents the summary and conclusions of this paper.
In this section, we present an objective function which takes into account ap-
pearance representation and time evolution between each frame and a frame of
reference. In this case, the estimation of temporal transformations is based on the fact
that images belonging to a coherent sequence are also related by means of their
appearance representation.
Given a sequence of F images {I_1, . . . , I_F} (of n rows and m columns) and
a selected frame of reference I_0, we can write them in terms of column vectors
{y_1, . . . , y_F} and y_0 of dimension d = n × m. Both the pixel-based picture I_i and the
vector form y_i of the i-th image in the sequence are relevant in the description of
our method. The first representation, I_i, is useful to describe the transformations
that occurred to each pixel. The vector-form picture is utilized for analyzing the
underlying appearance throughout the sequence.
Under the assumption of brightness constancy, each frame in the sequence Ii
can be written as the result of a Taylor’s expansion around the frame of reference
I0 :
y_i = y_0 + t_i    (1.2)
where t_i is the vector form of the second summand ∇I_0(x)^T ω_i(x) in eq. (1.1).
The first description is exploited in section 1.2.2, where the parametric polynomial
model describing the velocity field estimates is applied. The vector-form de-
scription in eq. (1.2) is employed in the following section 1.2.1 to develop the
appearance analysis with respect to a chosen reference frame.
First of all, we need to define a space of features where images are represented as
points. This problem involves finding a representation as a support for analyzing
the temporal evolution. To address the problem of appearance representation,
the authors in12–14 proposed Principal Component Analysis as a redundancy reduction
technique in order to preserve the semantics, i.e. perceptual similarities, during the
codification process of the principal features. The idea is to find a small number of
causes that in combination are able to reconstruct the appearance representation.
One of the most common approaches for explaining a data set is to assume
that causes act in linear combination:
y_i = W ξ_i + y_0    (1.3)
where ξ_i ∈ ℝ^q (our chosen reduced representation, q < d) are the causes and y_0
corresponds to the selected frame of reference. The q vectors that span the basis
are the columns of W (a d × q matrix), where the variation between the different
images y_i and the reference frame is encoded.
With regard to equation (1.2), and considering the mentioned approximation
in (1.3), we can see that the difference ti between the frame of reference y0 and
each image yi in the sequence is described by the linear combination W ξi of the
vectors that span the basis in W . Notice that in the usual PCA techniques y0 plays
the role of the sample mean. In recognition algorithms this fact is relevant, since
it is assumed that each sample is approximated by the mean (ideal pattern) with
an added variation which is given by the subspace W. However, in our approach,
each image yi tends to the frame of reference y0 with a certain degree of variation,
which is represented as a linear combination of the basis W .
Furthermore, from eq. (1.1), the difference t_i, which relies on the linear com-
bination of the appearance basis vectors, can be described in terms of the para-
metric model which defines the transformation from the reference frame y_0 to
each image y_i. This parametric model is developed in the following section 1.2.2.
E(W, . . . , p_{i1}, . . . , p_{ir}, . . . ) = ∑_{i=1}^{F} | t_i(p_{i1}, . . . , p_{ir}) − W ξ_i |²    (1.4)
w_i(x) = X(x) P_i    (1.5)
with
Ω(x) = [ 1  x  y  xy  x²  . . .  x^l y^k  . . .  y^s ]
Given that the term G_{d×2d} X(x)_{2d×r} is computed once for all the images in an iter-
ation, we rename it as Ψ_{d×r} = G_{d×2d} X(x)_{2d×r}. Notice that even when images
are highly dimensional (e.g. d = 240 × 320), the computation of Ψ can be per-
formed easily in Matlab by means of the operator ".*", without running out
of memory.
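To make the polynomial model concrete, the following minimal sketch (in Python with NumPy, not part of the original chapter) builds the monomial basis Ω(x) at every pixel and evaluates a velocity field of the form of eq. (1.5); the truncation of the monomial set by total degree and the names monomial_basis, velocity_field and P are illustrative assumptions.

import numpy as np

def monomial_basis(h, w, degree=1):
    """Evaluate the monomials 1, x, y, xy, x^2, ... up to total degree
    `degree` at every pixel of an h-by-w image. Returns a (d, m) array,
    with d = h*w pixels and m monomials (the rows of Omega(x))."""
    ys, xs = np.mgrid[0:h, 0:w]
    x = xs.ravel().astype(float)
    y = ys.ravel().astype(float)
    cols = [x**l * y**k
            for l in range(degree + 1)
            for k in range(degree + 1 - l)]
    return np.stack(cols, axis=1)

def velocity_field(h, w, P, degree=1):
    """Velocity field w_i(x) = X(x) P_i, assuming the same monomial basis
    is applied to both components, so P has shape (2, m)."""
    Omega = monomial_basis(h, w, degree)    # (d, m)
    u = Omega @ P[0]                        # horizontal component
    v = Omega @ P[1]                        # vertical component
    return u.reshape(h, w), v.reshape(h, w)

For degree 1 the model reduces to an affine flow (translation plus linear terms); higher degrees add the quadratic, projective-like behavior mentioned above.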
Given the parametric model for the transformations of the images in a sequence,
the objective function (1.4) can be written explicitly in terms of the parameters to
be estimated:
E(W, P_1, . . . , P_F) = ∑_{i=1}^{F} | Ψ P_i − W ξ_i |²    (1.7)
In order to minimize this objective function, we need a two-step procedure: first,
given a set of images, the subspace of appearance W is computed, and secondly,
once the parameters Pi that register each frame yi to the frame of reference y0
are obtained, the images are registered in order to build again a new subspace of
appearance.
Σ = ∑_{i=1}^{F} (φ_i(y_i, P_i) − y_0)(φ_i(y_i, P_i) − y_0)^T    (1.8)
The column vectors of W correspond to the first q eigenvectors of (1.8), ordered
from the largest eigenvalue to the smallest one. The projected coordinates onto
the appearance subspace are ξ_i = W^T (φ_i(y_i, P_i) − y_0).
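A minimal sketch of this appearance-subspace step follows, assuming the registered frames φ_i(y_i, P_i) are available as the columns of a matrix; the thin SVD is used here as a numerically convenient stand-in for the explicit eigen-decomposition of the d × d matrix in eq. (1.8) (an implementation choice, not stated in the chapter).

import numpy as np

def appearance_subspace(Y_registered, y0, q):
    """Eq. (1.8): scatter of the registered frames around the reference
    frame y0 (not the mean), keeping the q leading eigenvectors.
    Y_registered: (d, F) array, one registered frame per column.
    Returns W (d, q) and the projections xi (q, F)."""
    D = Y_registered - y0[:, None]          # phi_i(y_i, P_i) - y_0
    # Eigenvectors of D D^T via the thin SVD of D, which avoids
    # forming the d-by-d matrix explicitly for large images.
    U, s, _ = np.linalg.svd(D, full_matrices=False)
    W = U[:, :q]                            # eigenvectors of largest eigenvalues
    xi = W.T @ D                            # xi_i = W^T (phi_i - y_0)
    return W, xi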
In order to see the range of applications of this technique, we deal with two sorts of
problems. First, we study a camera movement, showing the different re-
sults that appear depending on the selected frame of reference.
In particular, this camera movement is a zoom that can be interpreted, in terms of
registration, as a zoom-in or zoom-out operation depending on the selection of the
reference frame. Secondly, the significance of the polynomial's degree is analyzed
through a sequence that includes a moving object due to a parallax effect.
Fig. 1.1. Some selected frames (1st, 3rd, 5th) from the subsampled sequence, corresponding to frames 1, 41, 81 from the original one.
This topic is about camera operations with a single planar motion. Figure 1.1
shows three frames from a sequence of 100 frames, where a zoom-in is originally
performed. In this particular case, we selected 5 frames (1st, 21st, 41st, 61st, 81st)
from the original sequence to perform this analysis. This was done in order
to exploit the fact that the images do not have to be taken continuously; the key point
is that they are related by the same underlying appearance. Here, we analyze
two cases depending on the selection of the reference frame: zoom-in registration
(fig. 1.2) and zoom-out registration (fig. 1.3).
Figure 1.2 shows a zoom-in registration that has been obtained selecting as
reference frame the left side image in fig. 1.1. To this end, we utilized a linear
polynomial model (1 degree), and the subspace of appearance has been built using
just one eigenvector, given that appearance is mainly conserved in the sequence.
The point is that the dimension depends not only on the reconstruction error, as in a
recognition problem,12–14 but also on the selection of the frame of reference.
Figure 1.2(a) shows the time evolution of the registered sequence images, while
in figure 1.2(d) the registration picture also shows the module of the velocity field
at each pixel. The latter figure gives a notion of the location of the camera's cen-
ter, which is highly useful to perform an analysis of camera operations from this
registration technique. Figures 1.2(b) and (c) show the estimated optical flow field,
computed with respect to the reference frame, for some frames of the sequence.
When registering from this vector field, we have to take the inverse of the direc-
tion indicated by each arrow.
Besides, even though the sequence evolution showed a zoom-in camera oper-
ation, we can register selecting as reference frame the last frame, (see right side
image in fig. 1.1). The main difference between the registrations in figure 1.2 and
figure 1.3 is the size of the final mosaic (top views of fig. 1.2(a) and fig. 1.3(a)).
Fig. 1.2. Zoom in: (a) Registered images according to a 1 degree polynomial model, where the first
frame has been taken as reference frame. Optical flow field corresponding to the third frame (b), and
to the last frame (c). (d) Velocity field module representation of the sequence of images. (e) Top view
of (d).
Actually, when the first frame is selected as reference frame, the size of the final
mosaic is equal to that of the reference frame. However, taking the last frame as
reference frame (the case of fig. 1.3), the size of the final mosaic is bigger than the
size of the reference frame. This is clearly reflected in the module representations
of the sequence registration, figures 1.2(d) and 1.3(d).
In order to get an insight into the relation between the complexity of the polyno-
mial estimation of the velocity field and the 3D affine structure which is encoded
in the image sequence, we deal with three sorts of experiments. The idea is to see
the variety of possibilities that the polynomial surface model offers in this regis-
tration framework. The three cases present different relative motions across the image
sequence.
Fig. 1.3. Zoom out: (a) Registered images according to a 1 degree polynomial model, where the last
frame has been taken as reference frame. Optical flow field corresponding to the third frame (b), and
to the first frame (c). (d) Velocity field module representation of the sequence of images.(e) Top view
of (d), where the red lines show the original size of the reference frame.
Fig. 1.5. Velocity field module representation (a) of the registered images, where 2 eigenvectors of
appearance and a polynomial model of 3rd degree have been used for this estimation. Fig. (b) is the
top view of (a). Two views, (c) and (d), of the 3D affine structure of the sequence.
Fig. 1.6. Three frames of a sequence of five images, where 3 eigenvectors of appearance and a
polynomial model of 4th degree have been used for the registration process. The right-side image shows
the estimated optical flow with respect to the middle frame (which corresponds to the third one in the
sequence). The left-side one is the computed optical flow with respect to the middle one.
Fig. 1.7. Different views of the 3D affine structure estimation of the sequence in fig. 1.6.
The third experiment deals with a translational camera motion. Two main motion
layers are present in this sequence due to a parallax effect. Figure 1.8 shows three
frames of a sequence of five, where the tree belongs to a different motion layer
than the background (houses). Apparently, the sequence can be interpreted as a
moving object with a moving background as well. Nevertheless, the cause is the
difference in depth between the tree and the background and, moreover,
the specific movement of the camera. The registration has been performed using
2 eigenvectors of basis appearance and a 3rd degree polynomial model for the
motion field. The result of this can be seen in figures 1.9 (a) and (b). More specif-
ically, figure 1.9 (a) gives a certain notion of the relative depth among different
regions in the images, due to the module representation of the velocity field; re-
gions with higher velocity module are meant to be nearer the camera than regions
with a lower module. Figure 1.9 (b) shows a top view of (a), where the result of
registering is regarded in terms of a mosaic image. Finally, figure 1.9(c) shows the
3D affine structure estimation using,15 where all the image pixels in the sequence
have been employed. With this, we can see that the final 3D smooth surface shows
the mentioned depth difference due to parallax.
Fig. 1.8. Three frames of a sequence of five images. These images correspond to the 1st, 3rd and 5th
(from right side to left side).
Acknowledgments
References
CHAPTER 2
AN INTERACTIVE ALGORITHM FOR IMAGE SMOOTHING AND SEGMENTATION
M. C. de Andrade
Centro de Desenvolvimento da Tecnologia Nuclear - CDTN, P.O. Box 941,
Belo Horizonte, MG, Brazil
1. Introduction
Image denoising and segmentation play an important role in image
analysis and computer vision. Image denoising reduces the noise
introduced by the image acquisition process, while image segmentation
recovers the regions associated with the objects they represent in a given
image. Image segmentation typically relies on semantically poor
information, directly obtained from the image around a spatially
restricted neighborhood and, for this reason, is broadly classified as a
low-level treatment [6].
Image segmentation often requires pre- and post-processing steps,
where user judgment is fundamental and feeds information of highly
semantic content back into the process. Pre-processing is an essential
step, in which specialized filters smooth the image, simplifying it for the
Figure 1. (a) original image of bovine endothelial cells. (b) and (c) simultaneous
convergence problem. (d) leaking through weak or diffuse edges.
Figure 1 cont. (e) leaking through weak or diffuse edges. (f) over-segmentation occurs if
appropriate denoising is not provided.
Figure 2. ISS algorithm in action: (a) anisotropic filter, (b) sorted image, (c) seeds as light dots
(local minima), (d) 52 markers placed by the user.
Figure 2 cont. ISS algorithm in action: (e to h) sequence of snapshots showing region
growing.
Figure 2 cont. ISS algorithm in action: (o) last snapshot showing final segmentation, (p)
edges and markers superimposed on the original image.
κ = [ I_xx (1 + I_y²) − 2 I_x I_y I_xy + I_yy (1 + I_x²) ] / [ 2 (1 + I_x² + I_y²)^{3/2} ]    (1)

I_{t+1} = I_t + κ    (2)
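One iteration of eqs. (1)-(2) can be sketched as follows (Python/NumPy); the central-difference discretization and the boundary handling are not specified in the chapter and are chosen here only for illustration.

import numpy as np

def curvature_step(I):
    """One iteration of the curvature-based smoothing of eqs. (1)-(2):
    I_{t+1} = I_t + kappa, with kappa computed from first and second
    central differences (an assumed discretization)."""
    Iy, Ix = np.gradient(I.astype(float))   # first derivatives
    Ixy, Ixx = np.gradient(Ix)              # derivatives of Ix (rows, cols)
    Iyy, _ = np.gradient(Iy)
    num = Ixx * (1 + Iy**2) - 2 * Ix * Iy * Ixy + Iyy * (1 + Ix**2)
    den = 2 * (1 + Ix**2 + Iy**2) ** 1.5
    return I + num / den

# The number of iterations is left to user judgement:
# for _ in range(n_iterations): I = curvature_step(I)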
The decision regarding when to stop the iterative process depends on the
image characteristics and on the regions to be extracted. At each step, the
image is slightly “flattened” according to its local curvature. It is
important to notice that repeatedly applying this filter may “erase” the
image, therefore user judgement is crucial in deciding when to stop. If
features being extracted are relatively homogeneous a slight denoising
Figure 4 cont. Effect of denoising on ISS segmentation result: (a) original, (c) after 80 iterations, (d)
to (f) segmentation results for (a) to (b).
The first relation is useful when the image characteristics are such
that the gray-levels already dictate a natural processing order. In the
example shown in Figure 2a, the regions already have edges at higher
elevations than their inner parts. The second relation is useful for images
having homogeneous textures. The third relation is useful, for instance,
in images having discrete transitions between the regions having
homogeneous gray-levels, as shown in Figure 4a. In this case, taking the
difference between the maximum and the minimum in N(x) forces
higher values at the edges and also has the additional benefit of closing
small gaps at the borders.
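The first and third ordering relations can be expressed as per-pixel sort keys, as in the following sketch; the 3x3 neighborhood used for N(x) and the name sort_keys are assumptions, and the second (texture-based) relation is omitted because its exact definition is not given in the surrounding text.

import numpy as np
from scipy import ndimage

def sort_keys(I, relation="gray"):
    """Per-pixel key used to decide the processing order of the flooding.
    'gray'     : the gray level itself (first relation);
    'gradient' : max - min over the 3x3 neighborhood N(x), i.e. the
                 morphological gradient (third relation)."""
    if relation == "gray":
        return I.astype(float)
    mx = ndimage.maximum_filter(I.astype(float), size=3)
    mn = ndimage.minimum_filter(I.astype(float), size=3)
    return mx - mn

# Pixels are then processed in ascending order of the key:
# order = np.argsort(key, axis=None)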
Finally, by adding a merging mechanism, controlled by user-placed
seeds, the region-growing and merging process is complete. A
correspondence table, as shown below, can be used to merge the regions.
This table is initialized as a sequence of integers from 1 to N, where N is
the number of minima present in the image. It is updated according to the
temporal sequence of absorptions. If, for instance, the region having
before 1 2 3 4 5 … i … N
after 1 2 1 4 5 … i … N
• There is only one labeled pixel in N(p). The current pixel receives
this label and is integrated into the corresponding neighbor region.
• There are 2 or more positively labeled pixels in N(p). If 2 or more
neighbors have marker labels (label <= N), a border has been found:
mark the current pixel as a "border", say with a -1 label. Otherwise merge
all neighbors into one region (the one having the smallest label, i.e.,
the first labeled in N(p)) and add the current pixel to it. If there are 2
labeled pixels in N(p) and one has a marker label and the other a seed
label, the one having the marker label absorbs the one having the seed
label.
4. By using the merging table, re-label all pixels to reflect the absorptions
they have undergone.
5. Draw the segmented image according to the newly assigned labels. (A compact sketch of this labeling-and-merging step is given below.)
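The following Python/NumPy fragment (not the original ISS code) illustrates the data structures behind these rules: pixels are visited in the chosen order, labels are propagated from N(p), and absorptions are recorded in the correspondence table; the creation of new seeds at unlabeled minima and the full marker/seed bookkeeping of ISS are deliberately left out.

import numpy as np

def resolve(table, lab):
    """Follow the correspondence table until a label maps to itself."""
    while table[lab] != lab:
        lab = table[lab]
    return lab

def grow(I, seeds, n_markers, order):
    """Simplified ISS-style flooding. `seeds` holds initial labels
    (1..n_markers are user markers, larger labels are automatic seeds,
    0 elsewhere); `order` is the pixel processing order (flattened
    indices, already sorted by the chosen key)."""
    h, w = I.shape
    labels = seeds.copy()
    table = np.arange(labels.max() + 1)          # correspondence table 1..N
    for idx in order:
        r, c = divmod(idx, w)
        if labels[r, c] != 0:
            continue
        neigh = {resolve(table, labels[rr, cc])
                 for rr in range(max(r - 1, 0), min(r + 2, h))
                 for cc in range(max(c - 1, 0), min(c + 2, w))
                 if labels[rr, cc] > 0}
        if len(neigh) == 1:
            labels[r, c] = neigh.pop()           # single labeled neighbour
        elif len(neigh) > 1:
            markers = [l for l in neigh if l <= n_markers]
            if len(markers) >= 2:
                labels[r, c] = -1                # border between marked regions
            else:
                keep = min(neigh)                # smallest label absorbs the rest
                for l in neigh:
                    table[l] = keep
                labels[r, c] = keep
    # Re-label to reflect the absorptions (steps 4 and 5 above).
    pos = labels > 0
    labels[pos] = [resolve(table, l) for l in labels[pos]]
    return labels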
Appendices A and B present the ISS pseudo-code and the ISS execution times
for the test images, respectively.
3. Applications
This section illustrates some practical results obtained with the ISS
algorithm for different classes of images, and also the segmentation
obtained with other methods. Figures 6, 7 and 8 present the ISS
segmentation of microscopic ceramic, geological and medical images.
Figure 9 illustrates the performance of ISS and other
segmentation methods on different kinds of images. In the segmented
images, user selected markers are shown as green dots and the extracted
edges are shown as red lines. Figure 6a presents a micrograph of ceramic
material containing grains (dark gray) separated by thin gaps (light gray).
Observing that pixels on the edges are lighter than inside grains, they
were sorted and processed according to the original intensity of the gray
levels, i.e., from darker to lighter. Figure 6b shows the ISS segmentation
result. Figure 7a shows a color micrograph of a geological sample
containing several grains. As this image presents homogeneous regions
and discrete transitions between them, pixels were sorted in ascending
order and processed according to the intensities of the morphological
gradient (difference between maximum and minimum gray in N(p)), thus
delaying the processing of the pixels around the edges. Figure 7b shows
Figure 9 cont. Segmentation results for Deformable Models (FP and BUB), SRG and
ISS applied to a micrograph of corneal endothelial sample.
Figure 9 cont. Segmentation results for Deformable Models (FP and BUB), SRG and ISS
applied to a micrograph of hematite.
Figure 9 cont. Deformable Models (FP and BUB), SRG and ISS applied to the flower
image.
Figure 9 cont. Deformable Models (FP and BUB), SRG and ISS applied to the peppers
image.
Acknowledgments
The author would like to acknowledge the CNPq - Conselho Nacional de
Desenvolvimento Científico e Tecnológico of Brazil and FAPEMIG –
Fundação de Amparo à Pesquisa de Minas Gerais, for the financial
support, and the CISE - Computer and Information Sciences and
Engineering of the University of Florida, for the technical support.
By using the merging table, relabel all pixels to reflect the absorptions they have
undergone.
References
1. R. Adams, et al. Seeded region growing. IEEE Trans. Pattern Analysis and
Machine Intelligence, 16, 6, 641-647, 1994.
2. M. C. Andrade, et al. Segmentation of microscopic images by flooding simulation:
a catchment basins merging algorithm. Proceedings of SPIE Nonlinear Image
Processing VIII, San Jose, USA, 3026, 164—175, 1997.
3. S. Beucher. Segmentation d'image et morphologie mathematique. École Nationale
Supérieure de Mines de Paris, PhD thesis, 1990.
4. S. Beucher. Watershed, hierarchical segmentation and waterfall algorithm.
Mathematical Morphology and its Applications to Image Processing, Kluwer
Academic Publishers, 69—76, 1994.
5. F. Caselles, et al. Image selective smoothing and edge detection by nonlinear
diffusion. SIAM Journal on Numerical Analysis, 29, 1, 183—193, 1992.
6. J. P. Cocquerez, S. Philipp. Analyse d' images: filtrage et segmentation. Masson,
Paris, 1995.
7. L. D. Cohen, et al. Finite element methods for active contour models and balloons
for 2D and 3D images. IEEE Trans. Pattern Analysis and Machine Intelligence, 15,
1131—1147, 1993.
8. M. Grimaud. La geodesie numerique en morphologie mathematique. Application a
la detection automatique de microcalcifications en mammographie numerique,
École Nationale Supérieure de Mines de Paris, PhD thesis, 1991
9. M. Grimaud. A new measure of contrast: the dynamics. Proceedings of SPIE.
Image Algebra and Morphological Image Processing, 1769, 292-305, 1992.
10. G. Guo, et al. Bayesian learning, global competition and unsupervised image
segmentation. Pattern Recognition Letters, 21, 107-416, 2000.
11. J. Isaac, et al. Sorting by Address Calculation. Journal of the ACM, 169—174,
1954.
12. M. Kass, et al. Snakes: Active contour models. International Journal of Computer
Vision, 1, 321-331, 1988.
13. R. Malladi, et al. Shape modeling with front propagation: a level set approach.
IEEE Trans. Pattern Analysis and Machine Intelligence, 17, 2, 158—175, 1995.
14. R. Malladi, et al. A fast level set based algorithm for topology-independent shape
modeling. Journal of Mathematical Vision, 6, 269-289, 1996.
15. A. Mehnert, et al. An improved seeded region growing algorithm. Pattern
Recognition Letters, 18, 1065-1071, 1997.
16. F. Meyer. Un algorithme optimal de ligne de partage des eaux. VIII Congrès de
Reconnaissance de Forme et d'Intelligence Artificielle. Lyon, France, 847-857, 1991.
17. F. Meyer, S. Beucher. Morphological segmentation. Journal of Visual
Communication and Image Representation, 1, 1, 21-46, 1990.
18. P. Perona, et al. Scale-space and edge detection using anisotropic diffusion. IEEE
Trans. Pattern Analysis and Machine Intelligence. 12, 7, 629—639, 1990.
19. T. Sebastian, et al. Segmentation of carpal bones from a sequence of 2D CT images
using skeletally coupled deformable models. www.lem.s.brown.edu, 2000.
20. J. A. Sethian. Tracking Interfaces with Level Sets. American Scientist, May-June 1997.
21. J. A. Sethian. Level Set Methods and Fast Marching Methods. Cambridge University Press, 2nd ed., 1999.
22. K. Siddiqi, et al. Geometric shock-capturing ENO schemes for sub-pixel
interpolation, computation and curve evolution. Graphical Models and Image
Processing, 59, 5, 278—301, 1997.
23. H. Tek, et al. Volumetric segmentation of medical images by three-dimensional
bubbles. Computer Vision and Image Understanding, 65, 2, 246—258, 1997.
24. C. Vachier. Extraction de Caractéristiques, Segmentation et Morphologie
Mathématique. École Nationale Supérieure des Mines de Paris, PhD thesis, 1995.
25. L. Vincent. Algorithmes morphologiques à base de files d'attente et de lacets.
Extension aux graphes. École Nationale Supérieure de Mines de Paris, PhD thesis,
1990.
26. L. Vincent, P. Soille. Watersheds in digital spaces: An efficient algorithm based on
immersion simulations. IEEE Trans. Pattern Analysis and Machine Intelligence, 13,
6, 583—598, 1991.
27. J. Weickert. Anisotropic diffusion in image processing. B.G. Teubner, Stuttgart,
Germany, 1998.
28. C. Xu, J. L. Prince. Snakes, Shapes, and Gradient Vector Flow. IEEE Trans. on
Image Processing, 7, 3, 359—369, 1998.
29. A. Yezzi. Modified curvature motion for image smoothing and enhancement. IEEE
Trans. on Image Processing, 7, 3, 345-352, 1998.
30. S. C. Zhu, et al. Region competition: unifying snakes, region-growing, and
bayes/MDL for multiband image segmentation, IEEE Trans. Pattern Analysis and
Machine Intelligence, 18, 9, 880—900, 1996.
CHAPTER 3
Antonio Turiel∗
Air Project - INRIA. Domaine de Voluceau BP105
78153 Le Chesnay CEDEX. France
In recent years, multifractal analysis has been applied to image analysis. The
multifractal framework takes advantage of the multiscaling properties of images to
decompose them as a collection of different fractal components, each one asso-
ciated with a singularity exponent (an exponent characterizing the way in which
that part of the image evolves under changes in scale). One of those components,
characterized by the least possible exponent, seems to be the most informative
about the whole image. Very recently, an algorithm has been proposed to recon-
struct the image from this component, using just the physical information it conveys.
In this paper, we will show that the same algorithm can be used to assess
the relevance of the other fractal parts of the image.
3.1. Introduction
Edge detection and texture classification are two main tasks in image processing,
recognition and classification.1 Extraction of edges provides information about
the objects composing the scene, sometimes allowing segmentation; edges are
thus the main source of information in the image and serve well also for classify-
ing purposes. Texture information is more subtle, concerning the patterns and reg-
ularities inside the objects, light rendering and similar features. They also provide
an important amount of information and they are especially useful in classification
and segmentation tasks.
One of the reasons to introduce the multifractal formalism in image process-
ing was to provide a unified, reasonable way to deal with edges and textures at
the same time.2 The multifractal classification splits the image in edge-like and
texture-like sets, which are arranged according to their properties under changes
∗ Present affiliation: Physical Oceanography Department. Institut de Ciències del Mar - CMIMA
(CSIC). Passeig Marítim de la Barceloneta, 37-49. 08003 Barcelona. Spain.
in scale (that is, under zooms). This approach is especially well adapted to certain
types of images (for instance, those of turbulent or chaotic nature, as multifractal-
ity arose to explain the statistical properties of turbulent flows), but a great variety
of real world scenes seem to be well described in this framework.3
There is another reason to use the multifractal formalism: due to some statis-
tical properties, one of the fractal components issued from the multifractal classi-
fication allows reconstructing the whole image. The implementation of the recon-
struction algorithm has been recently proposed.4 That reconstruction algorithm
was designed to work over the most edge-like of the fractal components (recon-
structing from edge-like structures has been explored in several contexts from
scale-space theory5 to wavelet analysis6 ). The key point is that the same algo-
rithm can potentially be applied to the other components of the multifractal de-
composition. The goal of this paper is to use this algorithm to evaluate the relative
importance of each one of those fractal components.
The paper is structured as follows: in Section 3.2, the theoretical foundations
of the multifractal framework are briefly explained and the main implications dis-
cussed. Section 3.3 shows how to apply the formalism in practice, in particular
to produce the multifractal decomposition. In Section 3.4 the reconstruction al-
gorithm is presented and its properties are discussed; next, in Section 3.5 we will
make use of it to obtain an assessment about the relevance of each fractal compo-
nent. Finally, in Section 3.6 the conclusions of our work are presented.
For the purposes of illustration, we will make use of Lena’s picture (Fig-
ure 3.1) and we will apply our techniques on it. The image presents remarkable
deviations from the multifractal scheme (for instance, it has fuzzy edges in out
of focus objects and numerous coding and processing artifacts), but it is
rather well described as a multifractal object.
The multifractal formalism was developed first in the study of turbulent flows,7 as
a way to explain the properties under changes of scale of very turbulent systems.
It has been applied to the study of different types of images by several authors,2,8
as images have some properties which resemble those of turbulent flows. We
briefly sketch here the basic concepts in the approach we are going to use; for
further details the reader is referred to.2
We will denote any image by c(x), where x denotes the vector coordinates of
the referred pixel, and it is normalized so that its average over the image vanishes,
⟨c(x)⟩_{x∈image} = 0. According to2 we define a positive measure μ as follows: for
any subset A of the image, its measure μ(A) is given by:

μ(A) = ∫_A dx |∇c|(x)    (3.1)
that is, the measure assigns a weight to the set A equal to the sum of the absolute
variations of the image over it. Texturized areas will contribute with larger weights
to the measure μ than flatly illuminated, smooth surfaces. In fact we will not be
interested in the value of the measure over sets of fixed size, but in its evolution
under changes in scale (resolution) around each point. Given a collection of balls
B_r(x) of radii r and center x, we will say that the measure μ is multifractal if:

μ(B_r(x)) = α(x) r^{2+h(x)}    (3.2)

for r's small enough. The exponent h(x) is called the local singularity exponent,
and characterizes the way in which the image behaves under changes in the size
parameter r at the particular point xᵃ. As we consider small r's, the largest values of
the measures μ(Br (x)) correspond to the smallest values of the exponents h(x).
For that reason, we will be specially interested in negative singularity exponents,
which are found at pixels which contribute strongly to the measure by themselves
(take into account that we consider very small radii). One of the advantages of this
definition is that what determines the value of h(x) is not the absolute variation
of c(x) at the point x, but its relative importance compared to the variations at the
surrounding points: multiplying c(x) by a constant modifies α(x) in eq. (3.2), but
leaves h(x) unchanged. The classification of points is accordingly local, in con-
trast to global thresholding techniques.
Natural images, that is, real world scenes of “natural” objects, are of multi-
fractal character,2,9 which has been tested for a large variety of scenes3 and even
with color images.10 This property is far from trivial, and accounts for a spe-
cial arrangement of edges and textures in images. In the following, we will only
discuss on this type of images, although the same methods could be applied to
other as well. Assessment of multifractality on real, digitized images can not be
easily performed by a direct application of eq. (3.2) because of several technical
reasons: some interpolation mechanism should be devised to take into account
non-integer radii, for instance (there may also be undesirable long-range effects
which should be filtered; see2 for a full discussion). In order to obtain a good
evaluation of the singularity exponents, singularity analysis via wavelet analysis11
should be performed. Wavelet analysis is a quite straightforward generalization of
the scaling measurements in eq. (3.2): instead of applying the measure over finite
size balls of radii r, a convolution of the measure μ with a scaled version of a
wavelet Ψ is computed. More precisely, the wavelet projection TΨ μ(x, r) of the
measure μ at the point x and the scale r is defined as:
T_Ψ μ(x, r) = (1/r²) ∫ dy |∇c|(y) Ψ((x − y)/r)    (3.3)
The measure μ is multifractal (in the sense of eq. (3.2)) if and only if:

T_Ψ μ(x, r) = α_Ψ(x) r^{h(x)}    (3.4)

for small scale parameters r. Notice that α_Ψ is in general dependent on the wavelet
ᵃ The prefactor (2 in our case) in the definition of the singularity exponent, eq. (3.2), is conventionally
set to the dimension of the embedding space. This normalization allows comparing results from
subspaces of different dimensions: the value of h(x) becomes independent of the dimension of the
space.
Ψ and the measure μ, but the scaling exponent h(x) has exactly the same value
as in eq. (3.2) and only depends on μ, that is, on the image c(x)ᵇ.
From the theoretical point of view, the choice of the particular wavelet Ψ is
irrelevant for the determination of the exponents h(x); it can even be chosen as
a positive functionᶜ. However, on practical grounds there are wavelets which
resolve the finer structures better than others. In Figure 3.2 we show the repre-
sentations of the multifractal classifications for four different wavelets. We will
discuss the choice of the wavelet further in Section 3.3.
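As a rough numerical illustration of how eq. (3.4) is used in practice, the sketch below (Python/SciPy, not from the original paper) estimates h(x) as the log-log slope of wavelet projections computed at several scales; the projection of eq. (3.3) is approximated here by an L1-normalized Gaussian smoothing of |∇c| (a Ψ3-like positive wavelet), and the choice of radii, the regularizing constant and the name singularity_exponents are assumptions.

import numpy as np
from scipy import ndimage

def singularity_exponents(c, radii=(1, 2, 4, 8)):
    """Rough estimate of h(x): fit the log-log slope of the wavelet
    projections of eq. (3.3) across scales, approximated here by
    L1-normalized Gaussian smoothings of |grad c|."""
    gy, gx = np.gradient(c.astype(float))
    density = np.hypot(gx, gy)                      # |grad c|(x)
    logs = np.stack([np.log(ndimage.gaussian_filter(density, sigma=r) + 1e-12)
                     for r in radii])
    logr = np.log(np.asarray(radii, dtype=float))
    logr = logr - logr.mean()
    # Least-squares slope at each pixel: h(x) = cov(log r, log T) / var(log r)
    h = np.tensordot(logr, logs - logs.mean(axis=0), axes=(0, 0)) / (logr**2).sum()
    return h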
Multifractal classification of points is the first stage of the multifractal decompo-
sition of images (which justifies the name “multifractal” for the method). Points in
the image can be arranged in fractal components, each one associated to a value
for the singularity exponent. Namely, the fractal component Fh0 associated to the
exponent h0 is given by:
F_{h_0} = { x ∈ image | h(x) = h_0 }    (3.5)
As the measure verifies to be multifractal, every point in the image can be as-
sociated with a particular singularity exponent, so the image can be decomposed as the
union of all its fractal components. They are indeed fractal sets,2 their dimensions
being connected with statistical properties of images.12 The most interesting of
those fractal components is the Most Singular Manifold (MSM),9 which is the
fractal component associated to the least possible exponent. This set is usually
related to the edges present in the image.2 The least possible exponent is usually
denoted h∞ and its associated manifold Fh∞ is generally denoted F∞ in short.
Ψ1(x) = 1 / (1 + r²)

Ψ2(x) = dΨ1/dr (x) = −2r / (1 + r²)²

Ψ3(x) = e^{−r²/2}

Ψ4(x) = d²Ψ3/dr² (x) = (r² − 1) e^{−r²/2}

with r = |x|.
Each one of those wavelets fits best a particular application. The Lorentzian
wavelet (Ψ1) is a positive wavelet of slow decay at infinity. It is very good at re-
solving sharp (negative) singularities in the measure μ (good spatial localization),
but it has the drawback of being unable to distinguish all the singularities beyond
h = 0 (it returns the value h = 0 for all of them); besides, it cannot be used to
analyze the signal c(x) directly (a certain number of vanishing moments would
be required2,13). The Gaussian wavelet (Ψ3) cannot be used over the signal
itself either, as it is also positive, but having fast-decaying tails it is able to resolve the
whole range of singularities (typically between −1 and 2, see2); the drawback
is a worse spatial localization, especially for the MSM. The second derivative of
the Gaussian (Ψ4) is, from the theoretical point of view, the best possible choice
for analyzing signals: it resolves the whole range of values of h(x) and it can
even be used over the signal itself, without the necessity of constructing a measure.
However, in practice it has very poor spatial localization, associated with an inner
minimum scale of several pixels, necessary to separate positive from negative ex-
trema in wavelet projections. The best choice in practice is then the derivative of
the Lorentzian, Ψ2.
Fig. 3.2. Multifractal decompositions on Lena’s image for Lorentzian wavelet and its derivative (top)
and Gaussian wavelet and its second derivative (bottom) (see Section 3.5). The smaller is the singular-
ity exponent, the brighter is the point.
Fig. 3.3. Multifractal decompositions on Lena’s image. From left to right: Lorentzian wavelet, its
derivative, Gaussian wavelet and its second derivative. From top to bottom: excluded manifolds,
MSMs, second MSMs, third MSMs, fourth MSMs and fifth MSMs.
Once the value of h∞ has been obtained, we isolate the MSM, defining it as the
set of points x for which h∞ − Δh ≤ h(x) < h∞ + Δh, with a conventionally fixed
value of the dispersion Δh; in the following we take Δh = 0.15. We represent
the other fractal manifolds according to the given dispersion as well, so the nth MSM
will be the set of points x for which h∞ + (2n−3)Δh ≤ h(x) < h∞ + (2n−1)Δh
(the MSM itself is the first MSM). Finally, we define the manifold of excluded
points, or excluded manifold, as the set of points x such that h(x) < h∞ − Δh, that
is, which are more singular than expected. Those events are generally associated
with the borders of the image and some particular events, which happen to have
singularities close to −1, typical of isolated edges.2 In Figure 3.3, we show all
those manifolds.
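Given a map of singularity exponents h(x) and the value of h∞, this decomposition reduces to the binning sketched below (Python/NumPy; the names fractal_components and comp are illustrative).

import numpy as np

def fractal_components(h, h_inf, dh=0.15):
    """Return a map of component indices: -1 for the excluded manifold
    (h < h_inf - dh), 1 for the MSM, and n for the n-th MSM, following
    h_inf + (2n-3)*dh <= h(x) < h_inf + (2n-1)*dh."""
    comp = np.floor((h - h_inf + dh) / (2 * dh)).astype(int) + 1
    comp[h < h_inf - dh] = -1                     # excluded manifold
    return comp

# Example: the MSM itself is the set comp == 1,
# i.e. h_inf - dh <= h(x) < h_inf + dh.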
Recently, an algorithm to reconstruct the whole image from the most singular of
its fractal components has been proposed.4 We will not go into details about the
reconstruction algorithm; we will just present the final formula and discuss it. The
reader is referred to the original paper.
The reconstruction formula intends to reproduce the whole image from the
value of the gradient field over the MSM. First, let us define the essential gradient
over a general set F . We define it as a vector function which is only different from
zero over the set F, namely:

v_F(x) ≡ ∇c(x) δ_F(x)    (3.6)

where the symbol δ_F stands for a delta function on the set F. The reconstruction
algorithm is given by the following expression:

c(x) = (g ⊗ v_{F∞})(x)    (3.7)

where ⊗ stands for the convolution and the reconstructing kernel g is given in the
Fourier space by the following expression:
ĝ(f) = i f / |f|²    (3.8)

In the above expression, the symbol ˆ stands for the Fourier transform, f is
the spatial frequency (the variable in the Fourier domain) and i ≡ √−1. The
reconstruction formula states that it is possible to retrieve the image from the
essential gradient associated to the MSM F∞ . Note, however, that the formula
could be applied to any set F ; we will denote by cF the image retrieved from the
essential gradient associated to the set F; namely:

c_F(x) = (g ⊗ v_F)(x)    (3.9)

We will call eq. (3.9) the generalized reconstruction formula. In this language,
the reconstruction formula states that cF∞ = c. The generalized reconstruction
formula has some nice properties.
• It is linear in the reconstructing data: if the set F is the disjoint union of two
sets F1 and F2 (i.e., F = F1 ∪ F2, with F1 ∩ F2 = ∅), then c_F = c_{F1} + c_{F2}.
This comes from the fact that v_{F1∪F2} = v_{F1} + v_{F2} if the sets are disjoint, and
from the linearity of the convolution product.
• There always exists a set from which reconstruction is perfect: if F = ℝ², that is,
the whole image, then v_F = ∇c; but since the Fourier transform of ∇c equals −i f ĉ(f)
and taking into account the definition of g, trivially c_F = c.
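Since eq. (3.9) is a convolution, c_F can be evaluated directly in the Fourier domain; the sketch below does so with NumPy FFTs, assuming periodic boundaries and a finite-difference gradient, and the sign and 2π factor in the kernel merely translate ĝ(f) = i f/|f|² of eq. (3.8) into NumPy's transform convention (reconstruct_from_set and mask are illustrative names).

import numpy as np

def reconstruct_from_set(c, mask):
    """Generalized reconstruction c_F = g (*) v_F of eq. (3.9): keep the
    gradient of c only on the set F (boolean `mask`) and convolve it with
    the reconstruction kernel g of eq. (3.8), evaluated in Fourier space."""
    gy, gx = np.gradient(c.astype(float))
    vx, vy = gx * mask, gy * mask                    # essential gradient v_F
    fy = np.fft.fftfreq(c.shape[0])[:, None]
    fx = np.fft.fftfreq(c.shape[1])[None, :]
    f2 = fx**2 + fy**2
    f2[0, 0] = 1.0                                   # avoid dividing by zero at f = 0
    cf_hat = -1j * (fx * np.fft.fft2(vx) + fy * np.fft.fft2(vy)) / (2 * np.pi * f2)
    cf_hat[0, 0] = 0.0                               # reconstructed image has zero mean
    return np.real(np.fft.ifft2(cf_hat))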
We will assess the relative importance of the fractal manifolds
by means of the generalized reconstruction formula. In Figure 3.4 we show the
different images reconstructed from the manifolds presented in Figure 3.3 using
eq. (3.9); in Table 3.1 the associated PSNRs can be found. We see that the MSM
always provides the greatest amount of information about the image, which is
reflected both by visual inspection and the values of the PSNR. However, the
3.6. Conclusions
Fig. 3.4. Reconstruction images from the sets represented in Figure 3.3.
Table 3.1. PSNRs (in dB) for the reconstructed images represented in Figure 3.4, for the four wavelets Ψ1, Ψ2, Ψ3 and Ψ4.
the modern techniques of ridgelets and curvelets,14 which have been shown to be
very efficient for image coding.
In order to implement compressing techniques using the reconstruction algo-
rithm, high performance reconstructing sets should be extracted from images. The
technique of singularity classification is a good first approach to obtain that set,
Fig. 3.5. Accumulated reconstructed images, from the reconstructed images in Figure 3.4.
but the multifractal model is just approximate for general real world images (it
was derived for a subset of so-called natural scenes) and so the MSM is just an
Table 3.2. PSNRs (in dB) for the accumulated reconstructed images represented in Figure 3.5, for the four wavelets Ψ1, Ψ2, Ψ3 and Ψ4.
however, other methods for the extraction of the reconstructing set need to be
devised.
Fig. 3.6. Left: MSM with Lorentzian wavelet, h∞ = −0.5 ± 0.2. Right: reconstructed image
(PSNR=24.52 dB).
Acknowledgements
References
CHAPTER 4
4.1. Introduction
Active Contours (or snakes) are a low-level processing technique widely used to
extract boundaries in many pattern recognition applications.1 In this paper, an im-
proved snake is proposed to recognise muscles in MRI sequences of Iberian ham
in different maturation stages. In the next subsections, an overview of the Active
Contours is presented, and the relationship with the field of Food Technology
is described. In addition, the algorithm design is presented in section 2, and the
obtained results are discussed in section 3. Conclusions are given in section 4.
Deformable models are curves defined within an image domain that can be moved
under the influence of internal forces, which are defined within the curve or sur-
face itself, and external forces, which are computed from the image data. The
internal forces are designed to keep the model smooth during deformation. The
external forces are defined to move the model toward an object boundary or other
desired features within an image.2
Energy-minimising Active Contour models were proposed by Kass et al.3
They formulated a model using an energy function. They developed a controlled
continuity spline which can be operated upon by internal contour forces, image
forces, and external forces which are supplied by an interactive user, or potentially
by a higher level process. The goal was to obtain a local minimum that seems
most useful to that process or user. An algorithmic solution involves derivation of
this objective function and optimisation of the derived equation for finding an ap-
propriate solution. However, in general, variational approaches do not guarantee
global optimality of the solution.4
Amini et al.4 also proposed a dynamic programming algorithm for minimis-
ing the functional energy that allows addition of hard constraints to obtain a more
desirable behaviour of the snakes. However, the proposed algorithm is slow, hav-
ing a great complexity O(nm3 ), where n is the number of points in the contour
and m is the size of the neighbourhood in which a point can move during a single
iteration.4,5
Cohen5 proposed an additional force that made the curve behave like a balloon
which is inflated by this new force. On the other hand, Williams and Shah6 de-
veloped a Greedy algorithm which has performance comparable to the Dynamic
Programming and Variational Calculus approaches. They presented different for-
mulations for the continuity term, and they examined and evaluated several ap-
proximations for the curvature term. The proposed approach was compared to the
original Variational Calculus method of Kass et al. and the Dynamic Program-
ming method developed by Amini et al. and found to be comparable in the final
results, while having less computational cost than Dynamic Programming (lower
complexity) and being more stable and flexible for including hard constraints than
the Variational Calculus approach.
Kichenassamy7 presented a new Active Contour and surface model based on
novel gradient flows, differential geometry and curve and surface evolutions. This
led to a novel snake paradigm in which the feature of interest may be considered
to lie at the bottom of a potential well.
In addition, Radeva et al.8 proposed new approaches incorporating the gradi-
ent orientation of image edge points, and implementing a new potential field and
external force in order to provide a deformation convergence, and attraction by
both near and far edges.9
McInerney and Terzopoulos10 also developed a parametric snake model that
had the power of an implicit formulation by using a superposed simplicial grid to
quickly and efficiently reparameterise the model during the deformation process.
To reduce the problems caused by convergence to local minima, some authors
have proposed simulated annealing as well as multiscale methods.11 Pérez et al.12
presented a new technique to construct Active Contours based on a multiscale
representation using wavelet basis. Another approach to deal with this problem
was proposed by Giraldi et al.13 They presented the Dual Active Contour Model,
which consisted basically in comparing one contour that expands from inside the
target feature, and another one which contracts from the outside. The two contours
were interlinked to drive the contour out of local minima, making the solution less
sensitive to the initial position.
Caselles et al.14 proposed a Geodesic Active Contour model based on energy
minimisation and geometric Active Contours based on the theory of curve evo-
lution. They proved that a particular case of the classical energy snake model is
equivalent to finding a geodesic or minimal distance path in a Riemannian space
with a metric derived from the image content. This means that under a specific
framework, boundary detection can be considered equivalent to finding a path of
minimal weighted length via an Active Contour model based on geodesic or local
minimal distance computation. Nevertheless, no method has been proposed for
finding the minimal paths within their Geodesic Active Contour model.15 Gold-
enberg et al.16 proposed a new model, using an unconditionally stable numerical
scheme to implement a fast version of the geodesic Active Contour model.
Xu and Prince17 developed a new external force for Active Contours, which
they called Gradient Vector Flow. This new force was computed as a diffusion
of grey-level gradient vector of a binary edge map derived from the image. The
corresponding snake was formulated directly from a force balance condition rather
than a variational formulation.18
Ballerini19 proposed an energy minimisation procedure based on Genetic Al-
gorithms. These Genetic Algorithms operate on the position of the snake, and
their fitness function is the total snake energy. A modified version of the image
energy was used, considering both the magnitude and the direction of the gradi-
ent and the Laplacian of Gaussian, though the region of interest is defined by an
external user.
Park and Keller20 presented a new approach that combines Dynamic Program-
ming and the watershed transformation, calling it the Watersnake. The watershed
transformation technique is used to decide what points are needed, in order to
eliminate unnecessary curves while keeping important ones.
the Food Technology is still recent and it is confined to research purposes.
Cernadas et al.26–28 analyse MR images of raw and cured Iberian loin to clas-
sify genetic varieties of Iberian pigs and to predict the intramuscular fat content.
The results are promising for its application to ham.
The loin is a uniform and simple muscle, and this is a very important advan-
tage compared with the great number and complex distribution of the muscles of the
ham, which is a significant drawback.
In a previous work,31 classical snakes (mainly the greedy algorithm) were
applied to ham MRI sequences to extract boundaries of the Biceps Femoris mus-
cle. Although the obtained results were nearly satisfactory, the method lacks
robustness for other muscles. This is one of the reasons why the
Quadriceps muscle has been studied in this paper too. An enhanced Active Con-
tour approach is proposed, based on the use of potential fields as external force
and on improvements of the standard greedy algorithm that take into account the
peculiarities of the particular environment.
This new method is applied over a database of specific MR images from Food
Technology, particularly Iberian ham images obtained at four different maturation
stages (raw, post-salting, semi-cured and cured ham). Deformable Models are
used to achieve the extraction of different muscles (Biceps Femoris and Quadri-
ceps), studying their volume changes during the ripening of Iberian ham. The
verification of the presented approach is shown examining these muscles, and the
obtained practical results may allow us to design a methodology to optimise the
ripening process.
Deformable Models (Active Contours, or Snakes) are curves that can be moved
due to the influence of internal and external forces.1 These forces are defined so
that the snake can detect the image objects in which we are interested.29 Active
Contours are defined by an energy function. By minimising this energy function,
the contour converges, and the solution is achieved.
An Active Contour is represented by a vector, v, which contains all of the n
points of the snake. The functional energy of this snake is given by:

E_snake = ∑_{i=1}^{n} [ α(i) E_cont(v_i) + β(i) E_curv(v_i) + γ(i) E_image(v_i) ]
The internal forces of Deformable Models are designed to hold the curve together
(elasticity forces, i.e. Econt ) and to keep it from bending too much (bending
forces, i.e. Ecurv ). Typically, the external forces are defined as a gradient of
a potential function. Both internal and external forces attempt to drive the curve
toward the edges (object boundary) or other desired features within an image. Un-
fortunately, the initial snake often needs to be placed near the searched border.
Furthermore, Active Contours have difficulties progressing into concave bound-
ary regions. Thus, selecting external forces that solve these problems is highly
recommended.
One of the proposed ideas in this work consists in creating potential fields,
using them instead of traditional external forces. The purpose of building these
potential fields is to move the points of the contours toward the object boundary,
not only when they are situated close to the borders, but even when they are not
located near to the edges. A traditional potential force cannot attract distant points
or either moves them into concave boundary regions, being these two key diffi-
culties with standard Active Contour algorithms. A potential field is developed
for solving these problems, and it is presented in this section. Capture range for
snakes has been extended, and concave regions could be explored using this new
field. These are the main advantages of using this field as an external force for the
Active Contour.
The potential fields are computed with a two-step algorithm. The algorithm is
described as follows:
As a first stage, edge map images are necessary before computing the potential
field, in order to determine the object initial boundary. These primary borders will
be used to increasingly grow the potential field.
A 7x7 Gaussian filter has been used to smooth the images. The filter size is
either 13x13 or 15x15. The goal is to smooth the images converting similar tex-
tures in homogeneous grey levels, avoiding dissimilarities. A 3x3 Sobel operator
is applied, obtaining the edge images.
Although the edge images appear almost black (except for the edges, which are
shown in a light white colour), they contain a great variety of data. This extra
information is found in dark grey levels and needs to be equalised to obtain an
adequate binary image. The equalisation process converts the grey levels of the
edges to values close to 255. After that, the images are converted to binary using a
threshold, calculated as the grey level which divides the histogram into two parts:
black (80% of the total pixels) and white (the remaining 20%).
This bi-level image is used as an edge map to grow the potential field, so removing
all groups of isolated pixels is desirable. These groups of noisy pixels can seriously
affect the potential field, producing local convergence of the snake algorithm (the
global minimum would not be assured). Eliminating islands of pixels is therefore
an important task of the pre-processing stage. A recursive seed-growing process is
used to find islands of pixels whose size (number of pixels) is lower than a given
value (48 or 96 pixels, depending on the image).
Therefore, the original image has been filtered, equalized, converted to binary
level and processed to eliminate the undesirable noise, just before the potential
field is computed (see Figure 4.1).
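As an illustration only, the pre-processing chain just described (Gaussian smoothing, 3x3 Sobel edges, equalisation, 80/20 histogram thresholding and removal of small islands of pixels) can be sketched in Python as follows. The OpenCV routines, the kernel size and the island-size limit are assumptions chosen to mirror the text, not the authors' original implementation.

import cv2
import numpy as np

def edge_map(image, blur_ksize=7, min_island=48):
    """Build the binary edge map used to grow the potential field (a sketch)."""
    # 1. Smooth similar textures into homogeneous grey levels.
    smoothed = cv2.GaussianBlur(image, (blur_ksize, blur_ksize), 0)

    # 2. 3x3 Sobel operator -> gradient magnitude as the edge image.
    gx = cv2.Sobel(smoothed, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(smoothed, cv2.CV_32F, 0, 1, ksize=3)
    edges = cv2.magnitude(gx, gy)
    edges = cv2.normalize(edges, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # 3. Equalise so that faint (dark grey) edges move towards 255.
    edges = cv2.equalizeHist(edges)

    # 4. Binarise at the grey level that leaves 80% of the pixels black.
    threshold = np.percentile(edges, 80)
    binary = (edges > threshold).astype(np.uint8)

    # 5. Remove islands of isolated pixels smaller than min_island.
    n_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    for lbl in range(1, n_labels):
        if stats[lbl, cv2.CC_STAT_AREA] < min_island:
            binary[labels == lbl] = 0
    return binary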
Fig. 4.2. The calculated potential field between two boundary points.
In this way, images containing potential field magnitudes are calculated: the
potential field is computed for each point of the image, producing a new image
of the same dimensions as the original that contains the potential field value of
every pixel.
Contour initialisation is one of the main problems of Deformable Models: the
snake must evolve until it finds the searched object. An automatic algorithm has
been developed to place an initial contour inside the images.
For its realisation, the potential field image is used: the points of the image with
the smallest potential field values are searched for, and the points of the contour
are distributed so that they surround them. In this manner, it is ensured that the
snake will evolve towards the edges of the object, moving to points whose energy
is smaller than that of the points of the initial snake.
While the contour is being deformed, another difficulty can arise: some points of
the contour may be attracted to the same place and cross over their trajectories
(Figure 4.3.a). This is highly undesirable, because large numbers of nodes located
close together do not add significant information to the recognition task.
Moreover, contours with points that cross over their trajectories (Figure 4.3.b)
would be useless. The goal is to distribute all the nodes of the snake in such a way
that they determine the object contour as well as possible. A procedure has been
added to eliminate the closest nodes and insert new points between the most
distant ones (Figure 4.3.c).
Figure 4.3 shows a 7-point contour. Points 3 and 4 cross over their trajectories
during the evolution of the curve (Figure 4.3.a), producing an undesirable snake
(Figure 4.3.b). The algorithmic improvement removes one of these two points when
they get too close (Figure 4.3.c) and adds a new point in the middle of the largest
segment (a new point is inserted between points 1 and 7 of the initial configuration,
and all points are renumbered).
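A sketch of this redistribution step is given below; the distance threshold and the plain Euclidean criterion are assumptions made for illustration, not the authors' exact procedure.

import numpy as np

def redistribute_nodes(snake, min_dist=2.0, max_passes=100):
    """Keep the number of snake points constant while avoiding crossings.

    When two consecutive nodes get closer than min_dist one of them is
    removed, and a new node is inserted at the midpoint of the longest
    segment of the contour (a sketch of the behaviour in Fig. 4.3).
    """
    pts = [np.asarray(p, dtype=float) for p in snake]
    for _ in range(max_passes):
        moved = False
        for i in range(len(pts)):
            j = (i + 1) % len(pts)
            if np.linalg.norm(pts[i] - pts[j]) < min_dist:
                del pts[j]                          # drop one of the two close nodes
                seg = [np.linalg.norm(pts[k] - pts[(k + 1) % len(pts)])
                       for k in range(len(pts))]
                k = int(np.argmax(seg))             # longest remaining segment ...
                mid = (pts[k] + pts[(k + 1) % len(pts)]) / 2.0
                pts.insert(k + 1, mid)              # ... split at its midpoint
                moved = True
                break
        if not moved:
            break
    return np.array(pts)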
Studying the evolution of the Iberian ham muscles during the ripening process is
one way to confirm the practical viability of the proposed approach. Muscle
recognition could be used for determining the fat content and its distribution, as
well as for studying how the hams evolve during their maturation process.
The presented research is based on MRI sequences of Iberian ham images.
One of the images of these sequences is shown in Figure 4.4.a. A technique to
recognise the main muscle structures (Biceps Femoris and Quadriceps) is em-
ployed. Four Iberian hams have been scanned, in four stages during their ripening
time.
The images have been acquired using an MRI scanner provided by the "Infanta
Cristina" Hospital in Badajoz (Spain). The MRI volume data set is obtained from
sequences of T1 images with a FOV (field-of-view) of 120x85 mm and a slice
thickness of 2 mm, i.e. a voxel resolution of 0.23x0.20x2 mm. The total number
of images of the obtained database is 336 for the Biceps Femoris, and 448 for the
Quadriceps muscle.
As a previous step, a pre-processing stage is introduced, in order to compute
the potential field values (Figure 4.4.b and 4.4.c). Therefore, images containing
potential field magnitudes have been calculated.
In addition, the initial snakes for the central images of the sequence have also been
calculated beforehand (Figure 4.4.d). When the final snake for such an image has
been achieved, this final contour is automatically modified, and a scaled version
(the same contour, but smaller) of the final snake is selected as the first contour for
the immediately preceding and succeeding images.
Once the complete database of images and the initial values of the snakes for
these images are set, the application of Active Contours to compute the area of the
muscle is needed. The greedy algorithm runs over the central image. The snake is
initialised with the computed values; the algorithm then iterates until it finishes,
and the final snake for this image is reached (Figure 4.4.e). This snake
determines the area of the muscle over the image.
The next step applies this final snake of the central image as the initial snake for
the following image, as previously mentioned. In this manner, the final snake that
can be used as the initial one for the next image of the sequence is obtained.
Similarly, the final snake achieved in the central image can be used as the initial
snake for the previous image, and so on.
The final step computes areas and volumes for the extracted muscles (Figure 4.4.f),
calculating the surface enclosed by the final snake obtained for each image.
Fig. 4.4. Algorithm design for the practical application. (a) Original image. (b) Map image. (c)
Potential field. (d) Initial snake. (e) Final snake. (f) Area of the muscle.
Fig. 4.5. Initial (a) and final (b) snake for the Biceps Femoris muscle, and initial (c) and final (d)
snake for the Quadriceps muscle.
4.4. Conclusions
Using potential fields as external forces is a suitable solution for Deformable Models:
combined with the algorithmic improvements, this new external force allows snakes
to be initialised far from the searched border. The redistribution of the snake points
during the deformation stage, the elimination of groups of isolated pixels in the
pre-processing stage and the use of scaled versions of the final snakes as initial
snakes for consecutive images constitute important and valid improvements.
Fig. 4.6. Biceps Femoris (a) and Quadriceps (b) muscle size evolution during the ripening time.
Acknowledgements
References
14. V. Caselles, R. Kimmel and G. Sapiro, Geodesic Active Contours, International Jour-
nal of Computer Vision, 22(1):61-79 (1997).
15. C. Han, T.S. Hatsukami, J.N. Hwang and C. Yuan, A Fast Minimal Path Active Con-
tour Model, IEEE Transactions on Image Processing, 10(6):865-873 (2001).
16. R. Goldenberg, R. Kimmel, E. Rivlin and M. Rudzsky, Fast Geodesic Active Contours,
IEEE Transactions on Image Processing, 10(10):1467-1475 (2001).
17. C. Xu and J.L. Prince, Gradient Vector Flow: A New External Force for Snakes, IEEE
Proc. On Computer Vision and Pattern Recognition, 1:66-71 (1997).
18. C. Xu and J.L. Prince, Snakes, Shapes, and Gradient Vector Flow, IEEE Transactions
on Image Processing, 1:359-369 (1998).
19. L. Ballerini, Genetic snakes for medical images segmentation, Lecture Notes in Com-
puter Science, 1596:59-73 (1999).
20. J. Park and J.M. Keller, Snakes on the Watershed, IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence, 23(10):1201-1205 (2001).
21. T. Antequera, C.J. López-Bote, J.J. Córdoba, C. García, M.A. Asensio, J. Ventanas and Y.
Díaz, Lipid oxidative changes in the processing of Iberian pig hams, Food Chemistry,
54:105 (1992).
22. R. Cava and J. Ventanas, Dinámica y control del proceso de secado del jamón ibérico en
condiciones naturales y cámaras climatizadas, Tecnología del jamón ibérico, Ed. Mundi
Prensa, 1:260-274 (2001).
23. E. Cernadas, M.L. Durán and T. Antequera, Recognizing Marbling in Dry-Cured
Iberian Ham by Multiscale Analysis, Pattern Recognition Letters (2002).
24. M.L. Durán, E. Cernadas, A. Caro and T. Antequera, Clasificación de distintos tipos de
jamón ibérico utilizando Análisis de Texturas, Revista Electrónica de Visión por
Computador, 5 (2001).
25. M.L. Durán, A. Caro, E. Cernadas, A. Plaza and M.J. Petrón, A fuzzy schema to evalu-
ate fat content in Iberian pig meat images, V Ibero American Symposium on Pattern
Recognition, 1:207-216 (2000).
26. E. Cernadas, M.L. Durán, P.G. Rodríguez, A. Caro, E. Muriel and R. Palacios, Esti-
mating intramuscular fat content of cured Iberian loin using statistical analysis of its
magnetic resonance images, Portuguese Conf. on Pattern Recognition (2002).
27. E. Cernadas, A. Plaza, P.G. Rodríguez, M.L. Durán, J. Hernández, T. Antequera, R.
Gallardo and D. Villa, Estimation of Dry-Cured Iberian Ham Quality Using Magnetic
Resonance Imaging, The 5th International Conference on Applications of Magnetic
Resonance in Food Science, 1:46-47 (2000).
28. E. Cernadas, T. Antequera, P.G. Rodríguez, M.L. Durán, R. Gallardo and D. Villa, Mag-
netic Resonance Imaging to Classify Loin from Iberian Pigs, Magnetic Resonance in
Food Science - A View to the Next Century, 1:239-254. Ed. The Royal Society of
Chemistry (2001).
29. J.M. Bonny, W. Laurent, R. Labas, R. Taylor, P. Berge and J.P. Renou, Magnetic
Resonance Imaging of connective tissue: a non-destructive method for characterising
muscle structure, Journal of the Science of Food and Agriculture, 81:337-341 (2000).
30. S. Ranganath, Contour Extraction from Cardiac MRI Studies Using Snakes, IEEE
Transactions on Medical Imaging, 14:328-338 (1995).
31. A. Caro, P.G. Rodríguez, E. Cernadas, M.L. Durán, E. Muriel and D. Villa, Computer
Vision Techniques Applying Active Contours to Muscle Recognition in Iberian Ham
CHAPTER 5
5.1. Introduction
The field of off-line handwriting recognition has been a topic of intensive research
for many years. First only the recognition of isolated handwritten characters was
investigated,1 but later whole words2 were addressed. Most of the systems re-
ported in the literature until today only consider constrained recognition prob-
lems based on small vocabularies from specific domains, e.g. the recognition of
handwritten check amounts3 or postal addresses.4 Free handwriting recognition,
without domain-specific constraints and with large vocabularies, was addressed
only recently in a few papers.5,6 The recognition rate of such systems is still low, and
there is a need to improve it.
The combination of multiple classifiers has become a very active area of re-
search recently.7,8 It has been demonstrated in a number of applications that using
more than a single classifier in a recognition task can lead to a significant im-
provement of the system’s overall performance. Hence multiple classifier systems
seem to be a promising approach to improve the recognition rate of current hand-
writing recognition systems. Concrete examples of multiple classifier systems in
class specific error rates in the combination method, as it was proposed in,23 is no
longer feasible because of its low reliability in case of a high number of classes.
Further constraints on possible combination schemes are imposed by the use of
HMMs as base classifiers. In our framework, only the class on the first rank to-
gether with its score is returned by each individual HMM classifier. Therefore,
Borda count, as well as sum, product, and median rule can’t be applied. Yet
weighted voting is feasible for this problem. It is, in fact, the most general form
of classifier combination available in the proposed framework.
In weighted voting, each classifier has a single vote for its top ranked class,
and this vote is given a weight. To derive the final decision in a multiple classifier
system using weighted voting, the weights assigned to each class by the different
classifiers are summed up and the class with the highest score is selected as the fi-
nal result. Under a weighted voting scheme, the weights assigned to the individual
classifiers are free parameters. Sometimes these weights are chosen proportional
to the recognition performance of individual classifiers. In this paper, we apply
a more general approach where the weights are considered as parameters which
are to be selected in such a way that the overall performance of the combined sys-
tem is optimized. A genetic algorithm is used to actually determine an optimal
(or suboptimal) combination of weight values. Also in29 a genetic algorithm was
used for weight optimization in a multiple classifier system. However, an easier
recognition problem was considered there, i.e. the application was the recognition
of handwritten digits and the combined classifiers were not created by an ensem-
ble creation method, but were each separately designed by hand. In30 a genetic
algorithm was used for the selection of a subset of classifiers from an ensem-
ble, which is equivalent to weight optimization using only the weights 0 and 1.
Another application of a genetic algorithm in a multiple classifier framework has
been proposed in.16 In this work, a genetic algorithm was used to select individual
classifiers from a pool for the different modules of a multiple classifier framework.
The remainder of this paper is organized as follows. In Section 5.2 our base
classifier, which is a handwritten word recognizer based on hidden Markov Mod-
els (HMMs), is introduced. The following section describes the methods used
to produce classifier ensembles from the base classifier. Then the classifier com-
bination schemes used in this work are introduced in Section 5.4. The genetic
algorithm for the calculation of the weights applied in the weighted voting com-
bination scheme is described in Section 5.5. In Section 5.6 experimental results
comparing genetic weight optimization with other combination schemes are pre-
sented. Finally the last section draws conclusions from this work.
Fig. 5.1. Architecture of the handwritten text recognizer: the image of a handwritten page is preprocessed into normalized word images, feature extraction converts them into sequences of feature vectors, and the HMM produces the recognition result.
The basic handwritten text recognizer used in the experiments of this paper is
similar to the one described in.6 It follows the classical architecture and consists
of three main modules (see Fig. 5.1): the preprocessing, where noise reduction
and normalization take place, the feature extraction, where the image of a hand-
written text is transformed into a sequence of numerical feature vectors, and the
recognizer, which converts these sequences of feature vectors into a word class.
The first step in the processing chain, the preprocessing, is mainly concerned
with text image normalization. The goal of the different normalization steps is to
produce a uniform image of the writing with less variations of the same character
or word across different writers. The aim of feature extraction is to derive a se-
quence of feature vectors which describe the writing in such a way that different
characters and words can be distinguished, but avoiding redundant information as
much as possible. In the presented system the features are based on geometrical
measurements. At the core of the recognition procedure is an HMM. It receives
a sequence of feature vectors as input and outputs a word class. In the following
these modules are described in greater detail. In the Appendix a small subset of
Fig. 5.2. Preprocessing of the images. From left to right: original, skew corrected, slant corrected
and positioned. The two horizontal lines in the rightmost picture are the two baselines.
the words used in the experiments described in Section 5.6 are shown.
5.2.1. Preprocessing
Each person has a different writing style with its own characteristics. This fact
makes the recognition task complicated. To reduce variations in the handwritten
texts as much as possible, a number of preprocessing operations are applied. The
input for these preprocessing operations are images of words extracted from the
database described in.31,32 In the presented system the following preprocessing
steps are carried out:
• Skew Correction: The word is horizontally aligned, i.e. rotated, such that the
baseline is parallel to the x-axis of the image.
• Slant Correction: Applying a shear transformation, the writing’s slant is trans-
formed into an upright position.
• Line Positioning: The word's total extent in the vertical direction is normalized to
a standard value. Moreover, by applying a vertical scaling operation, the locations
of the upper and lower baselines are adjusted to a standard position.
Fig. 5.3. Illustration of the sliding window technique. A window is moved from left to right and
features are calculated for each position of the window. (For graphical representation purposes, the
window depicted here is wider than one pixel.)
The first three features characterize the window from a global point of view. They
include information about how many pixels lie in which region of the window, and
how they are distributed. The other features represent additional information about the writing.
Features four and five define the position of the upper and the lower contour in
the window. The next two features, number six and seven, give the orientation
of the upper and the lower contour in the window by the gradient of the contour
at the window’s position. As feature number eight the number of black-white
transitions in vertical direction is used. Finally, feature number nine gives the
number of black pixels between the upper and lower contour. Notice that all these
features can be easily computed from the binary image of a text line. However, to
make the features robust against different writing styles, careful preprocessing, as
described in Subsection 5.2.1, is necessary.
To summarize, the output of the feature extraction phase is a sequence of 9-
dimensional feature vectors. For each word to be recognized there exists one
such vector per pixel along the x-axis, i.e. along the horizontal extension of the
considered word.
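As an illustration, the nine per-column features can be computed from a binary word image roughly as follows. The precise definition of the first three global features and the normalisations used here are assumptions paraphrased from the text, not the original system's code.

import numpy as np

def column_features(binary_word):
    """One 9-dimensional feature vector per image column (a sketch).

    binary_word: 2-D array, 1 for ink pixels, 0 for background.
    """
    h, w = binary_word.shape
    feats = np.zeros((w, 9))
    prev_upper = prev_lower = h / 2.0
    for x in range(w):
        col = binary_word[:, x]
        ys = np.nonzero(col)[0]
        n_black = len(ys)
        # 1-3: global description of the window (assumed here to be the
        #      pixel weight, centre of gravity and second-order moment).
        feats[x, 0] = n_black / h
        feats[x, 1] = ys.mean() / h if n_black else 0.5
        feats[x, 2] = (ys ** 2).mean() / h ** 2 if n_black else 0.25
        # 4-5: position of the upper and lower contour.
        upper = ys.min() if n_black else prev_upper
        lower = ys.max() if n_black else prev_lower
        feats[x, 3], feats[x, 4] = upper / h, lower / h
        # 6-7: orientation (gradient) of the upper and lower contour.
        feats[x, 5] = upper - prev_upper
        feats[x, 6] = lower - prev_lower
        prev_upper, prev_lower = upper, lower
        # 8: number of black-white transitions in the vertical direction.
        feats[x, 7] = np.count_nonzero(np.diff(col) != 0)
        # 9: black pixels between the contours (normalised by the height).
        if n_black:
            feats[x, 8] = col[int(upper):int(lower) + 1].sum() / h
    return feats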
Fig. 5.4. HMM for a single character with linear transition structure.
states. One or several of the states are defined as final states. For each state a like-
lihood value for each possible observation is defined. If there is a finite number
of observations then a probability for each observation, i.e. feature vector, is de-
fined, but if we have continuous observation vectors a probability distribution is
used. A valid sequence of states for an observation sequence oseq = o1 , o2 , . . . , on
is sseq = s1 , s2 , . . . , sn where sn is a final state. Note that the number of states
in sseq is the same as the number of observations in oseq . The likelihood of the
sequence of states sseq is the product of the likelihoods of observing oi in state
si for all observations, multiplied by the probabilities of the transitions from state
si to si+1 for all i ∈ {1, . . . , n − 1}. There are two possibilities to define the
likelihood of an observation sequence oseq for a given HMM. Either the highest
likelihood of all possible state sequences is used (Viterbi recognition), or the sum
of the likelihoods of all possible state sequences is considered as the likelihood of
the observation sequence (Baum-Welch recognition). In the system described in
this paper the first possibility is used. For details see,33 for example.
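As an illustration, the Viterbi option can be sketched as follows for a single HMM working in the log domain; the emission and transition log-likelihood matrices are assumed to be given, and the data layout is an assumption made for this sketch.

import numpy as np

def viterbi_log_likelihood(log_b, log_a, log_pi, final_states):
    """Likelihood of an observation sequence under Viterbi decoding.

    log_b:  (n_obs, n_states) emission log-likelihoods log p(o_t | s)
    log_a:  (n_states, n_states) transition log-probabilities
    log_pi: (n_states,) initial state log-probabilities
    Returns the log-likelihood of the best state sequence that ends in a
    final state (the "Viterbi recognition" option described above).
    """
    n_obs, n_states = log_b.shape
    delta = log_pi + log_b[0]
    for t in range(1, n_obs):
        # best predecessor for every state, then add the emission term
        delta = np.max(delta[:, None] + log_a, axis=0) + log_b[t]
    return np.max(delta[list(final_states)])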
In word recognition systems with a small vocabulary, it is possible to build an
individual HMM for each word. But for large vocabularies this method doesn't
work anymore because of the lack of sufficient training data. Therefore, in our
system an HMM is built for each character. The use of character models allows
us to share training data. Each instance of a letter in the training set has an impact
on the training and leads to a better parameter estimation.
To achieve high recognition rates, the character HMMs have to be fitted to
the problem. In particular the number of states, the possible transitions and the
type of the output probability distributions have to be chosen. In our system each
character model consists of 14 states. This number has been found empirically.
(The rather high number can be explained by the fact that the sliding window
used for feature extraction is only one pixel wide and that many different writing
Fig. 5.5. Recognition network obtained by concatenating character HMMs, illustrated with example words such as "door" and "more".
styles are present in the used database.) Because of the left to right direction
of writing, a linear transition structure has been chosen for the character models.
From each state only the same or the succeeding state can be reached. (A graphical
representation of the HMMs used in our system is shown in Fig. 5.4.) Because
of the continuous nature of the features, continuous probability distributions are
used for them. Each feature has its own probability distribution, and the likelihood
of an observation in a state is the product of the likelihoods calculated for all
features. This separation of the elements of the feature vector reduces the number
of free parameters, because no covariance terms must be calculated. The probability
distributions of all states and features are assumed to be Gaussian, so that only two
free parameters per distribution exist, namely the mean and the variance. The
initialization of the models is done by Viterbi alignment, which segments the
training observations and recomputes the free parameters of the models, i.e. the
mean and variance of each probability distribution and the transition probabilities
between the states. To adjust these free parameters during training, the Baum-
Welch algorithm33 is used.
To model entire words, the character models are concatenated with each other.
Thus a recognition network is obtained (see Fig. 5.5). Note that this network
doesn’t include any contextual knowledge on the character level, i.e., the model
of a character is independent of its left and right neighbor. In the network the best
path is found with the Viterbi algorithm.33 It corresponds to the desired recogni-
tion result, i.e., the best path represents the sequence of characters with maximum
probability, given the image of the input word. The architecture shown in Fig. 5.5
makes it possible to avoid the difficult task of segmenting a word into individual
characters. More details of the handwritten text recognizer can be found in.6
In this section the ensemble creation methods used in this paper are described.
Each ensemble creation method takes a base classifier and a training set as input
and returns a number of trained instances of the base classifier as a result. In the
first subsection general aspects of ensemble creation are discussed. Then details
of the various methods are given.
A well-performing ensemble creation method should have at least two properties.
First, the method should create diverse classifiers, which means that the misclas-
sification of patterns should have a low correlation across different classifiers (or
in other words, the recognition rate of a classifier Ci on the patterns misclassified
by another classifier Cj should be close to the average recognition rate of Ci ). In
the ideal case independent classifiers are created, but this is almost impossible in
real world applications. The diversity of classifiers is crucial, because all of the
known combination rules can only increase the performance of single classifiers
if they are used with an ensemble of diverse classifiers. It is well known that a
high correlation between the errors committed by individual classifiers may lead
to a decreasing performance of the ensemble when compared to the best individ-
ual classifier. For a more detailed discussion of classifier diversity the reader is
referred to.35
The second requirement is that an ensemble creation method should produce
individual classifiers whose recognition rate is not much lower than that of the
trained base classifier. It is obvious that the recognition rate of an ensemble using
a combination rule depends on the performance of its individual members. There
are some ensemble creation methods that have the potential of creating classifiers
which outperform the best base classifier. But if many members of an ensemble
have a poor performance they may eventually become dominant over the well-
performing classifiers. To avoid performance degradation, an ensemble creation
method should in particular avoid overfitting the training data.
In the following, four ensemble creation methods, namely, Bagging, Ad-
aBoost, random subspace, and architecture variation are introduced. These meth-
ods were originally proposed in the area of machine learning. Note that their
quality with regard to the two properties discussed above is application dependent
and can’t be guaranteed a priori.
5.3.2. Bagging
Bagging,18 an acronym for bootstrapping and aggregating, was among the first
methods proposed for ensemble creation. Given a training set S of size n, bagging
generates N new training sets S1 , . . . , SN , each of size n, by randomly drawing
elements of the original training set, where the same element may be drawn mul-
tiple times. If the probability of being drawn is equally distributed over S, as is
the case here, about two thirds of all training elements are contained in each
modified training set Si, some of them multiple times. Each of the new sets Si is
used to train exactly one classifier. Hence an ensemble of N individual classifiers
is obtained from N new training sets.
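The sampling step can be sketched as follows; the base-classifier interface (train) is a hypothetical placeholder standing in for the HMM recognizer of Section 5.2.

import random

def bagging_ensemble(train_set, base_classifier_factory, n_classifiers=10):
    """Train N classifiers on bootstrap replicates of the training set.

    Each replicate has the size of the original set and is drawn with
    replacement, so roughly two thirds of the distinct elements appear
    in each replicate (some of them several times).
    """
    ensemble = []
    n = len(train_set)
    for _ in range(n_classifiers):
        replicate = [random.choice(train_set) for _ in range(n)]
        clf = base_classifier_factory()    # e.g. an untrained word recognizer
        clf.train(replicate)               # hypothetical training interface
        ensemble.append(clf)
    return ensemble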
5.3.3. AdaBoost
Similarly to Bagging, AdaBoost36 modifies the original training set for the cre-
ation of the ensemble. To each pattern of the training set a selection probability is
assigned, which is equal for all elements of the training set in the beginning. Then
elements for a new training set are randomly drawn from the original training set
taking the selection probabilities into account. The size of the new training set
is equal to the size of the original one. After the creation of a new training set,
a classifier is trained on this set. Then the new classifier is tested on the origi-
nal training set. The selection probabilities of correctly classified patterns in the
original training set are decreased and the selection probabilities of misclassified
patterns are increased. During the execution of the AdaBoost procedure the se-
lection probabilities are dynamically changing. Hence, unlike Bagging, where the
classifiers are created independently, the classifiers generated by AdaBoost are
dependent on selection probabilities, which in turn depend on the performance of
previously generated classifiers.
The main idea of AdaBoost is to concentrate the training on “difficult” pat-
terns. Note that the first classifier is trained in the same way as the classifiers in
Bagging. The classical AdaBoost algorithm can only be used for two-class prob-
lems, but AdaBoost.M1,36 a simple extension of AdaBoost, can cope with multi-
class problems. Consequently, AdaBoost.M1 was applied in the system described
in this paper.
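A resampling sketch of AdaBoost.M1, following the description above, is given below. The pattern attributes (p.x, p.label) and the classifier interface (train/classify) are hypothetical placeholders, and the weight update is the standard beta = error / (1 - error) rule of Freund and Schapire.

import random

def adaboost_m1(train_set, base_classifier_factory, n_classifiers=10):
    """Resampling sketch of AdaBoost.M1."""
    n = len(train_set)
    probs = [1.0 / n] * n                        # selection probabilities
    ensemble = []
    for _ in range(n_classifiers):
        sample = random.choices(train_set, weights=probs, k=n)
        clf = base_classifier_factory()
        clf.train(sample)
        correct = [clf.classify(p.x) == p.label for p in train_set]
        error = sum(w for w, ok in zip(probs, correct) if not ok)
        if error >= 0.5:                         # weak learner not good enough
            break
        ensemble.append(clf)
        if error == 0.0:
            break
        beta = error / (1.0 - error)
        # decrease the probability of correctly classified patterns ...
        probs = [w * beta if ok else w for w, ok in zip(probs, correct)]
        total = sum(probs)                       # ... and renormalise
        probs = [w / total for w in probs]
    return ensemble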
In the random subspace method19 an individual classifier uses only a subset of all
features for training and testing. The size of the subset is fixed and the features
are randomly chosen from the set of all features.
For the handwritten text recognizer described in Section 5.2 the situation is
special in the sense that the number of available features is rather low. (As de-
scribed in Section 5.2, only nine features are extracted at each position of the win-
dow.) Therefore, the features are not completely randomly chosen. If the number
of classifiers which use feature fi is denoted by n(fi ), then the following relation
holds: ∀i, j |n(fi ) − n(fj )| <= 1. This means that each individual feature is
used in approximately the same number of classifiers. Therefore, all features have
approximately the same importance. By means of this condition it is enforced that
the information of every feature is exploited as much as possible. By contrast,
when choosing completely random feature sets, it is possible that certain features
are not used at all.
In the experiments described in Section 5.6, always subsets of six features
were used. This number was experimentally determined as a suitable value. The
whole training set with feature vectors of reduced dimensionality was used for the
training of each individual classifier.
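The balanced feature assignment can be realised, for example, with the greedy strategy sketched below, which always gives a classifier the currently least-used features. This particular strategy is an assumption; the text only requires the balancing condition |n(fi) − n(fj)| ≤ 1.

import random

def balanced_feature_subsets(n_features=9, subset_size=6, n_classifiers=10):
    """Choose one feature subset per classifier so that every feature is
    used in approximately the same number of classifiers: each classifier
    takes the currently least-used features, ties broken at random."""
    usage = [0] * n_features
    subsets = []
    for _ in range(n_classifiers):
        order = sorted(range(n_features),
                       key=lambda f: (usage[f], random.random()))
        chosen = sorted(order[:subset_size])
        for f in chosen:
            usage[f] += 1
        subsets.append(chosen)
    return subsets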
Another way to create an ensemble out of a base classifier is to vary its architec-
ture. In a feed-forward neural network, for example, one may change the number
of hidden layers or the number of neurons in each layer.21 Similar possibilities
exist for HMMs. Our base classifier was changed as follows.
First, the linear topology was replaced by the Bakis model (see Fig. 5.6). This
topology allows more flexibility in the decoding process by skipping certain states.
Next, two additional architectures were implemented. The HMM models used in
our base classifier don’t include any ligature statesa . But the transition from one
character to the next is often context dependent. Therefore, if certain character
pairs are not sufficiently well represented in the training set, misalignments at the
beginning and at the end of a character model during decoding may be expected.
To account for this kind of problem, the semi-jumpin and semi-jumpout architectures
shown in Fig. 5.6 were introduced. Here the first or last (n − 4)/2 states of a
linear model may be skipped (with n denoting the total number of states of the
considered HMM).
a Here the term ligature denotes a connection stroke between two consecutive characters
Fig. 5.6. HMM topologies (linear, Bakis, semi-jumpin and semi-jumpout) for a small HMM with 6
emitting states. Note that the HMMs of the classifier in Section 5.2 have 14 emitting states.
Normally the columns of a word image are read from left to right. Another
possibility is to read them from right to left. Because the Viterbi search used in the
decoding phase is a suboptimal procedure that prunes large portions of the search
space, the results of a forward and a backward scan of the word are not necessarily
the same. To implement a right-to-left scan of the image, only the concatenation
of character HMMs needs to be changed appropriately.
Apparently, left-to-right as well as right-to-left scanning can be combined with
any of the architectures shown in Fig. 5.6. Therefore, a total of eight different
classifiers were generated. Each of these classifiers was trained on the full training
set.
In this section the combination schemes used in our multiple classifier system for
handwriting recognition are described.
For each word class the sum of weights assigned by the individual classifiers
is calculated, and the combined result is the word class that has the largest
sum of weights. In the performance weighted voting scheme, which is denoted as
perf voting in the following, the weight of the classifier is equal to the classifier’s
performance (i.e. recognition rate) on the training set. The system described in
Section 5.2 was found to have a good generalization power, i.e. the results on the
training set allow a good estimation of the behavior of the system on test data. So
the training set was used for the evaluation of the performance of the classifiers.
For other classifiers it may be necessary to use a separate validation set to evaluate
the performance of the created classifiers. (The Nearest-Neighbor classifier, for
example, always has a recognition rate of 100 % on its training set.)
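A sketch of the weighted voting rule shared by perf voting and ga voting is given below: each classifier contributes only its top-ranked class, and the weights are either the training-set recognition rates or the values found by the genetic algorithm. The interface is illustrative.

from collections import defaultdict

def weighted_vote(top_classes, weights):
    """Combine the top-ranked class of each classifier by weighted voting.

    top_classes: the word class returned by each classifier
    weights:     one weight per classifier (performance-based or GA-optimised)
    Returns the class with the largest sum of weights.
    """
    scores = defaultdict(float)
    for cls, w in zip(top_classes, weights):
        scores[cls] += w
    return max(scores, key=scores.get)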
Using the performance of a classifier as its weight is based on the intuitive as-
sumption that classifiers with a high recognition rate are more trustworthy than
classifiers that perform poorly. However, there is no objective proof that this strat-
egy is optimal. Under a more general approach, one considers the set of weights
as free parameters in a multiple classifier system, and tries to find the combination
of values that leads to the best performance of the whole system. Out of many
possible optimization procedures it was decided to use a genetic algorithm37 for
weight optimization. Among the reasons to favor a genetic approach over other
methods was the simplicity and elegance of genetic algorithms as well as their
demonstrated performance in many other complex optimization problems.38–40
The training set used to train the individual classifiers was also used to derive
the optimal combination of weights, assuming that the obtained values lead to
a good performance on the test set as well. The genetic algorithm for weight
optimization will be described in Section 5.5 in greater detail. In the following
this algorithm will be denoted as ga voting.
Under this scheme a normal voting procedure is executed first, i.e., if the occur-
rence of a class among the results of the classifiers is higher than the occurrence of
any other class then this class is output as the combined result. A tie occurs if no
unique result is obtained. For some applications it may be sufficient to just reject
a pattern if a tie occurs, but here we use a more general approach. In case of a tie
we focus our attention on those classes that compete under the tie and apply one
of the above-mentioned combination schemes. So there is voting with tie handling
by maximum score rule, by performance weighted voting, and by weighted voting
First proposed in,37 genetic algorithms have been found to be robust and practical
optimization methods. In a genetic algorithm a possible solution of the problem
under consideration is represented by a chromosome. In the initialization step of
the algorithm a set of chromosomes is created randomly. The actual set of chro-
mosomes is called the population. A fitness function is defined to represent the
quality of the solution given by a chromosome. Only the chromosomes with the
highest values of this fitness function are allowed to reproduce. In the reproduction
phase, new chromosomes are created by fusing the information of two existing
chromosomes (crossover) and by randomly changing them (mutation). Finally, the
chromosomes with the lowest values of the fitness function are removed. This
reproduction and elimination step is repeated until a predefined termination
condition becomes true. In the following we describe the genetic algorithm that is
used in our multiple classifier system in more detail.
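A minimal sketch of such a genetic search for the voting weights is given below; the population size, mutation model and selection strategy are illustrative assumptions, and vote_accuracy stands for a user-supplied fitness function returning the recognition rate of the weighted-voting ensemble on the training set.

import random

def optimize_weights(vote_accuracy, n_classifiers, pop_size=30,
                     generations=100, mutation_sigma=0.1):
    """Genetic search for the voting weights (a sketch)."""
    pop = [[random.random() for _ in range(n_classifiers)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=vote_accuracy, reverse=True)
        survivors = pop[:pop_size // 2]              # keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, n_classifiers)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [max(0.0, w + random.gauss(0, mutation_sigma))  # mutation
                     for w in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=vote_accuracy)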
5.6. Experiments
All ensemble creation methods discussed in Section 5.3 were implemented and
tested on a part of the IAM database. This database is publicly availableb and has
meanwhile been used by several research groups.31 The original version of the
b http://www.iam.unibe.ch/~zimmerma/iamdb/iamdb.html
database contains complete lines of text as its basic entities, without any segmen-
tation of a line of text into individual words.31 Meanwhile, however, part of this
database has been segmented into individual words.31,32 A subset of these words
was used in the experiments described in this section.
The training set used in the experiments contains 9861 and the test set 1066
word instances over a vocabulary of size 2296. The test set was chosen in such a
way that none of its writers was represented in the training set. Hence all experi-
ments described in this paper are writer independent. The total number of writers
who contributed to the training and test set is 81. A small sample of words from
this database is shown in the Appendix.
Table 5.1 shows the results of the experiments. The recognition rate of the
classifier with the original architecture and training set was 66.23 %. Bagging,
AdaBoost and random subspace method each created 10 classifiers while the ar-
chitecture variation method generated only 8 (see Subsection 5.3.5).
Table 5.1. Recognition rates achieved by the ensemble creation methods under different combination
rules. The best result for each ensemble creation method is printed in bold face.
HMM, and identical score values from different HMMs don’t necessarily imply
that the word classes which correspond to these values have the same probability
of being correct. Note that the performance of the maximum combination rule
is especially poor for architecture var and random subspace, where the HMMs
of the classifiers are very different. A possibility to overcome this problem is to
normalize the score values for each classifier. However, this possibility has not
been explored in the context of this paper and is left to future research.
All other combination schemes lead to an increase of the recognition rate for
all ensemble creation methods when compared to the original classifier. The pro-
posed ga voting combination was the best scheme for three out of the four en-
semble creation methods considered in the tests. The quality of the other schemes
relative to each other varied among the tests.
Note that good results were also achieved with the simple weighting mechanism of
perf voting. However, the superior performance of ga voting over perf voting does
not hold any longer for voting with tie handling: here ties ga voting is outperformed
by ties perf voting for two ensemble creation methods. The reason for this behavior
is that the weights calculated by the genetic algorithm are optimized for weighted
voting and not for voting with tie handling by weighted voting. Nevertheless ties
ga voting is clearly superior to the original classifier.
To compare the different ensemble methods in more detail the average perfor-
mance and the standard deviation of the performances of the individual classifiers
were calculated. Those values are shown in Table 5.2.
Table 5.2. Average performance and standard deviation of the performance of the individual classifiers.
The performance of the original classifier is 66.23 %.
Bagging produced classifiers with very similar performances, which were on average
almost as good as the original classifier. As the performance of the ensemble is not
much higher than that of the original classifier, compared with the other ensemble
methods, it may be concluded that the diversity of the classifiers is low.
The classifiers produced by AdaBoost had a wider range of performance than
Bagging. Although the average performance of the individual classifier is slightly
lower than in Bagging, a much better ensemble performance was achieved. This
indicates that the classifiers are quite diverse. AdaBoost was the only ensemble
method where ga voting did not produce the best result. A possible reason for this
is the following. In AdaBoost the performance of the ensemble on the training set
is optimized by focusing on “difficult” patterns. Such optimization on the training
set normally leads to classifiers which are much better on the training set than on
the test set. As the genetic algorithm works with the results of the training set,
it may overestimate the performance of some classifiers and produce suboptimal
weights. This problem may be overcome by using a separate validation set for
calculating the weights.
The average performance of the classifiers produced by random subspace was
much lower than the performance of AdaBoost, yet the ensemble performance
was still quite good. So the diversity of classifiers increased again. For random
subspace, the largest advantage of ga voting over the other combination
schemes was achieved (ga voting had a 0.56 % higher performance than the sec-
ond best scheme). An analysis of the calculated weights showed that the weights
of three out of the ten classifiers were so low that in fact those classifiers were
almost irrelevant for the combination. This means that ga voting was capable of
discovering the classifiers which lower the ensemble performance and of excluding
them from the combination.
The classifiers produced by architecture var. had on average a very low performance
(20.26 % lower than the performance of the original classifier). Yet good ensemble
results were achieved by this method, which leads to the conclusion that the
classifiers must be very diverse. For all ensemble methods but architecture var.,
perf voting and ties perf voting produced the same results.
When using ga voting or ties ga voting, the genetic algorithm must be executed in
addition to testing all classifiers on the training set. Yet the time consumed by the
genetic algorithm is over 1000 times lower than that of the tests on the training set,
so this additional overhead is not significant.
5.7. Conclusions
In this paper the recognition of cursively handwritten words was considered. Be-
cause of the large number of classes involved and the great variations of words
from the same class, which are due to the considerable number of individual hand-
writing styles, this is regarded as a difficult problem in pattern recognition. Multi-
ple classifier systems have demonstrated very good performance in many pattern
recognition problems recently. In this paper we have explored a number of clas-
sifier ensemble generation methods and related combination schemes. As hidden
Markov Models (HMMs) are considered to be one of the most powerful meth-
ods for cursive handwriting recognition today, we have focused on ensembles of this type of classifier.
Acknowledgments
The research was supported by the Swiss National Science Foundation (Nr. 20-
52087.97). The authors thank Dr. Urs-Victor Marti for providing the handwrit-
ten word recognizer and Matthias Zimmermann for the segmentation of a part
of the IAM database. Additional funding was provided by the Swiss National
Science Foundation NCCR program “Interactive Multimodal Information Man-
agement (IM)2” in the Individual Project “Scene Analysis”.
References
1. C.Y. Suen, C. Nadal, R. Legault, T.A. Mai, and L. Lam. Computer recognition of
unconstrained handwritten numerals. Proc. of the IEEE, 80(7):1162–1180, 1992.
2. J.-C. Simon. Off-line cursive word recognition. Proc. of the IEEE, 80(7):1150–1161,
July 1992.
3. S. Impedovo, P. Wang, and H. Bunke, editors. Automatic Bankcheck Processing. World
Scientific Publ. Co, Singapore, 1997.
4. A. Kaltenmeier, T. Caesar, J.M. Gloger, and E. Mandler. Sophisticated topology of
hidden Markov models for cursive script recognition. In 2nd Int. Conf. on Document
Analysis and Recognition, Tsukuba Science City, Japan, pages 139–142, 1993.
5. G. Kim, V. Govindaraju, and S.N. Srihari. Architecture for handwritten text recog-
nition systems. In S.-W. Lee, editor, Advances in Handwriting Recognition, pages
163–172. World Scientific Publ. Co., 1999.
6. U.-V. Marti and H. Bunke. Using a statistical language model to improve the per-
formance of an HMM-based cursive handwriting recognition system. Int. Journal of
Pattern Recognition and Art. Intelligence, 15:65–90, 2001.
7. J. Kittler and F. Roli, editors. 1st International Workshop on Multiple Classifier Sys-
tems, Cagliari, Italy, 2000. Springer.
8. J. Kittler and F. Roli, editors. 2nd International Workshop on Multiple Classifier Sys-
tems, Cambridge, UK, 2001. Springer.
9. A. Bellili, M. Gilloux, and P. Gallinari. An hybrid MLP-SVM handwritten digit recog-
nizer. In 6th International Conference on Document Analysis and Recognition, pages
28–32, 2001.
10. A. Brakensiek, J. Rottland, A. Kosmala, and G. Rigoll. Off-line handwriting recogni-
tion using various hybrid modeling techniques and character n-grams. In 7th Interna-
tional Workshop on Frontiers in Handwritten Recognition, pages 343–352, 2000.
11. Y. Huang and C. Suen. Combination of multiple classifiers with measurement values.
In Second International Conference on Document Analysis and Recognition, pages
598–601, 1993.
12. D. Lee and S. Srihari. Handprinted digit recognition: A comparison of algorithms. In
Third International Workshop on Frontiers in Handwriting Recognition, pages 153–
162, 1993.
13. A. Rahman and M. Fairhurst. An evaluation of multi-expert configurations for the
recognition of handwritten numerals. Pattern Recognition, 31(9):1255–1273, 1998.
14. X. Wang, V. Govindaraju, and S. Srihari. Multi-experts for touching digit string recog-
nition. In 5th International Conference on Document Analysis and Recognition, pages
800–803, 1999.
15. L. Xu, A. Krzyzak, and C. Suen. Methods of combining multiple classifiers and their
applications to handwriting recognition. IEEE Transactions on Systems, Man and Cy-
bernetics, 22(3):418–435, 1992.
16. A. Rahman and M. Fairhurst. Automatic self-configuration of a novel multiple-expert
classifier using a genetic algorithm. In Proceedings of the 7th International Conference
on Image Processing and its Applications, pages 57–61.
17. T. G. Dietterich. Ensemble methods in machine learning. In7 , pages 1–15, 2000.
18. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
19. T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans.
on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
20. T.G. Dietterich and E.B. Kong. Machine learning bias, statistical bias, and statisti-
cal variance of decision tree algorithms. Technical report, Department of Computer
Science, Oregon State University, 1995.
21. D. Partridge and W. B. Yates. Engineering multiversion neural-net systems. Neural
Computation, 8(4):869–893, 1996.
22. R. Duin and D. Tax. Experiments with classifier combination rules. In7 , pages 16–29,
2000.
23. C. Suen and L. Lam. Multiple classifier combination methodologies for different out-
put level. In7 , pages 52–66, 2000.
24. J. Kittler, R. Duin, and M. Hatef. On combining classifiers. IEEE Trans. on Pattern
Analysis and Machine Intelligence, 20:226–239, 1998.
25. T. Huang and C. Suen. Combination of multiple experts for the recognition of uncon-
strained handwritten numerals. IEEE Trans. on Pattern Analysis and Machine Intelli-
gence, 17:90–94, 1995.
26. T.K. Ho, J.J. Hull, and S.N. Srihari. Decision combination in multiple classifier sys-
tems. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16:66–75, 1994.
27. G. Houle, D. Aragon, R. Smith, M. Shridhar, and D. Kimura. A multilayered
corroboration-based check reader. In J. Hull and S. Taylor, editors, Document Analysis
System 2, pages 495–546. World Scientific, 1998.
28. L. Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. Proc. of the IEEE, 77(2):257–285, 1989.
29. L. Lam, Y.-S. Huang, and C. Suen. Combination of multiple classifier decisions for
optical character recognition. In41 , pages 79–101.
30. K. Sirlantzis, M. Fairhurst, and M. Hoque. Genetic algorithms for multiclassifier sys-
tem configuration: A case study in character recognition. In8 , pages 99–108.
31. U. Marti and H. Bunke. The IAM-database: an English sentence database for off-line
handwriting recognition. Int. Journal of Document Analysis and Recognition, 5:39–
46, 2002.
32. M. Zimmermann and H. Bunke. Automatic segmentation of the IAM off-line database
for handwritten English text. In Proc. of the 16th Int. Conference on Pattern Recogni-
tion, volume 4, pages 35–39, Quebec, Canada, 2002.
33. L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall,
1993.
34. A. Kundu. Handwritten word recognition using hidden Markov model. In41 , pages
157–182.
35. A. Krogh and J. Vedelsby. Neural networks ensembles, cross validation, and active
learning. In Advances in Neural Information Processing Systems, volume 7, pages
231–238. MIT Press, 1995.
36. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and Systems Sciences,
55(1):119–139, 1997.
37. J. Holland. Adaption in Natural and Artificial Systems. University of Michigan Press,
1975.
38. F.J. Ferri, V. Kadirkamanathan, and J. Kittler. Feature subset search using genetic
CHAPTER 6
6.1. Introduction
fuzzy logic inference. DS theory makes inferences from incomplete and uncertain
knowledge, provided by different independent knowledge sources. A first advan-
tage of DS theory is its ability to deal with ignorance and missing information.
In particular, it provides explicit estimation of imprecision and conflict between
information from different sources and can deal with any unions of hypotheses
(clusters).4 This is particularly useful to represent ”mixed” pixels in image seg-
mentation problems. The main limitation of Bayesian inference is that it cannot
model imprecision about uncertainty measurement. The degree of belief we have
on a union of clusters (without being able to discriminate between them) should
be shared by all the simples hypotheses, thus penalizing the good one. DS theory
handles uncertain and incomplete information through the definition of two dual
non additive measures: plausibility and belief. These measures are derived from
a density function, m, called basic probability assignment (bpa) or mass function.
This function assigns evidence to a proposition (hypothesis). The derivation of the
bpa is the most crucial step since it represents both the knowledge about the
application and the uncertainty incorporated in the selected information source.
The definition of the bpa remains a difficult problem when applying DS theory to
practical applications such as image processing. For example, the bpa may be
derived, at pixel level, from probabilities5–7 or from the distance to cluster centers.8
In this work the bpa is estimated in an unsupervised way, using fuzzy membership
functions to take into account the ambiguity within pixels. This ambiguity is due to
the possible multi-valued levels of brightness in the image; the indeterminacy is due
to inherent vagueness rather than randomness. The number of clusters in the image
is supposed to be known. In7 the bpa estimation is based on the assumption that the
probability distribution of the gray level values (the image histogram) follows a
Gaussian model. Our estimation approach does not make any assumption about the
probability distribution of the gray level histogram and is not limited to only two sources.
Θ = {H1 , H2 , . . . , Hq }
The representation scheme, Θ, defines the working space for the desired ap-
plication since it consists of all propositions for which the information sources
can provide evidence. Information sources can distribute mass values on subsets
of the frame of discernment, $A_i \in 2^{\Theta}$ (6.1). An information source assigns a mass value $m(A_i)$ to each of them, with

$$0 \le m(A_i) \le 1. \qquad (6.1)$$

The bpa has to fulfill the conditions $m(\emptyset) = 0$ and $\sum_{A_i \in 2^{\Theta}} m(A_i) = 1$. If an information source cannot distinguish between two propositions, $H_i$ and $H_j$, it assigns a mass value to their union ($H_i \cup H_j$). Mass distributions from different information sources, $m_j$ ($j = 1, \dots, d$), are combined with Dempster's orthogonal rule (6.2). The result is a new distribution, $m(A_k) = (m_1 \oplus m_2 \oplus \dots \oplus m_d)(A_k)$, which incorporates the joint information provided by the sources:

$$m(A_k) = (1 - K)^{-1} \sum_{A_1 \cap A_2 \cap \dots \cap A_d = A_k} \ \prod_{1 \le j \le d} m_j(A_j) \qquad (6.2)$$

$$K = \sum_{A_1 \cap A_2 \cap \dots \cap A_d = \emptyset} \ \prod_{1 \le j \le d} m_j(A_j) \qquad (6.3)$$
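As an illustration, Dempster's orthogonal rule (6.2)-(6.3) for two sources can be sketched as follows; representing focal elements as frozensets of cluster labels is an assumption made for this sketch. Combination of more than two sources is obtained by repeated application, since the rule is associative and commutative.

from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's orthogonal rule.

    m1, m2: dicts mapping focal elements (frozensets of hypotheses) to masses.
    Returns the combined mass function; `conflict` is the mass K of (6.3).
    Assumes the sources are not totally conflicting (K < 1).
    """
    combined = {}
    conflict = 0.0
    for (a1, w1), (a2, w2) in product(m1.items(), m2.items()):
        inter = a1 & a2
        if inter:
            combined[inter] = combined.get(inter, 0.0) + w1 * w2
        else:
            conflict += w1 * w2            # mass assigned to the empty set
    norm = 1.0 - conflict                  # the (1 - K)^-1 factor of (6.2)
    return {a: w / norm for a, w in combined.items()}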
The equations (6.4) and (6.5) imply that $Bel(\cdot)$ and $Pls(\cdot)$ are dual measures related by

$$Pls(A_i) = 1 - Bel(\neg A_i) \qquad (6.6)$$
the need for a statistical model of the data. For image fuzzification we use a
histogram-based gray-level fuzzification.12 Thus, we use the gray level g instead of
the intensity of the (m, n)th pixel xmn (Figure 6.1). The FCM then operates only on
the histogram and is consequently faster than the conventional version,11 which
processes the whole data set X.
In the image segmentation problem, Θ is the set of all the clusters of the image,
|Θ| = C is the number of clusters, and 2^Θ contains all the possible unions of
clusters. The hypotheses considered in the DS formulation are: ∅ (whose mass is
null), the simple hypotheses Hi and the compound hypotheses Hj ∪ . . . ∪ Hl. For
the choice of the bpa of Hk and Hl, the following strategy is used:
(1) Assign a non-null mass to Hk ∪ Hl if Hk and Hl are not discriminated on the
image (not distinguishable by the sensor) (Figure 6.1). There is an ambiguity
between Hk and Hl; in this case, assigning a pixel with gray level g to cluster
k or l using the fuzzy membership rule is not reliable (μk (g) ≈ μl (g)).
(2) Assign a null mass to Hk ∪ Hl if Hk and Hl are discriminated on the image.
There is little or no ignorance about clusters k and l.
Fig. 6.1. Plot of the fuzzy membership degrees as a function of gray level, generated by the FCM
algorithm (RX image). vi stands for the centroid of the ith cluster; the region where the curves of H2
and H3 overlap (H2 ∪ H3) is a region of high ambiguity.
where $\sum_{i=1}^{C} \mu_i(g) = 1$. I is the set of cluster indices and its cardinal is the number of clusters, C.
$\arg(\beta) = \arg\big(\max_{1 \le i \le C} \mu_i(g)\big)$ is the maximum fuzzy membership defuzzification rule. The pixel with gray level g is assigned to cluster $\arg(\beta)$.
In the proposed fusion scheme, for all values of C, both simple and compound
hypotheses are taken into account. In the framework of histogram-based
segmentation and for C ≥ 3, ambiguity cannot occur between all the C classes at
once. Thus, a null mass is assigned to the union of all hypotheses
(Eqs. (6.12),(6.16),(6.22),(6.25)). For C = 2, in general there is at least one pixel
where the two classes (hypotheses) are not sufficiently distinguishable from each
other, so that a new compound hypothesis is created with a non-null mass
(Eq. (6.9)). However, if the two hypotheses are well distinguishable from each
other, the mass value of their union is null (Eq. (6.20)). For all values of C, and in
all cases (with little or high ambiguity), the mass value assigned to a single
hypothesis is proportional to the corresponding fuzzy membership degree
(Eqs. (6.10),(6.13),(6.17),(6.18),(6.19),(6.23),(6.26)).
(1) For C = 2

$$m\Big(\bigcup_{i=1}^{C} H_i\Big)(g) = \alpha \qquad (6.9)$$

$$m(H_i)(g) = [1 - \alpha] \times \mu_i(g), \quad i \in I \qquad (6.10)$$

(2) For C = 3

$$m\Big(\bigcup_{i=1}^{C} H_i\Big)(g) = 0 \qquad (6.16)$$

$$m(H_t)(g) = \Big[1 - m\Big(\bigcup_{i=1}^{C} H_i\Big)(g) - m(H_k \cup H_l)(g)\Big] \times \mu_t(g) \qquad (6.17)$$

where $t, (k, l)_{k \neq l} \in I$.

(1) For C = 2

(2) For C = 3

$$m\Big(\bigcup_{i=1}^{C} H_i\Big)(g) = 0 \qquad (6.25)$$

$$m(H_t)(g) = \Big(1 - m\Big(\bigcup_{i=1}^{C} H_i\Big)(g)\Big) \times \mu_t(g), \quad t \in I \qquad (6.26)$$
ξ is a threshold value. We make the assumption that the images are well registered.
Since the images are clustered separately, a spatial correspondence between the
cluster labels of the different images is necessary, so that pixels representing the
same physical object of the scene can be superimposed and the different information
sources can be correctly combined (6.2). The label-to-label mapping strategy is
described in.13 The use of the image histogram loses the spatial information about
the pixel arrangement and the spatial correlation between adjacent pixels.
Furthermore, the memberships resulting from the FCM algorithm are considerably
less reliable in a very noisy environment. To reduce the effect of noise and to
improve the classification results, contextual processing is performed. Thus, before
bpa estimation, the membership value of each pixel is updated using the contextual
membership values of its neighborhood. In this work, 3 × 3 neighborhood mean and
median filters are used.13
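For the two-cluster case, the estimation of the bpa from the fuzzy memberships (Eqs. (6.9)-(6.10)) can be sketched as follows. The ambiguity test |μ1(g) − μ2(g)| < ξ and the constant α are inferred from the description above; their exact form and values in the original formulation are not reproduced here, so they should be read as placeholders.

def bpa_two_clusters(mu1, mu2, g, xi=0.05, alpha=0.3):
    """Basic probability assignment for C = 2 clusters at gray level g (a sketch).

    mu1, mu2: fuzzy membership functions (arrays indexed by gray level).
    If the two clusters are not distinguishable at g (memberships closer
    than xi), a non-null mass alpha goes to their union (6.9) and the rest
    is shared among the singletons in proportion to their memberships (6.10);
    otherwise the union receives a null mass.  alpha is a placeholder value.
    """
    ambiguous = abs(mu1[g] - mu2[g]) < xi
    union_mass = alpha if ambiguous else 0.0
    return {
        frozenset({'H1', 'H2'}): union_mass,
        frozenset({'H1'}): (1.0 - union_mass) * mu1[g],
        frozenset({'H2'}): (1.0 - union_mass) * mu2[g],
    }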
6.5. Results
The proposed data fusion method is first tested on synthetic images. Two images,
corrupted by Gaussian noise, simulating US and RX acquisitions are shown in
Figure 2. Each image contains four clusters (C=4). In the US image (Fig. 2(a)),
one region (smallest thickness) is confused with the background and in the RX
image (Fig. 2(b)) the greatest thickness is under-exposed and the thicker regions
are not well distinguished. The aim here is to exploit, through using the proposed
data fusion technique, the redundant and complementary information of the two
images in order to correctly segment the image in four clusters. The maximum of
plausibility is used as a decision rule. Figures 2(e) and 2(f) show the DS fusion
result obtained using median and average filters, respectively. ξ is set to 0.05. Note
that within the segmented regions, some artifacts are present (Figs. 2(e)-(f)), re-
flecting the influence of noise present in the initial images (Figs. 2(a)-(b)) on the final
segmentation. Both filters give a good segmentation result, but the regions given
by the average operation are more homogeneous than in the median case. The
four regions are well brought out, and this shows that the information provided by
the two images is well exploited by the fusion scheme. This result also shows that
the estimated bpas provide a good model of the information associated with simple
and compound hypotheses. This also shows the interest of taking the contextual
information into account in the bpa estimation. In order to get a better insight into the
actual ability of the DS fusion based segmentation, in comparison with conven-
tional algorithms which exploit information only from one image, we give in Fig.
2(c) and 2(d) a comparison example. The segmentation results in Figs 2(c) and
2(d) have been obtained using the FCM algorithm. They correspond to the US
and RX images, respectively. When segmentation is performed with a single
image, we observe that 23.94% and 34.94% of pixels have been mis-segmented
for RX and US images respectively. Segmentation errors have been largely re-
duced when exploiting the two images simultaneously through the DS fusion
approach including spatial information. Indeed, in the latter case, only 0.95%
of pixels have been mis-segmented. This large performance difference between
the two types of segmentation approaches can also be easily assessed by visually
comparing the segmentation results. Figure 3 illustrates the application of the pro-
posed fusion scheme to human brain Magnetic Resonance (MR) of three patients
with Multiple Sclerosis (MS) lesions (Figures 3(a)-(f)). Figures 3(a)-(c) represent
T2 -weighted images and Figures 3(d)-(f) the corresponding Proton Density (PD)
weighted images. Each pair of images (T2 , PD) is strongly correlated and also
spatially registered, and show the MS lesions as hypersignal regions. The fused
images are shown in Figures 3(g)-(i). In each patient, regions such as white matter,
grey matter, cerebrospinal fluid (CSF) and background are correctly segmented. This
is of great interest in medical applications, in particular for the estimation of the size and
volume of brain tissues. However, the proposed scheme is not able to separate
the MS lesion regions from the CSF (Figs. 3(g)-(i)). This is essentially due to the fact
that pixels of CSF and MS lesions share the same intensities.
6.6. Conclusion
well registered. The presented results are limited to only one modality. Extensive
tests on real data and an analysis of several decision rules are necessary in order to
evaluate the robustness of the method. Only filters of size 3 × 3 are used; thus,
different window sizes must be tested to show their effect on the fusion results.
Fig. 6.3. Segmentation result of MR images obtained in 3 patients with MS lesions. (a), (b), (c) T2
weighted images. (d), (e), (f) PD weighted images. (g), (h), (i) DS fusion result of the three patients
using average operation.
References
1. D.L. Hall, Mathematical Techniques in Multisensor Data Fusion. (Artech House, MA,
1992).
2. A. Dempster, Upper and lower probabilities induced by a multivalued mapping, An-
nals of Mathematical Statistics 38, 325–339, (1967).
3. G. Shafer, A Mathematical Theory of Evidence. (Princeton University Press, 1976).
4. S. Le Hegarat-Mascle, I. Bloch and D. Vidal-Madjar, Application of DS evidence
theory to unsupervised classification in multiple remote sensing, IEEE Trans. Geosci.
Remote Sensing. 35, 1018–1031, (1997).
5. T. Lee, J.A. Richards and R.H. Swain, Probabilistic and evidential approaches for
multisource data analysis, IEEE Trans. Geosci. Remote Sensing. 25, 283-293, (1987).
6. F. Salzenstein and A.O. Boudraa, Unsupervised multisensor data fusion approach,
Proc. IEEE ISSPA. Kuala Lumpur, 1, 152–155, (2001).
7. M. Rombaut and Y. M. Zhu, Study of Dempster-Shafer theory for image segmentation
applications, Image and vision computing. 20 (1), 15–23, (2002).
8. I. Bloch, Some aspect of Dempster-Shafer evidence theory for classification of multi-
modality medical images taking partial volume effect into account, Pattern Recogni-
tion Letters. 17, 905–916, (1996).
9. P. Smets, The combination of evidence in the transferable belief model, IEEE Trans.
Patt. Anal. Mach. Intell. 12, 447–458, (1990).
10. S. Le Hegarat-Mascle, D. Richard and C. Ottle, Multi-scale data fusion using
Dempster-Shafer evidence theory, Integrated Computer-Aided Engineering. 10, 9–
22, (2003).
11. J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum
Press, NY, (1981).
12. A.O. Boudraa and P. Clarysse, Fast fuzzy gray level image segmentation method, Med-
ical Biological Engineering Computing. 35, 686, (1997).
13. A. Bentabet, Détermination des fonctions de masses dans le cas de la théorie de
l’évidence par coalescence floue. (MSc. Dissertation, Institut National des Sciences
Appliquées de Lyon, France, 1999).
CHAPTER 7
7.1. Introduction
specific marks are needed; the system works with the usual surgical instruments.
The system is based on a sequence of different image processing techniques. The
objective of this paper is to show, in detail, the way that these techniques have
been used, and the results obtained.
The effects of noise on the results of the transformation are a matter of concern
when dealing with real image data. Such noise may be due to random corruption
of the signal during the data acquisition process or it may be a normally distributed
random error in the localization of image points due to the effects of digitizing
continuous data. The characterization and prediction of the effects of noise have
been studied extensively.3,4
In order to reduce the effects of noise on the edge orientation determination, a
Gaussian filter is applied to the original image in this first stage:
$$h(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}} \qquad (7.1)$$

$$h(x, y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2+y^2}{2\sigma^2}}
= \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{y^2}{2\sigma^2}}
= h_{1D}(x) \cdot h_{1D}(y) \qquad (7.2)$$
A 2D Gaussian convolution can be implemented using two orthogonal 1D
Gaussian convolutions. Thus, the computational cost of the filtering stage is linear
instead of quadratic.
Fig. 7.2(b) shows the image output after the Gaussian filtering with a standard
deviation σ = 1.5 and a kernel size of 7.
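The following sketch illustrates the separable implementation of the Gaussian filtering stage described above (Python/NumPy, with illustrative function names); it is only a minimal example of the two orthogonal 1D convolutions, not the authors' code.

import numpy as np

def gaussian_kernel_1d(sigma=1.5, size=7):
    """Sampled 1D Gaussian kernel, normalised to sum to one."""
    x = np.arange(size) - (size - 1) / 2.0
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def separable_gaussian_filter(image, sigma=1.5, size=7):
    """2D Gaussian smoothing as two orthogonal 1D convolutions.

    Cost per pixel is O(size) per pass instead of O(size**2)
    for a direct 2D convolution.
    """
    k = gaussian_kernel_1d(sigma, size)
    img = np.asarray(image, dtype=float)
    # filter each row, then each column
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, tmp)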
Edge orientations are needed to compute the Hough transform. The prior filtering
stage is mandatory since the precision in the orientation of gradient operators is
very sensitive to noise.
For extracting edge orientation a simple gradient operator is used. Given the
filtered image f(x, y), an approximation of the gradient direction θ(x, y) is computed as:

$$\theta(x, y) = \arctan\left(\frac{\Delta y}{\Delta x}\right) \qquad (7.3)$$

where Δy = f(x, y − 1) − f(x, y + 1) and Δx = f(x − 1, y) − f(x + 1, y).
The Sobel operator is not used at this stage. The justification is quite simple:
the Sobel operator performs local smoothing, and in our application there is
no need for further smoothing since the image has already been filtered. So, the
computational load may be reduced by using the central difference masks instead.
The computational load is also reduced by considering only pixels whose gra-
dient magnitude is above a certain threshold, Th. Fig. 7.2(c) shows the orientation
image. Orientations have been quantized into 256 levels.
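As an illustration of the central-difference orientation computation and magnitude thresholding described above, a minimal sketch could look as follows (Python/NumPy; the array indexing convention and the function name are assumptions, not the authors' implementation).

import numpy as np

def gradient_orientation(filtered, threshold):
    """Edge orientation from central differences, as in Eq. (7.3).

    Returns the orientation (radians) and a mask of pixels whose
    gradient magnitude exceeds the threshold Th; only those pixels
    are passed on to the Hough voting stage.
    """
    f = np.asarray(filtered, dtype=float)
    dy = np.zeros_like(f)
    dx = np.zeros_like(f)
    dy[1:-1, :] = f[:-2, :] - f[2:, :]     # f(x, y-1) - f(x, y+1)
    dx[:, 1:-1] = f[:, :-2] - f[:, 2:]     # f(x-1, y) - f(x+1, y)
    theta = np.arctan2(dy, dx)
    mask = np.hypot(dx, dy) > threshold
    return theta, mask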
Paul Hough introduced the Hough transform in 1962.1 It is known to give
good results in the detection of straight lines and other shapes, even in the presence
of noise and occlusion.
Our vision system detects the surgical instruments using the Hough transform.
Since the instruments show a structured shape, mainly straight lines, the Hough
transform is a powerful tool to detect them. Other tracking applications that also
use the Hough transform can be found in the literature; see, for example.2
Other works within the medical imaging discipline that make use of the Hough
transform include.6–8
At this stage, the normal parameterization of the Hough transform is used to
extract the most significant straight lines in the scene:

$$\rho = x\cos\theta + y\sin\theta \qquad (7.4)$$

where ρ and θ are the length and orientation of the normal vector to the line from
the image origin. Each straight line is uniquely defined by ρ and θ, and for every
point (x, y) in the original image it is possible to create a mapping from the feature space to
the parametric space.
If we divide the parameter space into a number of discrete accumulator cells,
we can collect ’votes’ in the (ρ, θ) space from each data point in the (x, y) space.
Peaks in (ρ, θ) space will mark the equations of lines of co-linear points in the (x,
y) space.
Interested readers can find a good survey of the Hough transform in.5 A book
that makes state-of-the-art Hough transform theory and advice easily accessible
to the non-specialist is.9
For every pixel in the image, the gradient direction has been determined in
the previous stage. Thus, the computation of the distance ρ becomes a single operation.
Edge direction information made available at the edge detection stage is the most
commonly used constraint on the range of parameters to be calculated.14
The (ρ, θ) space has been implemented using a 256x256 array of accumula-
tors. All pixels, except those whose gradient magnitude is below the threshold
Th, are mapped to one point in the (ρ, θ) space. The corresponding cells in the
accumulator are incremented every time a new pixel is mapped into it. Fig. 7.2(d)
shows a 3D representation of the Hough table in the (ρ, θ) space.
Peaks in the (ρ, θ) space correspond to the presence of straight lines in the
scene. The maximum peak is selected as the longest straight line in the image.
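A minimal sketch of the orientation-constrained voting described above is given below (Python/NumPy; the accumulator size and index mapping are illustrative, not the authors' exact implementation). Because the gradient direction of each edge pixel is already known, each pixel votes for a single (ρ, θ) cell instead of a full sinusoid.

import numpy as np

def hough_from_orientations(theta, mask, n_rho=256, n_theta=256):
    """Gradient-based Hough voting with the normal parameterisation."""
    h, w = theta.shape
    rho_max = np.hypot(h, w)
    acc = np.zeros((n_rho, n_theta), dtype=np.int32)
    ys, xs = np.nonzero(mask)              # only pixels above threshold Th
    for x, y, t in zip(xs, ys, theta[ys, xs]):
        rho = x * np.cos(t) + y * np.sin(t)
        r_idx = int((rho + rho_max) * (n_rho - 1) / (2 * rho_max))
        t_idx = int((t + np.pi) * (n_theta - 1) / (2 * np.pi))
        acc[r_idx, t_idx] += 1             # one vote per edge pixel
    return acc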
Fig. 7.2. (a) Original image. (b) Filtered image. (c) Gradient orientations. (d) 3D representation of
Hough table.
$$v_k = x_k - x_{k-1} \qquad (7.5)$$
Fig. 7.3. (a) Error line profile. (b) Segment extraction. (c) Tool ending location.
$$\hat{x}_{k+1} = x_k + v_k T \qquad (7.6)$$
where T is the sample period between two processed frames. We process a frame
in 125 ms.
There must be a certain continuity between the location of the peak in the (ρ, θ)
space of the current frame and the location of the peak in the previous frame. Thus,
not all of the Hough table must be computed at every iteration; we only process
those pixels whose (ρ, θ) pair is within a window of acceptable values. Once the
next position x̂k+1 has been estimated, the processing window is centered at this
new position for the next frame. The objective of this processing window is, of
Fig. 7.4. Ten candidates, the longest straight lines in the image.
course, to increase the processing speed. The size of the processing window is
programmable. In our application it has been fixed to ±20 degrees.
The estimated position is also used to select the best target from the list of
candidates.
The position x̂k+1 estimated in the previous stage will determine which one of the
remaining candidates is the most probable. The error between the positions of all
candidates and the estimated position is then computed:

$$\Delta\theta_i = \theta_i - \theta_T, \qquad \Delta\rho_i = \rho_i - \rho_T \qquad (7.7)$$
Once these errors are computed for all candidates, the one closest to the esti-
mated position and with the highest value in the Hough transform accumulator is
selected. The function used is:
$$d_i = \frac{valHough_i}{\max(valHough)} \cdot \left(1 - \frac{|\Delta\theta_i|}{\max|\Delta\theta_i|}\right) \cdot \left(1 - \frac{|\Delta\rho_i|}{\max|\Delta\rho_i|}\right) \qquad (7.8)$$
where max(valHough) is the absolute maximum of the Hough table values.
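The candidate selection of Eq. (7.8) can be sketched as follows (Python/NumPy; the function name and the guard against degenerate cases are illustrative additions).

import numpy as np

def select_candidate(cands, hough_vals, theta_pred, rho_pred):
    """Pick the candidate line closest to the predicted (rho, theta).

    cands      : list of (rho_i, theta_i) peaks from the Hough table
    hough_vals : accumulator value of each candidate (assumed positive)
    Returns the index of the candidate with the highest score d_i.
    """
    d_theta = np.abs(np.array([t for _, t in cands]) - theta_pred)
    d_rho = np.abs(np.array([r for r, _ in cands]) - rho_pred)
    vals = np.asarray(hough_vals, dtype=float)
    d_theta_max = d_theta.max() or 1.0     # avoid division by zero
    d_rho_max = d_rho.max() or 1.0
    d = (vals / vals.max()) * (1 - d_theta / d_theta_max) * (1 - d_rho / d_rho_max)
    return int(np.argmax(d))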
7.3. Results
Our vision system has been implemented on a PC with a 1.7 GHz Pentium
III processor and a commercial image acquisition board. Processing time for the
complete process is 125 ms, which is suitable for a real-time tracking application.
The size of the processed images is 768x576 pixels.
Table 7.2. Static test: probability that the correct tool corresponds to the n-th strongest line of the Hough table.

Line (ordered by Hough value)    Probability of correct tool
1st                              77%
2nd                              11%
3rd                              4%
4th                              4%
5th                              2%
6th                              1%
7th                              1%
8th                              0%
9th                              0%
10th                             0%
Some parameters are configurable and must be tuned by the user. We show in
Table 7.1 the values assigned in our application:
Table 7.2 shows the results of an experiment designed to show the percentage
of cases in which the correct tool corresponds to the straight lines detected. A
set of 128 images extracted from a real operation video has been used for the
experiment. The experiment has been divided into two stages: the first one is a
static test, and uses only the information provided by the image itself. The second
one is a dynamic test, and it takes into account the information obtained from the
previous images of the video sequence. The static test uses only the information
given by the Hough transform. The dynamic test uses the static information plus
the position prediction detailed in Sec. 7.2.6.
For each image, the ten top values of the Hough table are selected and sorted
by decreasing order. The coordinates (ρ, θ) of these maxima in the Hough table
correspond to ten different straight lines in the image. The objective of the experi-
ment is to show when the correct surgeon’s tool straight line corresponds with the
lines detected in the Hough transform stage. The first column of Table 7.2 lists the
longest straight lines in the Hough transform table; the second gives the probabilities
that these lines correspond to the correct tool.
The dynamic test takes into account the information obtained from the previ-
ous frames in the sequence: a position prediction (described in Sec. 7.2.6) and a
target selection (described in Sec. 7.2.7) are performed in order to detect the tool.
Using dynamic information the rate of correct detections goes up to 99%. Finally,
Fig. 7.5 and Fig. 7.6 show the results obtained with other different images.
Fig. 7.5. (a) Original image. (b) Gradient orientations. (c) 3D representation of Hough table (d) Tool
ending location.
7.4. Conclusion
Fig. 7.6. (a) Original image. (b) Gradient orientations. (c) 3D representation of Hough table (d) Tool
ending location.
The system can track tools whose orientations are within a 20-degree interval
between two consecutive frames. Since eight frames are processed per second,
this means that the angular speed of the tracked tool must be below 160 degrees per
second.
We get some false detections in some problematic cases, due to:
- The tool goes out of the field of view. We get a very short straight line when
the tool progressively leaves the visible area. This problem will be solved
when the robotic arm closes the loop and the surgeon's tools always remain within
the visible area.
- Low contrast between the tool and the background. This problem is present
in areas with poor illumination. These areas are not suitable for performing any
surgical task.
- Sudden movements of the tool. The constraints shown in Table 7.1 must
be respected. In any case, it is not recommended that the robotic arm make sudden
movements in a live surgical operation. The tracking system must be inhibited
when such movements occur.
Acknowledgements
The authors thank Dr. Silvina Thevenet, from the Universidad Nacional del Cen-
tro in Tandil, for her very efficient and pleasant cooperation.
References
1. Hough P. V. C. (1962), Method and Means for Recognising Complex Patterns, (U.S.
Patent No. 3069654)
2. Ermolin and Ljustin, C. (1990) The use of image processing in tracking, Nucl.
Instrum. Methods Phys. Res., Sect. A, Vol. A 289(3), pp. 592–596.
3. Grimson, W. E. L. and Huttenlocher, D. P. (1990) On the sensitivity of the Hough
transform for object recognition, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 10, pp. 25–274.
4. Hunt, D. J., Nolte, L. W., Reibman, A. R. and Ruedger, W. H. (1990) Hough transform
and signal detection theory performance for images with additive noise, Comput.
Vision Graphics Image Process., Vol. 52, pp. 386–401.
5. Leavers, V. F. (1993) Which Hough transform?, CVGIP: Image Unders., Vol. 58(2),
pp. 250–264.
6. Nixon, M. S., Hames, T. K., Martin, P., Powell, S. and de Pavia, S. (1992) 3-D arterial
modelling using feature extraction in ultrasound images, Int. Conf. on Image
Processing and its Applications, (Maastricht, Netherlands), pp. 37–376.
7. Nixon, M. S. and Hames, T. K. (1993) New technique for 3D artery modelling by
noninvasive ultrasound, IEE Proc. I Commun. Speech and Vision, Vol. 140(1), pp.
8–94.
8. Thomas, A. D. H.,Davies, T. and Luxmoore, A. R. (1992) The Hough transform
for locating cell nuclei, IEE Colloquium on Applications of Image Processing in Mass
Health Screening, (London), pp. 81–84.
9. Leavers, V. F. (1992) Shape Detection in Computer Vision Using the Hough Trans-
form, Springer-Verlag, New (York/Berlin)
10. Hurteau, R., DeSantios, S., Begin, E. and Gagner, M. (1994) Laparoscopic Surgery
assisted by a robotic cameraman: Concept and experimental results, Proc.IEEE Int.
Conf. Robotics and Automation , (San Diego), pp.2286-2289
11. Taylor, R. H., Funda, J., Eldridge, B.,Gomory, S. and Gruben, K. (1995) A telerobotic
assistant for laparoscopic surgery, IEEE Engineering in Medicine and Biology, Vol.
14 (3) pp.279-288,
12. Casals, A., Amat, J., Prats, D., Laporte, E. (1995) Vision guided robotic system for
laparoscopic surgery, Proc.Int. Conf. Advanced Robots, pp.33-36, Catalonia,
13. Wei, G., Arbter, K., Hirzinge, G. (1997) Real-time visual servoing for laparoscopic
surgery, IEEE Engineering in Medicine and Biology, Vol. 16 (1) pp. 40–45,
14. Ballard, D. H. (1981) Generalizing the Hough transform to detect arbitrary shapes,
Pattern Recognit, Vol. 13(2), pp. 111–122,
CHAPTER 8

A FAST FRACTAL IMAGE COMPRESSION METHOD BASED ON ENTROPY
1. Introduction
With the ever increasing demand for images, sound, video sequences,
computer animations and volume visualization, data compression
remains a critical issue regarding the cost of data storage and
transmission times. While JPEG currently provides the industry standard
for still image compression, there is ongoing research in alternative
methods. Fractal image compression [1,2] is one of them. It has
generated much interest due to its promise of high compression ratios at
good decompression quality and it enjoys the advantage of very fast
decompression. Another advantage of fractal image compression is its
the useless domains will be removed from the pool achieving a more
productive domain pool. The proposed method can be extended to speed
up the hybrid fractal coders and improve their performance.
The rest of this paper is organized as follows. Section 2 briefly
describes fractal image coding and the baseline algorithm. In Section 3,
the definition of entropy and its use in the proposed method to reduce the
encoding time of fractal image compression are presented, followed by
experimental results and discussion in Section 4. The conclusions of the
present work are summarized in Section 5.
$$E(R_i, D_i) = \sum_{i=1}^{n} (s\, d_i + o - r_i)^2 \qquad (1)$$

which occurs when the partial derivatives with respect to s and o are
zero. Solving the resulting equations will give the best coefficients s and
o [5].

$$s = \frac{n \sum_{i=1}^{n} d_i r_i - \sum_{i=1}^{n} d_i \sum_{i=1}^{n} r_i}{n \sum_{i=1}^{n} d_i^2 - \left(\sum_{i=1}^{n} d_i\right)^2} \qquad (2)$$

$$o = \frac{1}{n}\left(\sum_{i=1}^{n} r_i - s \sum_{i=1}^{n} d_i\right) \qquad (3)$$
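A direct implementation of Eqs. (1)-(3) for a single range-domain pair could look like the following sketch (Python/NumPy; the function name is illustrative and the domain block is assumed to be already shrunk to the range size).

import numpy as np

def optimal_s_o(domain, rng):
    """Least-squares contrast s and brightness o (Eqs. (2) and (3))."""
    d = np.asarray(domain, dtype=float).ravel()
    r = np.asarray(rng, dtype=float).ravel()
    n = d.size
    denom = n * np.dot(d, d) - d.sum() ** 2
    s = 0.0 if denom == 0 else (n * np.dot(d, r) - d.sum() * r.sum()) / denom
    o = (r.sum() - s * d.sum()) / n
    err = np.sum((s * d + o - r) ** 2)     # collage error of Eq. (1)
    return s, o, err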
3.1. Entropy
encoding time of this full search problem is to decrease the size of the
domain pool in order to decrease the number of domains to be searched.
The proposed method reduces the encoding time of fractal image
compression by performing less searches as opposed to doing a faster
search, by excluding many of domain blocks from the domain pool. This
idea is based on the observation that many domains are never used in a
typical fractal encoding, and only a fraction of this large domain pool is
actually used in the fractal coding. The collection of used domains is
localized in regions with high degree of structure [17]. Figure (1) shows
the domain blocks of size 8x8 that are actually used in the fractal code of
Lena image. As expected the indicated domains are located mostly along
edges and in the regions of high contrast of the image.
Analyzing the domain pool, there is a very large set of domain blocks
in the pool with high entropy, which are not used in the fractal code.
Thus, it is possible to reduce the search time by discarding a large
fraction of high entropy blocks, which affect only a few ranges. For these
ranges, sub-optimal domains with smaller entropy may be found. In this
way, the domain pool is constructed from the blocks with the lowest entropy
instead of from all domains. In this case, the encoding time is heavily reduced
Figure 1. Domains of size 8x8 that are used for fractal coding of 512x512 Lena are
shown in black.
by a priori discarding those domains from the pool, which are unlikely to
be chosen for the fractal coding. Eq. (6) is used to calculate the entropy
value for each domain block. According to this value a decision is taken
to determine if this domain can become a part of the domain pool or not.
A parameter ε will control the domain entropy value in the
implementation, with ε being a quality parameter since it determines
the size of the domain pool. The proposed method can only reduce the
factor of proportionality in the O(N) complexity, where N is the domain
pool size. But one can use the Tree approach [21] on the resulting
efficient domain pool after removing all useless domain blocks, which is
able to fundamentally reduce the order of encoding time from O(N) to
O(log N).
The baseline algorithm mentioned above is modified in such a way
that the domain pool Ω contains only domain blocks which have a
certain entropy value. The main steps of the modified encoder algorithm
of fractal image compression can be summarized as follows:
If (min_error > A_C)
    Set R_i uncovered and partition it into 4 smaller blocks;
Else
    Save_coefficients(best_domain, s, o);
}
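A sketch of the entropy-based domain pool construction is given below (Python/NumPy). The exact thresholding rule relating ε to the retained entropy values is not spelled out above, so the condition used here (keep a block when its entropy is at most H_max − ε, with H_max = 8 bits, so that ε = 0 corresponds to the full search) is an assumption, as are the function names.

import numpy as np

def block_entropy(block, levels=256):
    """Shannon entropy (bits) of the gray-level histogram of a block."""
    hist, _ = np.histogram(block, bins=levels, range=(0, levels))
    p = hist[hist > 0] / block.size
    return -np.sum(p * np.log2(p))

def build_domain_pool(image, domain_size=16, step=4, eps=2.5, h_max=8.0):
    """Keep only low-entropy domain blocks (thresholding rule assumed)."""
    pool = []
    h, w = image.shape
    for y in range(0, h - domain_size + 1, step):
        for x in range(0, w - domain_size + 1, step):
            block = image[y:y + domain_size, x:x + domain_size]
            if block_entropy(block) <= h_max - eps:
                pool.append((y, x))        # retained domain position
    return pool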
4. Experimental Results
This section presents experimental results showing the efficiency of the
proposed method. The performance tests were carried out on a diverse set of
well-known 512x512 gray-level images with 8 bpp, on a PC with an
Intel Pentium III 750 MHz CPU and 128 MB of memory under the Windows 98
operating system, using the Visual C++ 6.0 programming language; the
time is measured in seconds. Moreover, the scaling coefficient (contrast)
is restricted to values between 0 and 1 in order to avoid searching the domain
pool twice (i.e. only positive scaling factors are allowed in the gray level
transformation). To ensure a compact encoding of the affine trans-
formation, the values of contrast and brightness are quantized using 4 and
6 bits, respectively; hence the compression ratio is 95% and 89% for the
fixed range size and quadtree partitions, respectively. This study focuses on the
implementation issues and presents the first empirical experiments analyzing the
performance benefits of the entropy approach to fractal image compression. First, the
performance of the proposed method with the fixed range size partition is
examined. The size of the range block is set to 8x8 pixels, and hence
the domain size is 16x16, with overlapping domains, i.e. the domain step
L (distance between two consecutive domains) is divided by 4. The result
is shown in table (1). Second, the same experiment is carried out with the
well-known technique of quadtree partitioning, allowing up to three
quadtree levels. The average tolerated error between the original image
and its uncompressed version is set to A_C = 2.0. The results are shown
in table (2).
Table (2). Encoding time (seconds) and PSNR (dB) versus ε for the test images (quadtree partitioning).

        Lena              Peppers           Boat              Hill
ε       Time     PSNR     Time     PSNR     Time      PSNR    Time      PSNR
0       797.78   40.66    749.63   40.50    1151.26   34.23   1304.45   39.14
1       760.91   40.65    745.91   40.51    1144.31   34.12   1323.31   39.12
1.2     753.86   40.65    743.14   40.52    1201.96   34.27   1313.14   39.12
1.5     712.72   40.64    746.93   40.51    1180.62   34.23   1318.08   39.09
1.8     647.72   40.59    736.98   40.50    981.00    34.2    1192.09   39.09
2       601.98   40.56    629.49   40.49    880.75    34.27   1062.43   39.05
2.5     489.06   40.48    553.58   40.45    632.66    34.28   677.08    38.96
2.8     417.90   40.39    477.20   40.49    541.81    34.17   442.98    38.85
3       367.91   40.36    398.01   40.43    494.21    34.11   317.13    38.75
3.5     246.4    40.12    236.96   40.32    408.76    33.79   121.95    38.69
3.8     174.00   39.88    127.97   39.98    327.58    33.71   80.03     38.63
4       120.18   39.83    64.09    39.71    250.91    33.53   65.82     38.51
The results in tables (1) and (2) show that the encoding time scales
linearly with ε. This is expected since the major computation effort in
the encoding lies in the linear search through the domain pool. For the
case without domain pool reduction, ε = 0 (full search), there are no savings
in the encoding time, as shown in Fig. (2). Also, in the case of the fixed range
size partition, the loss in quality of the encoding in terms of fidelity is
larger than for the quadtree partition. This is caused by the fact that some
larger ranges can be covered well by some domains which are removed
from the domain pool at larger values of ε (e.g. ε ≥ 2.5). As a
consequence, some of these ranges are subdivided and their quadrants
may be covered better by smaller domains than the larger range.
This simple entropy approach leads to very significant savings in
encoding time and is similar to the approach used in [5]. With the fixed
range size partition, it causes only negligible or no loss in the quality of the
image, thereby reducing the encoding time by a factor of 2 (at ε = 2.5). In the
quadtree case, when ε = 3.8 the encoding time of the Hill image is 80.03 sec
while the PSNR is 38.63 dB. For comparison, the baseline (full search)
required 1304.45 sec and the PSNR achieved is 39.14 dB. This
represents a speed-up factor of over 16 at the expense of a slight drop in
PSNR of 0.51 dB. Generally, the speed-up in terms of actual encoding
time is almost 7 times while the loss in quality of the image is almost
0.83 dB. This compares well with Saupe’s Lean Domain Pool Method,
Figure 2. Encoding time versus epsilon ε for 512x512 Lena image.
Figure 5. Peppers 512x512 image encoded in 2.93s by the proposed method and the
PSNR of the reconstructed image is 33.56dB.
5. Conclusions
In this paper a parameterized and non-adaptive version of domain pool
reduction is proposed, allowing an adjustable number of domains to
be excluded from the domain pool based on the entropy value of the
domain block, which in turn reduces the encoding time. Experimental
results on standard images showed that removing domains with high
entropy from the domain pool has little effect on the image quality
while significantly reducing the encoding time. The proposed method is
highly comparable to other acceleration techniques. The next step in our
research is to use the proposed method to improve the speed of hybrid
coders (which give better results than JPEG) that are based on fractal coders
and transform coders, so as to improve their performance.
References
1. M. Barnsley and A. Jacquin. Applications of Recurrent Iterated Function Systems to
Images. SPIE Visual Communications and Image Processing, Vol. 1001, pp.
122-131, (1988).
2. A. E. Jacquin. Image Coding Based on a Fractal Theory of Iterated Contractive
Image Transform. IEEE trans. on Image Processing, Vol. 1, pp. 18-30, (1992).
3. M. Barnsley and L. Hurd. Fractal Image Compression. on Image Processing:
Mathematical Methods and Applications. Clarendon Press, Oxford, (1997).
4. B. Ramanurthi and A. Gersho. Classified Vector Quantization of Image. IEEE
Trans. Communication, COM-34, Vol. 11, pp. 1105-1115, (1986).
5. Y. Fisher. Fractal Image Compression: Theory and Applications. Springer-Verlag,
New York, (1994).
6. R. Hamzaoui. Codebook Clustering by Self-Organizing Maps for Fractal Image
Compression. In NATO ASI Conf. Fractal Image Encoding and Analysis,
Trondheim, July 1995. Fractals, Vol. 5, Supplementary issue, April (1997).
7. R. Hamzaoui and D. Saupe. Combining Fractal Image Compression and Vector
Quantization. IEEE Trans. on Image Processing, 9(2), pp.197-208, (2000).
8. B. Hurtgen and C. Stiller. Fast Hierarchical Codebook Search for Fractal Coding of
Still Images. Proceeding of SPIE, Vol. 1977, pp. 397-408, (1993).
9. L.M. Po and C.K. Chan. Adaptive Dimensionality Reduction Techniques for Tree-
Structured Vector Quantization. IEEE Trans. on Communications, Vol. 42, No. 6,
pp. 2246-2257, (1994).
10. G. Caso, P. Obrador and C.-C. Kuo. Fast Methods for Fractal Image Encoding.
Proc. SPIE Visual Communication and Image Processing, Vol. 2501, (1995).
11. B. Bani-Eqbal. Enhancing the Speed of Fractal Image Compression. Optical
Engineering, Vol. 34, No. 6, pp.1705-1710, (1995).
12. D. Saupe. Accelerating Fractal Image Compression by Multi-dimensional Nearest
Neighbor Search. In Proc. Data compression Conference, March 28-30, (1995).
13. D. Saupe. Fractal Image Compression via Nearest Neighbor Searching. In Conf.
Proc. NATO ASI, Fractal Image Coding and Analysis, Trondheim, July (1995).
14. C.S. Tong and W. Man. Adaptive Approximation Nearest Neighbor Search for
Fractal Image Compression. IEEE Trans. on Image Processing, 11 (6), (2002).
15. E.W. Jacobs, Y. Fisher, and R.D. Boss. Image Compression: A study of the Iterated
Transform Method. Signal Process, Vol. 29, pp. 251-263, (1992).
16. D.M. Monro and F. Dudbridge. Approximation of Image Blocks. In Proc. Int.
Conf. Acoustics, Speed, Signal Processing, Vol. 3, pp.4585-4588, (1992).
17. D. Saupe. Lean Domain Pools for Fractal Image Compression. Proceedings
IS&T/SPIE 1996 Symposium on Electronic Imaging: Science & Technology Still
Image Compression II, Vol. 2669, (1996).
18. M. Polvere and M. Nappi. Speed-Up in Fractal Image Coding: Comparison of
Methods. IEEE Trans. on Image Processing, Vol. 9, pp.1002-1009, (2000).
19. B. Wohlberg and G. Jager. A review of the Fractal Image Compression Literature.
IEEE Trans. on Image Processing, Vol. 8, No. 12, pp. 1716-1729, (1999).
20. D. Saupe and R. Hamzaoui. Complexity Reduction Methods for Fractal Image
Compression. In I.M.A. Conf. Proc. on Image Processing; Mathematical methods
and applications, Sept., J.M. Blackedge (ed.), (1994).
21. X. Gharavi-Alkhansari, and T.S.Huang. Fractal Image Coding Using Rate-
Distortion Optimized matching Pursuit. Proc. SPIE, pp. 265-304, (1996).
22. C.S. Tong, and M. Pi. Fast Fractal Encoding Based on Adaptive Search. IEEE
Trans. on Image Processing, Vol .10, No.9, pp.1269-1277, (2001).
CHAPTER 9
1. Introduction
Digital watermarking[1,2], the art of hiding information into multimedia
data in a robust and invisible manner, has gained great interest over the
past few years. There has been a lot of interest in the digital
watermarking research, mostly due to the fact that digital watermarking
might be used as a tool to protect the copyright of multimedia data. A
digital watermark is an imperceptible signal embedded directly into the
media content, and it can be detected from the host media for some
applications. The insertion and detection of digital watermarks can help
to identify the source or ownership of the media, the legitimacy of its
usage, the type of the content or other accessory information in various
applications. Specific operations related to the status of the watermark
can then be applied to cope with different situations.
A majority of the watermarking algorithms proposed in the literature
operate on a principle analogous to spread-spectrum communications. A
pseudo-random sequence, which is called digital watermark, is inserted
into the image. During extraction, the same pseudo-random sequence is
correlated with the estimated pattern extracted from the image. The
watermark is said to be present if the computed correlation exceeds a
chosen threshold value. Among this general class of watermarking
schemes, there are several variations that include choice of specific
domain for watermark insertion, e.g. spatial, DCT, wavelet, etc; and
enhancements of the basic scheme to improve robustness and reduce
visible artifacts. The computed correlation depends on the alignment of
the pattern regenerated and the one extracted from the image. Thus
proper synchronization of the two patterns is critical for the watermark
detection process. Typically, this synchronization is provided by the
inherent geometry of the image, where pseudo-random sequences are
assumed to be placed on the same image geometry. When a geometric
manipulation is applied to the watermarked image, the underlying
geometry is distorted, which often results in the de-synchronization and
failure of the watermark detection process. The geometric manipulations
can range from simple scaling and rotation or cropping to more
complicated random geometric distortions as applied by Stirmark[3].
Different methods have been proposed in literature to reduce/prevent
algorithm failure modes in case of geometric manipulations. For non-
blind watermarking schemes, where the original image is available at the
detector, the watermarked image may be registered against the original
image to provide proper synchronization[4]. For blind watermarking
schemes, where the original image is not available at the detector,
proposed methods include use of the Fourier-Melin transform space that
provides rotation, translation, scale invariance[5], and watermarking
using geometric invariants of the image such as moments[6] or cross-
ratios[7]. Hartung et al[8] have also proposed a scheme that divides the
image into small blocks and performs correlation for rotations and
translations using small increments, in an attempt to detect the proper
synchronization.
In this paper, the orthogonal projective sequence of a digital image is
analyzed. A blind image watermark detector is designed using the
orthogonal projective sequence of the digital image. In Section 2, we first
discuss the definition and properties of the orthogonal projective
sequence of a digital image. We conclude that the orthogonal projection
sequence of a digital image is in one-to-one correspondence with the
digital image. Based on this conclusion, we design a blind
watermark detector. Then, in Section 3, we present our experimental
results. They show that this watermark detector is not only
strongly resistant to translation and rotation attacks, but
also robust to Gaussian noise. Finally, Section 4
contains our conclusions.
$$M_{pq} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^p y^q\, I(x, y)\, dx\, dy \qquad (1)$$
then there is a unique function I(x, y) such that I(x, y) satisfies equation (2).

Proof. Let $I(x, y) = \sum_{j=1}^{\infty} \alpha_j g_j(x, y)$. By the uniform convergence of the
function series $\sum_{j=1}^{\infty} \alpha_j g_j(x, y)$, I(x, y) exists, and

$$\iint_A I(x, y)\, g_i(x, y)\, dx\, dy = \iint_A g_i(x, y)\Big\{\sum_{j=1}^{\infty} \alpha_j g_j(x, y)\Big\}\, dx\, dy, \quad i = 1, 2, \ldots$$

Exchanging the order of integration and summation, and using the normal
orthogonality of the function system $\{g_i(x, y)\}_{i=1}^{\infty}$, we obtain

$$\iint_A I(x, y)\, g_i(x, y)\, dx\, dy = \alpha_i, \quad i = 1, 2, \ldots$$

$$\iint_A g_i(x, y)\, \{I_1(x, y) - I_2(x, y)\}\, dx\, dy = 0, \quad i = 1, 2, \ldots$$
$$c = \frac{\sum_{i=1}^{N} w_i \gamma_i}{\sqrt{\sum_{i=1}^{N} w_i^2\; \sum_{i=1}^{N} \gamma_i^2}} \qquad (4)$$
To detect a watermark in a possibly watermarked image $\tilde{J}(x, y)$, we
calculate the correlation between the image $\tilde{J}(x, y)$ and W(x, y). In
general, W(x, y) patterns generated using different keys have very low correlation
with each other. Therefore, during the detection process the correlation
value will be very high for a W(x, y) generated with the correct key and
very low otherwise. During the detection process, it is common
to set a threshold ρ to decide whether the watermark is detected or not.
If the correlation exceeds the threshold ρ, the watermark detector
determines that the image $\tilde{J}(x, y)$ contains the watermark W(x, y).
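A minimal sketch of this thresholded detector, following Eq. (4), is given below (Python/NumPy; the function name and default threshold are illustrative).

import numpy as np

def detect_watermark(gamma, w, rho=0.1):
    """Normalised correlation detector, Eq. (4).

    gamma : sequence extracted from the possibly watermarked image
    w     : pseudo-random watermark sequence generated with the key
    Returns the correlation c and the binary decision c > rho.
    """
    w = np.asarray(w, dtype=float)
    g = np.asarray(gamma, dtype=float)
    c = np.dot(w, g) / np.sqrt(np.dot(w, w) * np.dot(g, g))
    return c, c > rho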
Although the Fourier transformation[10] has many advantages for
image signal processing, its operation speed is limited because the real and
imaginary parts must be calculated separately. We know that the Walsh function
system[13] is a complete normal orthogonal basis; therefore, it can
serve as the basis when the orthogonal projection sequence of a digital image is
calculated. In addition, each Walsh function value is always 1 or -1, and
it is easy to obtain the kernel matrix, so the calculation is simple and the
operation speed can be increased.
According to the arrangement order, the Walsh functions can be generated
by three methods. In this paper, the Walsh functions are generated using the
Hadamard matrix.
From the one-dimensional Walsh function systems, the two-dimensional
Walsh function systems can be generated according to the following
arrangement order:
projection matrix of this digital image. The projection matrix has the
same size as the digital image. Of course, if the digital image is
large, we cannot use too many two-dimensional Walsh functions.
How many two-dimensional Walsh functions are used should be
decided according to the actual situation.
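The following sketch (Python/NumPy) illustrates how Walsh (Hadamard-ordered) functions can be generated from the Hadamard matrix and used to compute projection coefficients of a square image block; the ordering and the number k of retained coefficients are illustrative assumptions, not the authors' exact construction.

import numpy as np

def hadamard(n):
    """Hadamard matrix of order n (n a power of two), entries +/-1."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def projection_sequence(image, k=64):
    """Orthogonal projection coefficients of a square image block.

    Each 2D basis function is the outer product of two rows of the
    (orthonormalised) Hadamard matrix; all projections are computed
    at once and the first k coefficients are returned.  Assumes a
    square image whose side is a power of two.
    """
    n = image.shape[0]
    h = hadamard(n) / np.sqrt(n)          # orthonormal rows
    coeffs = h @ np.asarray(image, dtype=float) @ h.T
    return coeffs.ravel()[:k]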
From Fig. 4(b) and (c) we know that Detector 1 cannot detect W500
correctly, while Detector 2 generates a high peak value at position 500.
This shows that Detector 2 is not sensitive to noise and has a
very strong anti-noise ability. This is because the projection character-
From Fig. 5(b) and (c) we know that Detector 1 cannot detect W500
correctly, while Detector 2 generates a high peak value at position 500.
This shows that Detector 2 is not sensitive to rotation and has a
very strong anti-rotation ability. This is because the projection character-
istic of a digital image is an internal characteristic of the image, and its
existence does not depend on the pixel positions. Therefore Detector 2 is not
sensitive to rotation attacks.
From Fig. 6(b) and (c) we know that Detector 1 cannot detect W500
correctly, while Detector 2 generates a high peak value at position 500.
This shows that Detector 2 is not sensitive to translation and has
a very strong anti-translation ability. The reason is the same as for
Detector 2's strong anti-rotation ability.
For the two detectors, other attacks, such as filtering, JPEG compression,
etc., are also tested. From these experiments we know that, under these attacks, the two
detectors cannot detect W500 correctly. This shows that the performance of
Detector 2 is not superior to Detector 1's in terms of
resisting filtering, JPEG compression, etc.
4. Conclusion
In this paper, blind watermark detection is realized partly by means of
the orthogonal projection sequence of a digital image. By experiment we find
that the blind watermark detector, whose normalized correlation value is
calculated from the orthogonal projection sequence of the digital image, has
good robustness to Gaussian noise, rotation, and translation
attacks. This points out a new way of designing better blind watermark
detectors.
References
1. M.D. Swanson, M. Kobayashi, and A.H. Tewfik. Multimedia data-embedding and
watermarking technologies, Proceedings of the IEEE, p.1064-1087, 86 (1998).
2. F. Hartung, and M. Kutter. Multimedia watermarking techniques, Proceedings of
the IEEE, p.1079-1107, 87 (1999).
3. Stirmark Package, http://www.cl.com.uk/~fapp2/watermarking/stirmark
4. Q. Sun, J. Wu, R. Deng. Recovering modified watermarked image with reference to
original image, In Proceedings of SPIE: Security and Watermarking of Multimedia
Contents, 3657 (1999).
5. J.J.K.O Ruanaidh and T. Pun. Rotation, scale and translation invariant spread
spectrum digital image watermarking, Signal Processing, p. 303-317, 66(1998).
6. M. Alghoniemy, A. Tewfik. Image watermarking by moment invariants, In
Proceedings of ICIP, (2000).
7. R. Caldelli, M. Barni, F. Bartolini, A. Piva. Geometric-invariant robust
watermarking through constellation matching in the frequency domain, In
Proceedings of ICIP, (2000).
8. J.S.F. Hartung, B. Girod. Spread spectrum watermarking: Malicious attacks and
counter-attacks, In Proceedings of SPIE: Security and Watermarking of Multimedia
Contents, 3657(1999).
9. M. K. Hu. Visual Pattern Recognition by Moment Invariants. IRE Trans.
Information Theory, IT-8, p. 179-187, (1962).
10. Ron N. Bracewell. The Fourier Transform and Its Applications, New York:
McGraw-Hill Book Company, (1965).
11. I. J. Cox, J. Killian, F. Thomson, and T. Shamoon. Secure Spread Spectrum
Watermarking for Multimedia. IEEE Transactions on Image Processing, p. 1673-
1687, 6 (1997).
12. P. Moulin and E. Delp. A mathematical approach to watermarking and data hiding.
In Proc. IEEE ICIP, Thessaloniki, Greece, (2001).
13. Tzafestas, S.G. Walsh Functions in Signal and Systems Analysis and Design. New
York: Van Nostrand Reinhold, (1985).
CHAPTER 10
We have recently developed in our lab a text recognizer for on-line texts writ-
ten on a touch-terminal. We present in this paper several strategies to adapt this
recognizer in a self-supervised way to a given writer and compare them to the
supervised adaptation scheme. The baseline system is based on the activation-
verification cognitive model. We have designed this recognizer to be writer-
independent but it may be adapted to be writer-dependent in order to increase
the recognition speed and rate. The classification expert can be iteratively mod-
ified in order to learn the particularities of a writer. The best self-supervised
adaptation strategy is called prototype dynamic management and gets good re-
sults, close to those of the supervised methods. The combination of supervised
and self-supervised strategies increases accuracy again. Results, presented on a
large database of 90 texts (5,400 words) written by 38 different writers are very
encouraging with an error rate lower than 10 %.
10.1. Introduction
Recently, handheld devices like PDAs, mobile phones, e-books or tablet PCs have
become very popular. In contrast to classical personal computers, they are
small, keyboard-less and mouse-less. Therefore, the electronic pen is very attrac-
tive as a pointing and handwriting device. Such a device is at the frontier of two
research fields: man-machine interfaces and handwriting recognition.
In this paper, we focus on the problem of handwriting recognition for hand-
held devices with a large screen on which we can write texts. For such an applica-
tion, the recognition rate should be very high, otherwise it will discourage most
potential users. With the latest handwriting recognizers on the market (Microsoft
Windows XP Tablet Edition, Apple Ink, myScript, . . . ) the recognition rate has
become acceptable but is not high enough. The major problem for these recog-
nizers is the vast variation in personal writing style. Updating the parameters of a
writer-independent recognizer to transform it into a writer-dependent recognizer
with a higher accuracy can solve this difficulty. The systems listed above are not
able to adapt themselves to a given writer. We can get better recognition rates if
we adapt a writer-independent recognizer with an adequate architecture and trans-
form it quickly into a writer-dependent system. However, it should not be forgotten
that the use of a pen as input modality has to be user-friendly. So, the training step
must be as short as possible or - better - totally hidden from the user.
Traditional adaptation techniques require the writer's intervention (the so-called
supervised adaptation). We propose in this article several self-supervised adap-
tation schemes that we compare to the already existing techniques like supervised
adaptation.
The article is organized as follows. In section 2, we present a review of the
various techniques of adaptation. In section 3, we describe the writer-independent
baseline system. In section 4, we describe the different adaptation strategies. In
section 5, we present a combination between self-supervised and supervised meth-
ods to achieve very good results. Finally, conclusions and prospects are given in
section 6.
The idea of writer adaptation was revealed by research in the field of perceptual
psychology. It has been shown that, in the case of a hardly readable writer, it is
easier to read a word if we have already read other words written by the same
person. This phenomenon is called the graphemic priming effect.1 Thus, we learn
the user writing characteristics from the words we can read, and then, we use this
new knowledge to read the remaining words.
In the literature, we consider two adaptation strategies: systems where the
adaptation step takes place once first before use (called off-line) and systems with
continuous adaptation (on-line).
Most systems2–5 using an off-line adaptation scheme need a labeled database
of the writer. These examples are used to perform a supervised training of the system.
Thus, the system learns the characteristics of this particular writer before being
used.
On the other hand, the following systems evolve continuously during use.
The on-line handwriting recognition and adaptation system of6 uses a super-
vised incremental adaptation strategy. The baseline system uses a single MLP
with 72 outputs (62 letters and 10 punctuation marks). An adaptation module, at
the output of the MLP modifies its output vector. This adaptation module is a RBF
(Radial Basis Function) network. The user informs the system of the classifica-
tion error, giving the letter label, and the RBF is re-trained (modification of the
existing kernels or addition of a new one).
Two other systems use a TDNN (Time Delay Neural Network) as classifier
instead of the MLP. This TDNN is trained on an omni-writer database and the
output layer of this network is replaced either by a k-nn classifier in7 or by a
discriminating classifier in.8 During the adaptation step, the TDNN is fixed and
the output classifier is trained, in order to learn mis-recognized characters.
The system described in9 is very close to our system but is dedicated to
isolated alphanumeric character recognition. The k-nn classifier uses the Dy-
namic Time Warping algorithm to compare the unknown characters to a prototype
database. The writer adaptation consists in adding the mis-classified characters in
this database. Moreover, useless prototypes can be removed from the database to
avoid an excessive growth of this latter.
There are also a lot of works on adaptation in off-line character recognition
and other pattern recognition fields including speech recognition.10 For example,
in,11 the authors adapt the Hidden Markov Models (HMM) first trained on a large
database with a small database of the particular writer.
Based on the results of all these studies, we can notice that model-based clas-
sifiers (MBC) like k-nn have a better ability to learn particular patterns than machine
learning classifiers (MLC) like HMM, MLP or GMM (Gaussian Mixture Model).
MBC need very few samples to learn a new pattern (sometimes one sample is
enough) and, as this learning consists in adding the new sample to the classifier
database, they are not time consuming. But the database size tends to increase sig-
nificantly, so the classification time and the memory needed increase linearly with
this size. On the other hand, MLC need more samples and are time consuming to
re-estimate their parameters. But after the training, the size and the classification
time remain the same.
independent system.12
We use for adaptation a lexicon containing the 8,000 most frequent words of
the French language. Our system is also able to handle very large lexicons (some
200,000 words) as shown in the following. The complete analysis speed is about 6
words per second (P4 1.8 GHz, Matlab) and a small amount of memory is required
(about 500 KB including the system program, the 8K lexicon and the database).
[Block diagram of the recognition system: blank detector, shape extractor, classifier and lexicon.]
of a given class and try to optimize the within-class variance by combining two
stages: a sub-optimal unsupervised search for prototypes followed by an adap-
tation stage using vector quantization. After clustering, the prototype database
contains some 3,000 stroke prototypes for the 62 classes (26 upper-case letters,
26 lower-case letters and 10 digits). Each sample represents a given character al-
lograph (for single-stroke characters) or a part of the allograph (for multi-stroke
characters). An allograph is a specific handwriting feature. It includes on the
one hand characters with the same static representation (i.e. the same image) but
written with variable dynamics (number of strokes, senses, direction ...) and on
the other hand, the different handwriting model for a given character : cursive,
hand-printed, mixed ... When an unknown character has to be classified, it is
first divided into strokes. Then, each stroke is compared with a prototype subset
producing a distance vector. The distance of the unknown data to each charac-
ter class is the sum of all the distance vectors (over the number of strokes). The
nearest-neighbor criterion is then applied to find the winning class.
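A minimal sketch of this prototype-based classification is given below (Python/NumPy); the plain DTW distance and the per-stroke minimum over prototypes are illustrative simplifications of the scheme described above, not the authors' implementation.

import numpy as np

def dtw(a, b):
    """Dynamic Time Warping distance between two stroke point sequences."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

def classify_character(strokes, prototypes):
    """Nearest-class decision over per-stroke prototype distances.

    prototypes: dict mapping class label -> list of stroke prototypes.
    The class distance is the sum, over the strokes of the unknown
    character, of the best prototype distance for that class.
    """
    scores = {}
    for label, protos in prototypes.items():
        scores[label] = sum(min(dtw(s, p) for p in protos) for s in strokes)
    return min(scores, key=scores.get)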
All these experts provide probabilistic information at the stroke level. For
each expert, we also compute a confusion matrix on the training data, in order
to evaluate prior probabilities. We use the Bayesian rule to re-estimate posterior
probabilities by combining this latter with prior knowledge. The segmentation
probabilities are used to construct the smallest and most relevant segmentation
tree of a line of text. The classifier probabilities are used to activate a list of
hypothetical words in the lexicon for each segmentation in the tree. A probabilistic
engine that combines all the available probabilities evaluates the likelihood of each
hypothetic word in this list. We call this information the probability of lexical
reliability (PLR). We used dynamic programming in the segmentation tree where
each node has a PLR in order to get the best re-transcription of the line.
We evaluate this lexicon-driven recognizer on different lexicon sizes on the
whole text database used for adaptation (figure 10.2, graph Omni). We also add
some allographs from the text database into the classifier prototype database to
turn the system into a multi-writer recognizer (figure 10.2, graph Multi). Even if
the recognition rate is not so high, we can notice the very good ability to handle
very large lexicons. We lose less than 5 % of the recognition rate when we use a
187,000-word lexicon compared with a 400-word lexicon (467.5 times smaller).
Finally, we achieve a word error rate of 25 % in a writer-independent frame with
an 8,000-word lexicon.
Model-based classifiers can be adapted very easily and quickly to new writ-
ing styles, just by storing new character samples in the writer-dependent (WD)
database (when the latter are missing) and, if needed, by inactivating existing proto-
types (when they are confusing). The system's specialization on a given user – by
registration of his personal features – makes it writer-dependent and increases its
accuracy. Comparing the classification hypothesis with either the labeled data
(supervised adaptation) or the lexical hypothesis (self-supervised adaptation) de-
tects classification errors. The misclassified characters can be stored in the writer-
dependent (WD) database, using the lexical hypothesis as a label.
As we know the labels and the text segmentation (this is not realistic, just an in-
teresting case study), we achieve an impressive word recognition rate of 99 % that
proves the necessity of applying adaptation strategies to recognition systems. The
WDDBS column shows the number of prototypes added to the WD database relative to
the WI database size. The line approach allows a faster improvement of the recog-
nition rate and adds fewer prototypes to the user database than the text approach.
When we add characters after a full text analysis, we can add several similar pro-
totypes (and the average number of added prototypes increases). On the other
hand, the line approach adds the first prototype of a mis-recognized character.
Thanks to this new sample, the following similar characters are correctly classi-
Supervised adaptation results: word error rate (WER) after 50, 100 and 150 adaptation words, and writer-dependent database size (WDDBS) relative to the WI database.

                             WER                         WDDBS
Words                 50       100      150
Baseline system       5 %                                100 %
Text appr.:  min      0 %      0 %      0 %              +3 %
             mean     1.3 %    1.1 %    0.6 %            +6 %
             max      10 %     5.1 %    4.5 %            +9 %
Line appr.:  min      0 %      0 %      0 %              +2 %
             mean     1.1 %    0.7 %    0.4 %            +4 %
             max      6.2 %    5.2 %    3.7 %            +8 %
fied, so they do not need to be stored in the prototype database. So, the number
of added prototypes is smaller in the line approach than in the text approach, and
we select the line approach for the following work. Due to the architecture of the
recognition system, it is not possible to study a word approach, where the adaptation
would be performed after each analyzed word. It seems logical to think that a word
approach should perform better than the line approach, but the difference should
not be enough to completely change the results obtained with the line approach.
From a perceptive point of view, the prototype storing imitates – at the letter
level – the priming repetition effect noticed at the word level: the initial pre-
sentation of a word reduces the amount of information necessary for its future
identification, and this identification is performed faster. Nevertheless, activating
WD prototypes is not sufficient to perform perfect classification, even with a great
amount of labeled data. Some added characters will generate mis-classification
and new errors will appear. It seems necessary to inactivate – or even delete –
some WI prototypes.
Fig. 10.4. Self-Supervised adaptation method. Addition of prototypes in the user database.
Self-supervised adaptation results: word error rate (WER) after 50, 100 and 150 adaptation words, and writer-dependent database size (WDDBS) relative to the WI database.

                             WER                         WDDBS
Words                 50       100      150
Baseline system       25 %                               100 %
SA strategy: min      0 %      1.9 %    2 %              +2 %
             mean     25 %     23 %     23 %             +6 %
             max      53 %     73 %     51 %             +14 %
CA strategy: min      0 %      0 %      2 %              +1 %
             mean     22 %     20 %     17 %             +2 %
             max      71 %     58 %     43 %             +3 %
This method has two goals. As seen previously, using the lexical hypothesis as a ref-
erence may add confusing or erroneous prototypes, even when conditional acti-
vation is applied. Dynamic management is used to recover from those prototypes
that contribute more often to incorrect than to correct classifications. Inactivation
methods are also used to prune the prototype set and speed up the classification.9
Each prototype (of the WI database as well as of the WD database) has an initial ade-
quacy (Q0 = 1000). This adequacy is modified during the recognition of the
text according to the usefulness of the prototype in the classification process, by
comparing the classification hypothesis and the lexical hypothesis. For a
prototype i of class j, three parameters are necessary for the dynamic
management:
The three parameters act differently. The U parameter is used to reduce the adequacy of prototypes that are useless for a given writer. As the baseline recognizer is writer-independent, it needs many prototypes (on average 40 prototypes per class) to model a character class, but only a few of them will be useful for a given user. This parameter eliminates prototypes that have not been used for a long time; the value of U defines this lifetime. The M parameter is used to strongly penalize erroneous prototypes. Its value must be larger than the value of U because erroneous prototypes are much more troublesome than useless ones. By preserving only these two parameters, all the prototypes
                      WER                              WDDBS
                      8k words        187k words
Baseline system       25 %            28 %             100 %
DM strategy           17 %            21 %             -80 %
Fig. 10.5. Writer with the best recognition rate (99 %) and writer with the worst rate (70 %).
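To make the mechanism concrete, here is a minimal Python sketch of such adequacy-based prototype management. Only the initial adequacy Q0 = 1000 and the roles of the U (useless) and M (erroneous) parameters come from the text; the reward value, the penalty amounts and the inactivation threshold are illustrative assumptions, since the original parameter list is not reproduced above.

Q0 = 1000           # initial adequacy of every prototype (from the text)
U_PENALTY = 10      # slow decay applied to unused prototypes (assumed value)
M_PENALTY = 200     # strong penalty for an erroneous prototype (assumed value)
REWARD = 50         # bonus when a prototype supports a correct decision (assumed value)
INACTIVE_BELOW = 0  # prototypes falling below this adequacy are inactivated (assumed)

class Prototype:
    def __init__(self, label, features, writer_dependent=False):
        self.label = label
        self.features = features
        self.writer_dependent = writer_dependent   # WI or WD prototype
        self.adequacy = Q0
        self.active = True

def update_adequacy(prototypes, used_prototype, classification_ok):
    """Update adequacies after one character has been classified and compared
    against the lexical hypothesis."""
    for p in prototypes:
        if not p.active:
            continue
        if p is used_prototype:
            # reward useful prototypes, strongly penalize erroneous ones (M parameter)
            p.adequacy += REWARD if classification_ok else -M_PENALTY
        else:
            # unused prototypes slowly lose adequacy (U parameter)
            p.adequacy -= U_PENALTY
        if p.adequacy < INACTIVE_BELOW:
            # inactivation prunes the set and speeds up classification
            p.active = False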
Now, let us focus on the evolution of the adequacy of some prototypes (figure 10.6). For some writers, the WI prototypes are sufficient. For the class 'a', 2 prototypes are used and thus the adequacy of the 45 others decreases. For the class 's', 4 prototypes are useful (the writer probably has unstable handwriting, see figure 10.5) and the 36 others are inactivated. For another writer (classes 's' and 'e'), WD prototypes (in bold) are necessary. For the class 's', at the beginning, a WI prototype is used and after some 15 occurrences a WD prototype is added (the writer gets familiar with the handheld device and the pen). Another WD prototype is stored after some 35 occurrences (perhaps the user writes faster and changes his way of writing). After 150 adaptation words, the size of the prototype database was reduced by 80 %.
Fig. 10.6. Prototype adequacy evolution vs. occurrence for two writers (panels: class 's', 38 prototypes; class 'e', 45 prototypes). Thin lines are WI prototypes and bold lines are WD prototypes.
Soliciting the user to write 150 words is much too constraining. On the other hand, asking him (or her) to write a few words is acceptable, especially if the recognition rate is largely improved. This last combination consists in carrying out a supervised adaptation of the system on some known words and then using the self-supervised dynamic management adaptation strategy (table 10.4). Asking the user to write a sentence of 30 words decreases the error rate to 10 %, which is even better than supervised adaptation performed alone (12 %)! We believe these results are explained by the fact that, in supervised adaptation alone, the dynamic management of the prototypes is not used.
In this paper, we have shown that model-based classifiers are easy to adapt. Thanks to their structure, they can learn new writing styles by activating new prototypes and inactivating erroneous ones. We first presented a supervised adaptation strategy. It is very accurate but not user-friendly, as it needs to be supervised by the writer. We then tried to hide the adaptation process and presented several self-supervised strategies. The conditional activation scheme is the most accurate, as it focuses on reliable words alone. The prototype dynamic management increases both the recognition rate (from 75 % to 83 %) and the classification speed (by nearly a factor of two). This process automatically transforms a writer-independent database into a writer-dependent database of very high quality and compactness. Finally, combining supervised and self-supervised adaptation improves the system accuracy further (more than 90 %).
It would be interesting to evaluate a semi-supervised strategy where the user is solicited only in ambiguous cases. We also have to adapt the parameters of the segmentation expert, which is currently the biggest source of error.
The motivation of this work is based on two key observations. First, classification algorithms can be separated into two main categories: discriminative and model-based approaches. Second, two types of patterns can generate problems: ambiguous patterns and outliers. While the first approach tries to minimize the first type of error but cannot deal effectively with outliers, the second approach, based on the development of a model for each class, makes outlier detection possible but is not sufficiently discriminant. Thus, we propose to combine these two different approaches in a modular two-stage classification system embedded in a probabilistic framework. In the first stage we estimate the posterior probabilities with a model-based approach, and we re-estimate only the highest probabilities with appropriate Support Vector Classifiers (SVC) in the second stage. Another advantage of this combination is to reduce the principal burden of SVCs, namely the processing time necessary to make a decision, and to open the way to using SVCs in classification problems with a large number of classes. Finally, first experiments on the benchmark MNIST database have shown that our dynamic classification process maintains the accuracy of SVCs while decreasing complexity by a factor of 8.7 and making outlier rejection possible.
1. Introduction
The principal objective of a pattern recognition system is to minimize classification errors. However, another important factor is the capability to estimate a confidence measure in the decision made by the system. Indeed, this type of measure is essential in order to be able to make no decision when the result of the classification is uncertain. From this point of view, it is necessary to distinguish two categories of problematic patterns. The first one relates to ambiguous data, which may cause confusion between several classes, and the second category consists of data not belonging to any class: the outliers.
Furthermore, most classification algorithms can be divided into two main categories, denoted as discriminative and model-based approaches. The former tries to split the feature space into several regions by decision surfaces, whereas the latter is based on the development of a model for each class along with a similarity measure between each of these models and the unknown pattern (see Fig. 1). Different terms are used in the literature to refer to it: generative method,2 density model,18 approach by modeling,23 or model-based classifier.21
2. Model-based approach
One of the main advantages of this type of approach is its modularity. Indeed, the training process is computationally cheap because the model of each class is learned independently. Thus, it scales well to large-category problems such as Chinese character recognition.12 On the other hand, this modularity also facilitates adding or removing categories without re-training all the others.
Figure 2. Use of model-based approach to detect outliers (c) and ambiguous patterns (d).
Thus, given a data point $x$ of the feature space, the membership to the class $\omega_j$ can be evaluated by the square of the Euclidean distance $d_j$ from the point $x$ to its projection on the hyperplane:

$f_j(x) = \left((x - \mu_j)\,\Psi_j\right)\Psi_j^T + \mu_j . \quad (2)$

$\hat{P}_f(\omega_j \mid x) = \dfrac{\exp\left(-\alpha\, d_j(x)\right)}{\sum_{j'=1}^{c} \exp\left(-\alpha\, d_{j'}(x)\right)} . \quad (4)$
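A minimal sketch of this first stage follows, assuming that a per-class mean mu_j and a matrix Psi_j of leading eigenvectors (one per column) have already been estimated; the variable names and the use of NumPy are illustrative assumptions, not the authors' implementation.

import numpy as np

def first_stage_posteriors(x, means, eigvecs, alpha):
    """First-stage posterior estimates in the spirit of Eqs. (2) and (4):
    project x onto each class hyperplane, take the squared Euclidean distance
    to the projection, and pass the distances through a softmax."""
    d = np.empty(len(means))
    for j, (mu, Psi) in enumerate(zip(means, eigvecs)):
        f_j = (x - mu) @ Psi @ Psi.T + mu      # projection on the class hyperplane, Eq. (2)
        d[j] = np.sum((x - f_j) ** 2)          # squared Euclidean distance d_j(x)
    logits = -alpha * d
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()                         # softmax over negative distances, Eq. (4)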
The first step is to detect the patterns that may cause confusion. Bellili et al.1 and Prevost et al.21 consider that a conflict involves only two classes, and they use appropriate experts to reprocess either all samples1 or just the samples rejected by the first classifier.21 However, we consider that a conflict may involve more than two classes. Hence, it is preferable to use a dynamic number of classes in conflict. With this intention, we determine the list of $p$ classes $\{\omega_{(1)}, \ldots, \omega_{(p)}\}$ whose posterior probabilities estimated in the first stage are higher than a threshold $\varepsilon$. Thus, $(j)$ is the index of the $j$-th class that verifies:

$\hat{P}_f(\omega_{(j)} \mid x) > \varepsilon . \quad (5)$
$-\sum_{k=1}^{n} \left[ t_k \log \hat{P}(y_k = 1 \mid x_k) + (1 - t_k) \log\left(1 - \hat{P}(y_k = 1 \mid x_k)\right) \right] , \quad (8)$

where $t_k = \frac{y_k + 1}{2}$ denotes the probability target.
Then, to solve this optimization problem, the author uses a model-trust minimization algorithm based on the Levenberg-Marquardt algorithm. However, a recent note17 shows that there are two problems in the pseudo-code provided by Platt.20 One is the calculation of the objective value, and the other is the implementation of the optimization algorithm. Therefore, the authors propose another, more reliable minimization algorithm based on a simple Newton method with backtracking line search. We thus use this second algorithm to fit the additional sigmoid function and estimate posterior probabilities.
Furthermore, an SVC is a binary classifier, so it is necessary to combine several SVCs to solve a multi-class problem. The most classical method is
the "one against all" strategy, in which one SVC per class is constructed. Each classifier is trained to distinguish the examples of a single class from the examples of all remaining classes. Although this strategy is very accurate, it seems better to use in the second stage of our system a "pairwise coupling" approach, which consists of constructing a classifier for each pair of classes. Indeed, this strategy is more modular, and as reported by Chang & Lin,3 although we have to train as many as c(c-1)/2 classifiers, each problem is easier, so the total training time of "pairwise coupling" may not be larger than that of the "one against all" method. Furthermore, if we used "one against all" SVCs in the second stage, we would be obliged to calculate the distances to a large number of SVs belonging to the implausible classes, which increases the classification cost. Thus, we choose a "pairwise coupling" approach, and we apply the "Resemblance Model" proposed by Hamamura et al.10 to combine the posterior probabilities of the pairwise classifiers into the posterior probabilities of the multi-class classifier. Then, since prior probabilities are all equal, posterior probabilities can be estimated by
posterior probabilities can be estimated by
∏ Pˆ (ω j | x ∈ ω j, j' )
Pˆ (ω j | x ) =
j '≠ j
c
, (9)
∑ ∏ Pˆ (ω j '' | x ∈ ω j '', j ' )
j '' =1 j ' ≠ j ''
∏ Pˆ (ω s ( j) | x ∈ω ( j ), ( j '') )
⎛ c ⎞, (10)
Pˆs (ω ( j) | x) = p
j ''=1, j ''≠ j
p
×⎜⎜1 − ∑ Pˆ (ω f ( j ') | x ) ⎟⎟
∑ ∏ Pˆ (ω s ( j ') | x ∈ω ( j '), ( j '') ) ⎝ j '= p +1 ⎠
j '=1 j ''=1, j ''≠ j '
where the first term is related to the second stage, while the second term
is related to the first stage. The objective of this second term is to
maintain the sum of all the probabilities equal to one.
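As an illustration of how Eqs. (5), (9) and (10) fit together, here is a hedged Python sketch of the second stage. The helper pairwise_prob(a, b), returning the sigmoid-calibrated pairwise SVC estimate, is an assumed interface, and the default threshold value is only an example.

import numpy as np

def second_stage_posteriors(p_first, pairwise_prob, eps=1e-3):
    """Re-estimate the posteriors of the classes in conflict.
    `p_first[j]` is the first-stage posterior of class j; `pairwise_prob(a, b)`
    is assumed to return the calibrated estimate P(omega_a | x in {omega_a, omega_b})."""
    conflict = [j for j, p in enumerate(p_first) if p > eps]   # dynamic class list, Eq. (5)
    if len(conflict) < 2:
        return np.asarray(p_first, dtype=float)                # nothing to re-estimate
    prod = np.array([np.prod([pairwise_prob(a, b)
                              for b in conflict if b != a]) for a in conflict])
    # probability mass of the conflict classes, i.e. 1 minus the non-conflict posteriors
    mass = 1.0 - sum(p for j, p in enumerate(p_first) if j not in conflict)
    p_out = np.asarray(p_first, dtype=float).copy()
    p_out[conflict] = mass * prod / prod.sum()                 # resemblance-model combination, Eq. (10)
    return p_out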
4. Experimental results
To evaluate our method, we chose a classical pattern recognition problem: isolated handwritten digit recognition. Thus, in our experiments, we used a well-known benchmark database. The MNIST (Modified NIST) dataset^a was extracted from the NIST special databases SD3 and SD7. The original binary images were normalized into 20 × 20 grey-scale images with aspect ratio preserved, and the normalized images were centered in 28 × 28 images.

^a Available at http://yann.lecun.com/exdb/mnist/
The learning dataset contains 60,000 samples, and 10,000 others are used for testing. Moreover, we have divided the learning database into two subsets: the first 50,000 samples have been used for training and the next 10,000 for validation. Finally, the number of samples per class for each subset is reported in Table 1.
Table 1. Number of samples per class in the three subsets of the MNIST database.

             ω1     ω2     ω3     ω4     ω5     ω6     ω7     ω8     ω9     ω10
training     4932   5678   4968   5101   4859   4506   4951   5175   4842   4988
validation   991    1064   990    1030   983    915    967    1090   1009   961
test         980    1135   1032   1010   982    892    958    1028   974    1009
Several papers have dealt with the MNIST database. The best result mentioned in the original paper16 was obtained by the convolutional neural network LeNet-5 (0.95% error rate on the test dataset). More recently, a benchmarking of state-of-the-art techniques19 has shown that SVC with
Table 2. Error rates on the MNIST test dataset reported by Liu et al.19 with state-of-the-art techniques.

                             k-NN      LVQ       RBF       MLP       SVC
without feature extraction   3.66 %    2.79 %    2.53 %    1.91 %    1.41 %
with feature extraction      0.97 %    1.05 %    0.69 %    0.60 %    0.42 %
Figure 9. Example of outlier (generated with the 12th and 13th sample of the test dataset).
The training and testing of all SVCs are performed with the LIBSVM software.3 We use the C-SVC with a Gaussian kernel $K(x_k, x) = \exp\left(-\gamma \|x_k - x\|^2\right)$. The penalty parameter C and the kernel
As we can see in Table 3, after the first stage of classification the label of the data is not always among the first two classes, which justifies the choice of a dynamic number of classes in conflict.

Table 3. Ranking distribution of the label obtained with the model-based approach on the validation dataset.

controls this tradeoff. Then, the validation dataset can be used to fix this parameter according to the constraints fixed by the application.

Figure 12. Error-reject tradeoff of our two-stage classification system on the validation dataset.
For this reason, we fix the tolerance threshold $\varepsilon$ at $10^{-3}$, which seems a good tradeoff between accuracy and complexity. Fig. 13 shows an example of an ambiguous pattern. We can see in dark the posterior probability efficiently re-estimated by the second stage. Thus, if we had used $\varepsilon = 10^{-4}$, we would have obtained for this example a number $p = 7$ of classes in conflict and we would have used 21 SVCs to re-estimate the posterior probabilities.
Figure 13. Example of ambiguous pattern (5,907th sample of the test dataset).
Figure 14. The 150 errors obtained on the test dataset (label -> decision).
Figure 15. Distribution of the number p of SVCs used to classify the validation dataset.
References
1. A. Bellili, M. Gilloux and P. Gallinari, An MLP-SVM combination architecture for
offline handwritten digit recognition, International Journal on Document Analysis
and Recognition, 5(4), 244-252 (2003).
2. C.M. Bishop, Generative versus Discriminative Methods, in Computer Vision,
invited keynote talk at International Conference on Pattern Recognition (2004).
3. C.-C. Chang and C.-J. Lin, LIBSVM : a library for support vector machines (2001).
19. C.-L. Liu, K. Nakashima, H. Sako and H. Fujisawa, Handwritten digit recognition:
benchmarking of state-of-the-art techniques, Pattern Recognition, 36(10), 2271-
2285 (2003).
20. J.C. Platt, Probabilities for SV Machines, Advances in Large Margin Classifiers,
MIT Press, 61-74 (1999).
21. L. Prevost, C. Michel-Sendis, A. Moises, L. Oudot and M. Milgram, Combining
model-based and discriminative classifiers: application to handwritten character
recognition, International Conference on Document Analysis and Recognition, 31-
35 (2003).
22. N. Ragot and E. Anquetil, A generic hybrid classifier based on hierarchical fuzzy
modeling: Experiments on on-line handwritten character recognition, International
Conference on Document Analysis and Recognition, 963-967 (2003).
23. H. Schwenk, The diabolo classifier, Neural Computation, 10(8), 2175-2200 (1998).
24. L. Vuurpijl, L. Schomaker and M. Van Erp, Architectures for detecting and solving
conflicts: two-stage classification and support vector classifiers, International
Journal on Document Analysis and Recognition, 5(4), 213-223 (2003).
CHAPTER 12
We present a learning strategy for Hidden Markov Models that may be used to cluster handwriting sequences or to learn a character model by identifying its main writing styles. Our approach aims at learning both the structure and parameters of a Hidden Markov Model (HMM) from the data. A byproduct of this learning strategy is the ability to cluster signals and identify allographs. We provide experimental results on artificial data that demonstrate the possibility of learning HMM parameters and topology from data. For a given topology, our approach outperforms, in some cases that we identify, the standard Maximum Likelihood learning scheme. We also apply our unsupervised learning scheme to on-line handwriting signals for allograph clustering as well as for learning HMM models for handwritten digit recognition.
12.1. Introduction
This paper deals with on-line handwriting signal clustering and Hidden Markov Model (HMM) structure learning. These two problems may be closely related and are of interest in the field of on-line handwriting processing and recognition. Clustering on-line signals is useful for determining allographs automatically, identifying writing styles, discovering new handwritten shapes, etc. HMM structure learning may help to automatically handle allographs when designing an on-line handwriting recognition system. The standard way to learn an HMM model is indeed only semi-automatic and requires manual tuning, especially for the HMM topology. Learning HMM models involves learning the structure (topology) and the parameters of the model. Usually, learning consists in first choosing a structure and then automatically learning the model parameters from training data. Learning parameters is generally achieved with Maximum Likelihood optimiza-
our goals by restricting the HMM to belong to the class of mixtures of left-right HMMs. As in [SO93], we focus in this work on learning discrete HMMs. The learning consists of two steps. In a first step, a global HMM is built from the training data, using a procedure that builds a left-right HMM from each training sequence. We propose in this step an original procedure for initializing the emission probability distributions from the data and discuss its interest with respect to the Maximum Likelihood strategy used in [SO93]. This initial global HMM is then iteratively simplified by removing one left-right HMM of the mixture at a time. This ensures that, at any step, the global HMM belongs to the class of mixtures of left-right HMMs, which in particular allows performing clustering. This study is an extension of our previous work [BAG04], with new original contributions related mainly to the iterative simplification algorithm and to extended results on different databases.
We first present our unsupervised learning algorithm. First, we detail the building of the initial HMM (section 2). Then, we describe the iterative simplification algorithm applied to this initial model (section 3). The application of our algorithm to cluster sequences and to learn character models in a recognition engine is explained in section 4. The remainder of the paper is dedicated to experiments. We present the experimental databases in section 5 and evaluate the emission probability distribution estimation in section 6. The next two sections present experimental results on the two databases for clustering (section 7) and classification (section 8).
The main idea for building an initial global HMM covering all training data relies on building a left-right HMM from one training sequence. We first detail this idea, then we discuss how to build the global HMM.
Let $D = \{x_1, \ldots, x_n\}$ be a set of training sequences (e.g. a number of handwriting signals corresponding to a character). Each training sequence $x_i$, whose length is noted $l_i$, is a sequence of symbols $x_i = \left(s_{i1}, s_{i2}, \ldots, s_{il_i}\right)$, where each symbol $s_{ij}$ belongs to a finite alphabet $\Sigma$.
We first detail the structure of a left-right HMM built from a training sequence. Then we discuss its parameters, i.e. emission probability distributions and transition probabilities. We aim at building, from an original training sequence, an HMM that models well (i.e. gives high likelihood to) sequences that are close to it.
The HMM built from a training sequence $x = (s_1, \ldots, s_l)$ of length $l$ is a left-right HMM with $l$ states, one for each symbol in $x$. According to this procedure, there is a natural correspondence between any state and a particular symbol in $\Sigma$. This step is illustrated in Figure 1, where a training sequence of length 3 is used to build a three-state left-right HMM.
As we detail next, the emission probability distribution in a state of such an HMM is determined from the associated symbol in $\Sigma$. This ensures that the HMM will score well only sequences that are close to the original sequence.
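A minimal Python sketch of this construction follows; the self-transition probability and the emission_pdf dictionary (mapping each symbol to its profile-based p.d.f., as discussed in the next subsection) are illustrative assumptions.

def left_right_hmm_from_sequence(sequence, emission_pdf, self_loop=0.5):
    """Build a left-right HMM with one state per symbol of the training
    sequence. Each state emits according to the p.d.f. associated with its
    symbol (emission_pdf[symbol]); the self-transition probability is an
    assumed illustrative value."""
    n = len(sequence)
    transitions = [[0.0] * n for _ in range(n)]
    for i in range(n):
        transitions[i][i] = self_loop                 # stay in the current state
        if i + 1 < n:
            transitions[i][i + 1] = 1.0 - self_loop   # move to the next state
    emissions = [emission_pdf[s] for s in sequence]   # p.d.f. tied to the associated symbol
    return {"transitions": transitions, "emissions": emissions}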
12.2.1.2. Parameters
We consider as similar two strokes which appear in the same context. Let $x$ be any sequence of strokes ($x \in \Sigma^*$), and let $P_x(s)$ be the probability of seeing stroke $s$ after the sequence $x$. An estimate of $P_x(s)$ may be computed by counting on $D$:

$P_x(s) = \dfrac{w(xs)}{w(x)}$

The profile of a stroke $s$ is then the collection $P_s = \{P_x(s),\ x \in \Sigma^*\}$. The idea is that two symbols with similar profiles, i.e. appearing with the same frequency in the same contexts (sequences of symbols in $\Sigma$), should be very similar. This distribution may be approximated on $D$ by restricting the contexts $x$ to $\mathrm{sub}(D)$, where $\mathrm{sub}(D)$ stands for all subsequences of length $c$ (the context length) in the training sequences of $D$.
We then define the similarity $\kappa$ between two strokes $(s_1, s_2) \in \Sigma^2$ as the correlation between the profiles $P_{s_1}$ and $P_{s_2}$:
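The correlation formula itself is not reproduced in this extract; the following sketch assumes a Pearson correlation between the context-profile vectors and only illustrates the counting procedure described above.

from collections import Counter

import numpy as np

def context_profiles(sequences, context_len=1):
    """Estimate P_x(s) = w(xs) / w(x) by counting, for every context x of length
    `context_len` seen in the training sequences, how often each stroke s follows it.
    The profile of a stroke is the vector of these probabilities over all contexts."""
    counts, context_totals = Counter(), Counter()
    for seq in sequences:
        for i in range(len(seq) - context_len):
            x, s = tuple(seq[i:i + context_len]), seq[i + context_len]
            counts[(x, s)] += 1
            context_totals[x] += 1
    strokes = sorted({s for seq in sequences for s in seq})
    contexts = sorted(context_totals)
    return {s: np.array([counts[(x, s)] / context_totals[x] for x in contexts])
            for s in strokes}

def stroke_similarity(profiles, s1, s2):
    """Similarity kappa(s1, s2), taken here as the Pearson correlation between
    the two profiles (an assumption; the exact formula is not reproduced)."""
    return float(np.corrcoef(profiles[s1], profiles[s2])[0, 1])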
$P(x \mid M_0) = \sum_{i=1}^{n} w_i\, P(x \mid \lambda_i)$

where $x$ is an observed sequence, $\lambda_i$ the $i$-th left-right HMM built from $x_i$, and, for each $i$, $w_i = \frac{1}{n}$.
This HMM gives high likelihood to all training sequences in $D$ and low likelihood to any sequence that is far from every sequence in $D$. To sum up the ideas behind the building of the global HMM, Figure 2 illustrates the procedure for a set of 3 training sequences $D = \{abba, aab, bacca\}$ with an alphabet $\Sigma = \{a, b, c\}$. It is a mixture of three left-right HMMs, each one corresponding to a training sequence. In this construction, each state of the HMM is naturally associated to a symbol in $\Sigma$. The probability density functions (p.d.f.) in all states are defined according to this association; for instance, states associated to symbol $a$ use a p.d.f. $p_a$.
Fig. 12.2. Illustration of the building of a global HMM from a training set of three training sequences.
A major difference between our work and previous works lies in these p.d.f., $p_a$, $p_b$ and $p_c$. In [SO93], the global HMM that is built maximizes the likelihood of the training data. It has the same topology, but the p.d.f. are essentially Dirac functions. This means that the p.d.f. associated to symbol $a$, $p_a$, would be $[1\ 0\ 0]$ and $p_b$ would be $[0\ 1\ 0]$: the only possible observation in a state associated to symbol $a$ would be $a$, the only possible observation in a state associated to symbol $b$ would be $b$, etc. In our case, the p.d.f. are estimated from the data through the computation of a similarity measure between the symbols in $\Sigma$. This allows, as we will describe next, to design an iterative simplification procedure for the initial global HMM based on a simple likelihood criterion.
The general idea of the algorithm is to iteratively merge the two closest left-right HMMs in the global model $M$, so that at the end only typical left-right HMMs remain, each one being viewed as a model of an allograph. However, in order to keep a limited set of emission p.d.f., hence a limited number of parameters, we do not actually merge left-right HMMs; we rather remove the less significant ones. The principle of the algorithm is then to select the best models from the initial mixture model. The iterative simplification algorithm relies on a maximum likelihood criterion and is summed up below:
1. For each sequence of the database, build the corresponding left-right HMM.
2. Build the initial global HMM model as detailed above. Using n training sequences, M is a mixture of n left-right HMMs. Set k = 0.
3. Loop (at the k-th iteration, the model M^k is a mixture of (n-k) left-right HMMs):
   (a) Build (n-k) alternate models for M^{k+1} by removing one of the (n-k) left-right components of M^k.
   (b) Select the alternate model that maximizes the likelihood of all training data in D.
Several stop criteria may be used to determine when to stop the simplification. In the context of clustering, this corresponds to strategies for determining the right number of clusters. Unfortunately, no fully satisfying method exists for determining such an optimal number of clusters automatically; it remains an open problem. In the present implementation, the stop criterion is satisfied when a given number of left-right HMMs is reached. However, we will show experimentally that the likelihood decreases sharply when the right number of clusters is reached. This suggests that standard strategies can provide effective hints to determine automatically a correct number of left-right HMMs.
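A hedged sketch of the simplification loop is given below; the log_likelihood helper (scoring the uniform-weight mixture on the whole training set) and the fixed target number of components are assumptions consistent with the stop criterion described above.

import math

def simplify_mixture(components, data, log_likelihood, n_final):
    """Greedy simplification of the global HMM (a mixture of left-right HMMs):
    at each step, remove the component whose removal maximizes the likelihood
    of all training data, until `n_final` components remain."""
    components = list(components)
    while len(components) > n_final:
        best_score, best_idx = -math.inf, None
        for i in range(len(components)):
            candidate = components[:i] + components[i + 1:]   # drop one left-right HMM
            score = log_likelihood(candidate, data)           # likelihood of all data in D
            if score > best_score:
                best_score, best_idx = score, i
        components.pop(best_idx)
    return components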
First, it may be used for clustering sequences, when viewing each element of
the mixture (a left-right HMM) as a model of a given cluster. This may be of
interest to identify writing styles, for example to cluster writers according to the
way they write some characters.
Consider that $M$ is a mixture of $N$ left-right HMMs, with $N \ll n$:

$P(x \mid M) = \sum_{i=1}^{N} w_i\, P(x \mid \lambda_i)$

$P(i\text{-th cluster} \mid x) = \dfrac{w_i\, P(x \mid \lambda_i)}{\sum_{i'=1}^{N} w_{i'}\, P(x \mid \lambda_{i'})}$
This allows assigning any sequence $x$ to a particular cluster using Bayes' rule, i.e. a maximum a posteriori probability rule.
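A small sketch of this assignment rule; seq_log_likelihood is an assumed helper returning log P(x | lambda_i) for a left-right component, and uniform weights are used by default as in the text.

import numpy as np

def assign_cluster(x, components, seq_log_likelihood, weights=None):
    """Assign sequence x to the cluster (left-right HMM) with maximum
    posterior probability."""
    n = len(components)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    log_post = np.log(w) + np.array([seq_log_likelihood(h, x) for h in components])
    return int(np.argmax(log_post))   # maximum a posteriori cluster index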
Second, the approach may be used to learn character models. For example, we will provide experimental results for the task of digit classification. In these experiments, the algorithm is run independently on the training data of each character, leading to an HMM whose topology and parameters are fully learned from training data. This is an interesting procedure to design, with less manual tuning, a new recognition engine for a particular set of symbols or characters.
Fig. 12.3. The four generating HMMs for the artificial datasets. F represents the final state; the
emission p.d.f. shown correspond to the easy dataset.
Two datasets of 1000 sequences were generated from this set of 4 HMMs. The
first set is labelled easy, with a = 0.22. The second set is labelled hard, with a =
0.15. Table 1 shows statistical details for these two datasets.
Fig. 12.4. Set Σ of 36 fixed elementary strokes used to represent handwriting signals — from left to
right: 12 straight lines (named es1 to es12 ), 12 convex strokes (named es13 to es24 ), and 12 concave
strokes (named es25 to es36 ).
We investigate here the quality of our method for estimating emission probability
distributions.
Fig. 12.5. Similarities between symbols inferred from the easy dataset, with context length 1. The jth
column of the matrix corresponds to the emission probability distribution in a state associated to the
jth symbol.
Fig. 12.6. Similarities between symbols of the alphabet inferred from the hard dataset, with different
context lengths (from left to right: 1, 2, 3).
Figure 12.7 shows two such matrices, in which the element at the $i$-th row and $j$-th column is the probability of observing the $i$-th stroke in a state corresponding to the $j$-th stroke of $\Sigma$. The left matrix has been tuned manually according to prior knowledge [AG02], while the right matrix is estimated with the method presented in section §2.1.2. As may be seen, there are strong correlations between these two matrices, which shows that our estimation method efficiently captures, from the training database $D$, the similarity between symbols.
Fig. 12.7. 36x36 matrices representing probability distributions of states associated to strokes of Σ.
The matrix on the left has been tuned manually using prior knowledge and the matrix on the right has
been learned from the data using the procedure in §2.1.2.
We present here experimental results for the sequence clustering task. We first discuss the evaluation criteria. Then we present a benchmark method with which we compare our approach. Finally, we present experiments on artificial data and on real handwriting data.
$\mathrm{precision} = \sum_{j} \frac{n_j}{n} \max_i P_{ij}$
In order to give more insight into our approach, labelled BAG in the figures, we provide some comparative results using a standard learning scheme for HMM parameters based on the CEM algorithm (stochastic EM). It is a variant of the EM algorithm that may outperform EM in unsupervised learning, especially when dealing with too few data to estimate the likelihood correctly [CD88].
For each number of clusters K, we learn an HMM whose topology is a mixture of left-right HMMs. We use this HMM to perform clustering, using a Maximum Likelihood estimation scheme. To use such a learning strategy, one has first to define the topology of the model and then to initialize its parameters (emission probability distributions and transition probabilities).
First, we investigate the clustering performance of our approach and compare it to CEM re-estimation (CEM2). This favours the CEM approach, since the main problem for the latter is to find a correct initialization. On the easy dataset, as may be seen in Figure 8, our approach outperforms CEM and its performance is close to the Bayes classification error, although the learning is totally unsupervised. With the easy dataset, we also see that the likelihood function shows an inflexion point at the "good" number of clusters, i.e. the number of HMMs that generated the data. This allows the correct number of clusters to be detected easily.
Fig. 12.8. Above: performance on the easy dataset, comparing our approach (BAG) to CEM2, an EM re-estimation of the model learned with our approach. Below: logarithm of the likelihood, showing an inflexion point at 4 clusters, which corresponds to the number of HMMs that generated the data.
A look at the cluster models reveals that our approach correctly identifies the shortest sequences that are typical of each model. Our analysis is that the strength of our approach is to correctly identify the most typical sequences in the data and to use them as cluster models. Furthermore, we can stress that, given the high probability of self-transition (0.9), the data tend to contain sequences that are much longer than the number of states of the generating models. Therefore, to minimize the probability of having an unrepresentative state in the cluster models, the shorter the sequence, the more likely it is to contain only "good" states. But there are also fewer short sequences present in the data.
The hard dataset provides weaker results, which is logical given the high noise ratio. Figure 9 shows the clustering results for all three approaches (BAG, CEM1 and CEM2). There is no clear tendency between BAG and CEM1: CEM1 gives better results for a low number of clusters, and our approach gives better results for a high number of clusters.

Fig. 12.9. Performance on the hard dataset, comparing our approach (BAG) to the CEM1 and CEM2 clustering approaches.

For CEM2, we used our approach (BAG) as the initialization of the CEM clustering algorithm, and CEM2 provides better results. We can explain this using our previous interpretation: our approach works by selecting, in the data, the sequences that are the most representative of each model. Indeed, as we verified by a close look at the data, there is simply no single fully representative sequence of each model, since the noise ratio is very high in the hard dataset. Therefore, the selected cluster models contain some "bad" states, and our approach cannot modify the left-right HMMs which are part of the model, whereas the CEM re-estimation can.
In the next section, we turn to our real-world application, handwritten digit clustering and classification, to see how our approach compares, and whether there exist at least some good sequences in the data of a real-world application.
In a first series of experiments, we used 100 samples of the digits '0' and '9', whose drawings are very similar. As an illustration, the clusters resulting from one experiment with our model are drawn in Figure 10: the discovered clusters are homogeneous (including either '0' or '9' samples). The two clusters for digit '0' do include slightly different drawings, since the samples of the smaller set are drawn the other way round. In this figure, the drawing is generated from our model representation; therefore, the characters do not display as nicely as in the fine-grained original representation.
Fig. 12.10. The three discovered clusters for a database of on-line handwriting samples of digits ’0’
and ’9’.
with the benchmark method (CEM2). Hence, at each step of the simplification algorithm, i.e. for any number of clusters, the resulting models M are re-estimated with the CEM algorithm. The graph "CEM2 (Fix)" uses the model learned with manually tuned emission probability distributions, while "CEM2 (Est)" uses the model with distributions estimated from the data.
For example, for 20 clusters, our approach leads to about 86% accuracy with tuned emission probability distributions and to 83% with estimated emission probability distributions. These two systems, when re-estimated using a CEM optimization, lead respectively to 80% and 74% accuracy.
As may be seen, whatever the number of clusters, CEM re-estimation lowers the performance, although it maximizes the likelihood. Note that, assuming that
there are, on average, two allographs per digit, we are mostly interested here in the performance for about 20 clusters; i.e., a perfect automatic method would find the 20 clusters that represent all the allographs. The reason for the ineffectiveness of the CEM re-estimation scheme is not clear. However, we think that our learning strategy is naturally better suited to discovering typical clusters of sequences. The reason is that a left-right HMM built from a training sequence, as detailed in section §2.1, cannot handle much variability around the original training sequence. Thus it leads to compact and homogeneous clusters. In contrast, performing CEM re-estimation may result in less specific left-right HMMs, and thus in less precise clusters. These results answer the question we left open in section 6.1: our approach depends on its ability to find typical sequences in the data, and indeed, in our real application, there are at least some well-recorded samples associated to a given handwritten character.
Finally, we conducted an experiment using 1000 samples of the letters 'a' and 'd', which are often confused by on-line handwriting systems. Whereas the precision is only 60% for 2 clusters, it jumps to 95% for 5 clusters, which constitutes a rather acceptable approximation of the number of allographs for these two characters.
The preceding results show that clustering is indeed a difficult task, since for a reasonable number of clusters (20) the precision does not exceed 85%, whereas classification results on such handwriting signals may reach about 95% [AG02]. However, our unsupervised approach outperforms the benchmark methods provided there are enough clusters, while performance falls sharply when the number of clusters decreases.
Fig. 12.12. Clustering performance using our approach (BAG) and CEM2. The red graph BAG
(Est) corresponds to the model using estimated emission probability distributions; the blue graph BAG
(Fix) corresponds to the model using manually tuned emission probability distributions. Green (CEM2
(Est)) and yellow (CEM2 (Fix)) graphs correspond to the re-estimation of the two models BAG (Est)
and BAG (Fix).
handwriting recognition (i.e. duration model and pen-up moves), in order to keep the generality of our approach. However, in light of these limitations, our results appear promising, since we obtain the same level of performance as with the approach described in [MSAG03].
12.9. Conclusion
Fig. 12.13. Recognition rate (for digit recognition) as a function of the number of components (left-
right HMMs) per digit model.
References
[AG02] Thierry Artières and Patrick Gallinari. Stroke level HMMs for on-line
handwriting recognition. In 8th International Workshop on Frontiers in Hand-
writing Recognition (IWFHR-8), Niagara, August 2002, pages 227-232.
[BAG04] Henri Binsztok, Thierry Artières, Patrick Gallinari: A Model-Based Ap-
proach to Sequence Clustering. European Conference on Artificial Intelligence
(ECAI 2004). Valencia, Spain. Pages 420-424.
[Bra99] M. Brand. Structure learning in conditional probability models via an
entropic prior and parameter extinction. Neural Computation, 11:1155–1182,
1999.
[CD88] Gilles Celeux and Jean Diebolt. A random imputation principle: the
stochastic em algorithm. Technical report, Rapport de recherche de l’INRIA- Roc-
quencourt, 1988.
[CGS00] Igor V. Cadez, Scott Gaffney, and Padhraic Smyth. A general probabilistic framework for clustering individuals and objects. In Raghu Ramakrishnan, Sal Stolfo, Roberto Bayardo, and Ismail Parsa, editors, Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
[SO93] Andreas Stolcke and Stephen Omohundro. Hidden Markov Model in-
duction by bayesian model merging. In Stephen José Hanson, Jack D. Cowan,
and C. Lee Giles, editors, Advances in Neural Information Processing Systems,
volume 5, pages 11–18. Morgan Kaufmann, San Mateo, CA, 1993.
[VS97] L. Vuurpijl and L. Schomaker. Finding structure in diversity: A hier-
archical clustering method for the categorization of allographs in handwriting.
International Conference on Document Analysis and Recognition 1997. Pages
387-393.
CHAPTER 13
13.1. Introduction
The last years have witnessed extraordinary advances in computer and communications technology, leading to an increasing availability of information and of processing capabilities for multimedia data.1,2 This fact is resulting in a higher and wider demand for easier access to information.3 On the one hand, this information is mainly stored in digital format, so access to it is limited by the user's ability to communicate with computers. On the other hand, the great expressive power of the natural language used in human-human communication has been remarked, as well as its intrinsic multimodal features.4 Consequently, access to digital information could be carried out using this natural language, reducing the need to know a specific way to interact with the computer and taking advantage
13.2.1. Fundamentals
The singular value decomposition of a matrix $M_{p\times q} = [m_1 \cdots m_q]$ is given by:

$M_{p\times q} = U_{p\times p}\, \Sigma_{p\times q}\, V_{q\times q}^{T} \qquad (13.1)$
where $U = [u_1 \cdots u_p]$ and $V = [v_1 \cdots v_q]$ are orthonormal matrices; the $u_i$ are the eigenvectors of $MM^T$ and span the column space of $M$; the $v_i$ are the eigenvectors of $M^T M$ and span the row space of $M$; and $\Sigma$ is a diagonal matrix containing the singular values of $M$ (the square roots of the common eigenvalues of $MM^T$ and $M^T M$) in descending order. Notice that if $M$ is a rank-$r$ matrix, where $r \leq p$ and $r \leq q$, its corresponding $\Sigma$ has only $r$ non-null singular values and Eq. (13.1) can be rewritten as the thin SVD: $M_{p\times q} = U_{p\times r} \Sigma_{r\times r} V_{q\times r}^{T}$. On the other hand, let $C_{r\times q} = U_{p\times r}^{T} M_{p\times q}$ be the projections of the columns of $M$ onto the eigenspace spanned by $U$. Using the thin SVD expression, the projection matrix $C = [c_1 \cdots c_q]$ can also be written as $C_{r\times q} = \Sigma_{r\times r} V_{q\times r}^{T}$.
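For illustration, a short NumPy sketch of the thin SVD and of the column projections $C = U^T M = \Sigma V^T$ (this is generic linear algebra, not the authors' code):

import numpy as np

def thin_svd_projections(M, r=None):
    """Thin SVD of M (p x q) and the projections of its columns onto the
    eigenspace spanned by U, as in Eq. (13.1) and the text above."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)   # M = U diag(s) Vt
    if r is not None:                                  # keep only the leading r components
        U, s, Vt = U[:, :r], s[:r], Vt[:r, :]
    C = np.diag(s) @ Vt                                # equals U.T @ M
    return U, s, Vt, C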
In other fields, such as the classification problems pointed out by,13 a more suitable representation of $M$ can be achieved by including the mean information $m = \frac{1}{q}\sum_{i=1}^{q} m_i$ in Eq. (13.1); the mean has to be computed and subtracted from $M$ beforehand in order to generate the SVD of $M - m \cdot \mathbf{1}$:

Starting from Eq. (13.2) and the matrix $I$, Eq. (13.4) can be obtained using the method proposed by13 if the SVD of $I$ has been computed previously and $q$ and $c$ are known beforehand. A new method for updating both the SVD and the mean using only the new observations and the previous factorization is presented in Sect. 13.2.3.

$M_i = U_i \Sigma_i V_i^T + m_i \mathbf{1} . \qquad (13.5)$

$M_f = [M_i\ I] = U_f \Sigma_f V_f^T + m_f \mathbf{1} . \qquad (13.6)$
Defining $\hat{M}_i$ as in Eq. (13.7) and centering the new columns $I$ around $m_i$ as in Eq. (13.8), it can be written:

The new columns $I_{p\times c}$ (see sect. 13.2.2) will be referred to throughout this paper as the update block. Note that Eq. (13.10) is the updated SVD of Eq. (13.7) when some new observations $\hat{I}$ are added. This update can be done as12 suggests:

$\left[\hat{M}_i\ \hat{I}\right] = \left[U_i\ Q_i\right] \cdot \begin{bmatrix} \Sigma_i & U_i^T \hat{I} \\ 0 & Q_i^T \hat{I} \end{bmatrix} \cdot \begin{bmatrix} V_i^T & 0 \\ 0 & 1 \end{bmatrix} = \left[U_i\ Q_i\right] \cdot U_d \Sigma_d V_d^T \cdot \begin{bmatrix} V_i & 0 \\ 0 & 1 \end{bmatrix}^T = U_t \Sigma_t V_t^T \qquad (13.11)$

The previous method can also be used to extract the mean information from an existing SVD, e.g. trying to express $S = U_t \Sigma_t V_t^T$ as $S = U_f \Sigma_f V_f^T + s \cdot \mathbf{1}$ by setting $[\hat{M}_i\ \hat{I}] = S$ and $m_t = 0$ in Eqs. (13.12) to (13.15).
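A hedged NumPy sketch of the core update of Eq. (13.11) is given below. It only covers the SVD part for already mean-centred new columns; the mean update of Eqs. (13.5)-(13.10) and the rank-truncation policy of the chapter are not reproduced, and the rank argument is an illustrative assumption.

import numpy as np

def incremental_svd_update(U, s, Vt, B, rank=None):
    """Append new (mean-centred) columns B to a factorization U diag(s) Vt,
    following the structure of Eq. (13.11)."""
    p, r = U.shape
    q = Vt.shape[1]
    L = U.T @ B                        # projection of B onto the current subspace
    H = B - U @ L                      # residual, orthogonal to span(U)
    Q, R = np.linalg.qr(H)             # orthonormal basis of the residual
    # small (r + c) x (r + c) matrix whose SVD updates the factorization
    K = np.block([[np.diag(s), L],
                  [np.zeros((R.shape[0], r)), R]])
    Ud, sd, Vdt = np.linalg.svd(K, full_matrices=False)
    U_new = np.hstack([U, Q]) @ Ud
    V_aug = np.block([[Vt.T, np.zeros((q, B.shape[1]))],
                      [np.zeros((B.shape[1], r)), np.eye(B.shape[1])]])
    Vt_new = (V_aug @ Vdt.T).T
    if rank is not None:               # optionally keep only the leading directions
        U_new, sd, Vt_new = U_new[:, :rank], sd[:rank], Vt_new[:rank, :]
    return U_new, sd, Vt_new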
Table 13.1. Resource order requirements of the proposed mean update algorithm.
The mean update presented in section 13.2.3 does not increase the order of resources required by the incremental SVD methods developed in.10–14 The computational cost becomes $O(qr^2 + pr^2)$ and the memory complexity is $O(pr + qr)$, as shown in Table 13.1.
In this paper, On-the-fly Face Training is defined as the process of learning the photo-realistic facial appearance model of a person observed in a sequence in a rigorously causal fashion. This means that it is not necessary to take into account subsequent images when adding the information of the current one, which is considered only once. Note that the facial appearance is learnt in the same order as the images are captured, allowing a real-time learning capability in the near future, as computational resources constantly increase.
than 99.9% without any loss of perceptual quality (see Fig. 13.2).
Fig. 13.1. (a) Masks $\pi^r$. (b) Image $I_t$. (c) Regions $R^r_t$, obtained from the application of each mask $\pi^r$ over image $I_t$. (d) Vectors $o^r_t$ related to the defined regions.
Fig. 13.2. (a) Three observed frames of a subject’s face. (b) The same synthesized frames after
learning the appearance model of this person.
Table 13.2. Resource order requirements of the proposed SVD and mean update algorithm over
a matrix Op×q using update block size c and the s eigenvectors corresponding to the largest s
singular values of Op×q . To obtain more compact expressions, value n = s + c has been used.
Value k identifies the iteration number.
The value $s$ is the number of eigenvectors kept in the matrices $U^{k+1}$ (step 4 of the On-the-fly algorithm), and the update block size is specified by $c$. In this analysis, we assume that $p > q$, $q > s$ and $q > c$. Moreover, if additional assumptions are made on the values of $c$ and $s$, particular cost functions can be described as follows:
The computational cost order of the batch process is $O(pq(p + q))$, which is higher than in the first assumption and slightly higher than in the two last ones. Note that the two last cases also have a similar cost.
Regarding memory costs, the batch process has memory requirements of order $O(q^2 + sp)$, while the proposed incremental approach requires $O\left((c + s)(p + q + s) + c^2\right)$. As can be noted, for small values of $c$ and $s$ the presented approach achieves a great memory reduction, and it does not increase the order in the other cases.
The On-the-fly Training Algorithm has been tested on a short sequence and a long one, both recorded at a frame rate of 25 images/s. The short sequence consists of 316 images and has been used to compare the results obtained with our On-the-fly Training Algorithm and with its previous non-causal version.7 While achieving the same quality in the results (see Fig. 13.4), the presented algorithm reduced the execution time by about 66% with respect to7 and required about 7 Mbytes, compared with the 200 Mbytes consumed by7 (see the comparison in Fig. 13.5). For the long sequence (10000 frames), the processing requirements were impossible to meet with the non-causal algorithm7 because of its huge memory cost of 6000 Mbytes, even when massive storage systems (e.g. hard drives) were used; the On-the-fly Training Algorithm reduced the memory requirements to 17 Mbytes, with a processing time of a little more than 10 hours (using a 2 GHz processor) (see Fig. 13.5).
Fig. 13.4. Output tracking results of the learning process for: (a) the On-the-fly Training Algorithm
and (b) the non-causal algorithm.
Fig. 13.5. Solid line represents the On-the-fly Training Algorithm performance while the dashed one
belongs to the non-causal algorithm presented in.7 (a) Computation time in seconds. (b) Memory used
in bytes.
In this section, the quality of the results given by the proposed incremental SVD and mean update algorithm (sect. 13.2.3) is analyzed and compared to the ideal performance offered by the batch solution.
Some experiments have been carried out in order to test the analysis shown in the previous section (13.3.3). Two video sequences have been recorded and the face has been aligned in each one using our On-the-fly training algorithm. Starting from these aligned observations, stored columnwise in each $O^k$, we have factorized every $O^k$ using both the batch SVD process and our incremental SVD with mean update algorithm (sect. 13.2.3), obtaining two approximations of the form:

$O^k_{p\times q} \approx U^k_{p\times s}\, \Sigma^k_{s\times s}\, (V^k_{q\times s})^T + \bar{o}^k_{p\times 1} \cdot 1_{1\times q} . \qquad (13.17)$

$O^k_{p\times q} \approx \hat{U}^k_{p\times s}\, \hat{\Sigma}^k_{s\times s}\, (\hat{V}^k_{q\times s})^T + \hat{o}^k_{p\times 1} \cdot 1_{1\times q} . \qquad (13.18)$
$e_b(c, \tau) = \sum_{\forall k} \left\| M^k_{p\times q} - U^k_{p\times s}\, \Sigma^k_{s\times s}\, (V^k_{q\times s})^T - \bar{o}^k_{p\times 1} \cdot 1_{1\times q} \right\|_2 . \qquad (13.19)$

$e_i(c, \tau) = \sum_{\forall k} \left\| M^k_{p\times q} - \hat{U}^k_{p\times s}\, \hat{\Sigma}^k_{s\times s}\, (\hat{V}^k_{q\times s})^T - \hat{o}^k_{p\times 1} \cdot 1_{1\times q} \right\|_2 . \qquad (13.20)$
Fig. 13.7. Execution time of the algorithm with different number of eigenvectors and update block
size.
We have measured the execution time of both the batch and the incremental computation processes of section 13.4.2.1. The execution time of our incremental SVD and mean update algorithm is depicted in fig. 13.7 as a function of the update block size and the threshold (sect. 13.3.2), and has been obtained as the mean execution time over the observation matrices $O^k$.
It can be noted that the analysis made in sect. 13.3.3 is reflected in fig. 13.7. The fastest results (about a third of the computation time of the batch approach) are achieved for small update block sizes and a large threshold, which translates into keeping only a few (1-2) eigenvectors. On the other hand, the heaviest computational load corresponds to the case of a small block size (1-5) and a low threshold (0.001), which translates into a larger number of eigenvectors (30) and even exceeds the computation time of the batch process. Finally, it can also be seen that as the block size grows, the computational cost becomes more independent of the threshold (or of the number of eigenvectors kept).
13.4.2.3. Conclusions
It can be concluded that the best alternative consists in using a small block size (i.e. 1-10) with a relatively small threshold (i.e. 0.01, which yields about 10 eigenvectors); it achieves a relative error of less than $10^{-3}$ with half the computation time of the corresponding batch process. Moreover, when both the update block size and the threshold are small enough ($\tau = 0.001$, yielding more than 30 eigenvectors, and $c = 1$), the incremental SVD and mean update algorithm achieves the best performance, but with the heaviest computational load. On the other hand, the fastest option is achieved with a small update block size and a high threshold ($\tau = 0.1$,
In this paper, a new method for extracting the mean from an existing SVD has been presented, without increasing the cost order in either memory or time. This has allowed us to offer an incremental computation of the SVD that preserves a zero data mean, which has been analyzed and compared with the batch approach. The precision offered by our method is high enough to allow photorealistic reconstructions of observed face images using half the computation time of the non-incremental processes. Fields that can benefit from it include, e.g., classification problems, where the mean information is used to center the data; incremental computation of covariance matrices, which need to be centered around their mean; and causal construction of eigenspaces, where the principal components of the data are included as well as the mean information. With respect to the latter, the On-the-fly Algorithm is presented in this work. Given an image sequence and a set of masks, this algorithm is capable of generating a separate eigenspace for each facial element (learning all their appearance variations due to changes in expression and visual utterances) and of effectively tracking and aligning them. Furthermore, longer sequences than with previous methods5,7 can be processed with the same visual accuracy when no illumination changes appear. Finally, we plan to add more robustness to this algorithm using methods like,5 and more work will be done in order to achieve real-time performance, so that specific appearance models can be obtained as a person is being recorded.
CHAPTER 14
14.1. Introduction
Motion capture is the process of recording live movement and translating it into usable mathematical terms by tracking a number of key points or regions/segments in space over time and combining them to obtain a 3-D representation of the performance.3
By reviewing the literature, we distinguish two different strategies for human action modeling based on motion capture data, namely data-driven and model-driven. Data-driven approaches build detailed descriptions of recorded actions and develop procedures for their adaptation and adjustment to different characters.4 Model-driven strategies search for parameterized representations controlled by a few parameters:5 computational models provide compactness and facilitate easy editing and manipulation. Both approaches are reviewed next.
Data-driven procedures take care of the specific details of motion: accurate movement descriptions are obtained by means of motion capture systems, usually optical ones. As a result, a large quantity of unstructured data is obtained, which is difficult to modify while maintaining the essence of the motion.6,7 Inverse Kinematics (IK) is a well-known technique for the correction of a single human posture.8,9 However, it is difficult to apply IK over a whole action sequence while obeying spatial constraints and avoiding motion discontinuities. Consequently, current effort is
In our experiments, an optical system was used to provide real training data to our algorithms. The system is based on six synchronized video cameras and incorporates all the elements and equipment necessary for the automatic control of cameras and lights during the capture process. It also includes an advanced software package for the reconstruction of movements and the effective treatment of occlusions.
First, the subject placed a set of 19 reflective markers on the joints and other characteristic points of the body, see Fig. 14.1.(a) and (b). These markers are small round pieces of plastic covered in reflective material. Subsequently, the agent is placed in a controlled environment (i.e., controlled illumination and reflective noise), where the capture is carried out. As a result, the accurate 3-D positions of the markers are obtained for each recorded posture $p_s$, at 30 frames per second:

$p_s = (x_1, y_1, z_1, \ldots, x_{19}, y_{19}, z_{19})^T . \qquad (14.1)$
An action will be represented as a sequence of postures, so a proper body model is required. In our experiments, not all of the 19 markers are considered to model human actions. In fact, we only process those markers which correspond to the joints of a predefined human body model. The body model considered is composed of twelve rigid body parts (hip, torso, shoulder, neck, two thighs, two legs, two arms and two forearms) and fifteen joints, see Fig. 14.2.(a). These joints are structured in a hierarchical manner, where the root is located at the hips, see Fig. 14.2.(b).

Fig. 14.1. Procedure for data acquisition. Figures (a) and (b) show the agent with the 19 markers placed on the joints and other characteristic points of the body.
Fig. 14.2. (a) Generic human body model represented using a stick figure similar to19 , here composed
of twelve limbs and fifteen joints. (b) Hierarchy of the joints of the human body model.
We next represent the human body by describing the elevation and orientation of each limb using three angles which are more natural for describing limb movement.20 We consider the 3-D polar coordinate system, which describes the orientation of a limb in terms of its elevation, latitude and longitude, see Fig. 14.3. As a result, the twelve independently moving limbs have, in this 3-D polar space, a total of twenty-four rotational DOFs, which correspond to thirty-six absolute angles.
Fig. 14.3. The polar space coordinate system describes a limb in terms of the elevation φl , latitude
θl , and longitude ψl .
Once the learning samples are available, we compute the aSpace representation Ω of the aWalk action, as detailed in.2 In our experiments, the walking performances of three females and two males were captured to collect the training data set. For each walker, nearly 50 aWalk cycles were recorded. As a result, the training data is composed of nearly 1500 human posture configurations per agent, resulting in 7500 3-D body postures for building the aWalk aSpace. From Eq. (14.5), the training data set $A$ is composed of the acquired human postures:

$A = \{x_1, x_2, \ldots, x_f\}, \qquad (14.6)$

where $f$ refers to the overall number of training postures for this action:

$f = \sum_{j=1}^{r} f_j . \qquad (14.7)$
The mean human posture $\bar{x}$ and the covariance matrix $\Sigma$ of $A$ are calculated. Subsequently, the eigenvalues $\Lambda$ and eigenvectors $E$ of $\Sigma$ are found by solving the eigenvector decomposition equation.
We preserve the major linear correlations by considering the eigenvectors $e_i$ corresponding to the largest eigenvalues $\lambda_i$. Fig. 14.4 shows the three eigenvectors associated to the three largest eigenvalues, which correspond to the most relevant modes of change of the human posture in the aWalk aSpace. As expected, these modes of variation are mainly related to the movement of the legs and arms.
Fig. 14.4. The three most important modes of variation of the aWalk aSpace.
If we need to guarantee that the first $m$ eigenvectors actually model, for example, 95% of the overall variance of the samples, we choose $m$ so that:

$\dfrac{\sum_{k=1}^{m} \lambda_k}{\lambda_T} \geq 0.95 , \qquad (14.9)$

where $\lambda_T$ denotes the overall variance of the training data. The individual contribution of each eigenvector determines that 95% of the variation of the training data is captured by the thirteen eigenvectors associated to the thirteen largest eigenvalues. So the resulting aWalk aSpace Ω is defined as the combination of the eigenvectors $E$, the eigenvalues $\Lambda$ and the mean posture $\bar{x}$:

$\Omega = (E, \Lambda, \bar{x}). \qquad (14.10)$
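A compact sketch of this construction, assuming the training postures are stacked as rows of a matrix (a layout assumption; the chapter does not prescribe one):

import numpy as np

def build_aspace(A, variance=0.95):
    """Build an aSpace Omega = (E, Lambda, mean) from training postures stored
    as rows of A: compute the mean posture and covariance, solve the eigenvector
    decomposition, and keep the m leading eigenvectors that explain the required
    fraction of the overall variance (Eq. 14.9)."""
    mean = A.mean(axis=0)
    cov = np.cov(A - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)             # ascending order
    order = np.argsort(eigvals)[::-1]                  # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(ratio, variance) + 1)      # smallest m reaching the threshold
    return eigvecs[:, :m], eigvals[:m], mean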
Using the aWalk aSpace, each performance is represented as a set of points, each
point corresponding to the projection of a learning human posture xi :
yi = [e1 , ..., em ]T (xi − x̄). (14.11)
Thus, we obtain a set of discrete points yi in the action space that represents
the action class Ω. By projecting the set of human postures of an aWalk perfor-
mance Hj , we obtain a cloud of points wich corresponds to the projections of the
postures exhibited during such a performance.
We consider the projections of each performance as the control values for
an interpolating curve gj (p), which is computed using a standard cubic-spline
interpolation algorithm.21 The parameter p refers to the temporal variation of the
posture, which is normalized for each performance, that is, p ∈ [0, 1]. Thus, by
varying p, we actually move along the manifold.
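A minimal sketch of this interpolation step, assuming SciPy's standard cubic-spline routine as a stand-in for the algorithm of ref. 21 and a matrix Y holding the m-dimensional projections of one performance in temporal order:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def performance_manifold(Y):
    # Y: (n, m) projections of one performance, in temporal order
    p = np.linspace(0.0, 1.0, len(Y))   # normalized temporal parameter p in [0, 1]
    return CubicSpline(p, Y, axis=0)    # vector-valued interpolating curve g_j(p)

# Example: g = performance_manifold(Y); g(0.5) returns the interpolated
# projection halfway along the performance.
```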
This process is repeated for each performance of the learning set, thus obtain-
ing r manifolds:
gj (p), p ∈ [0, 1], j = 1, ..., r. (14.12)
Afterwards, the mean manifold g(p) is obtained by averaging the r manifolds at each index p and interpolating between these mean points. This performance representation is not influenced
by its duration, expressed in seconds or number of frames. Unfortunately, this re-
sulting parametric manifold is influenced by the fact that any subject performs an
action in the way he or she is used to. That is to say, the extreme variability of hu-
man posture configurations recorded during different performances of the aWalk
action affects the mean calculation for each index p. As a result, the manifold may
comprise abrupt changes of direction.
A similar problem can be found in the computer animation domain, where
the goal is to generate virtual figures exhibiting smooth and realistic movement.
Commonly, animators define and draw a set of specific frames, called key frames
or extremes, which assist the task of drawing the intermediate frames of the ani-
mated sequence.
Likewise, our goal is to extract the most characteristic body posture configurations, which will correspond to the set of key-frames for that action. From
a probabilistic point of view, we define characteristic postures as the least likely
body postures exhibited during the action performances. As the aSpace is built
based on PCA, such a space can also be used to compute the action class condi-
tional density P (xj |Ω).
We assume that the Mahalanobis distance is a sufficient statistic for charac-
terizing the likelihood:
So, once the mean manifold g(p) is established, we compute the likelihood
values for the sequence of pose-ordered projections that lie in such a manifold.22,23
That is, we apply Eq. (14.13) for each component of the manifold g(p). Lo-
cal maxima of this function correspond to locally maximal distances or, in other
words, to the least likely samples, see Fig. 14.5.
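Since Eq. (14.13) is not reproduced in this excerpt, the sketch below assumes the usual Mahalanobis distance in the reduced eigenspace, d(y) = Σ_k y_k²/λ_k, and selects key-frames as interior local maxima of that distance along the pose-ordered mean manifold; it illustrates the selection rule rather than the authors' exact formulation.

```python
import numpy as np

def keyframe_indices(G, lam):
    # G:   (n, m) pose-ordered projections sampled along the mean manifold g(p)
    # lam: (m,)   eigenvalues of the aWalk aSpace
    d = np.sum(G**2 / lam, axis=1)                  # assumed Mahalanobis distance per sample
    # interior local maxima of d correspond to the least likely postures (key-frames)
    is_max = (d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])
    return np.where(is_max)[0] + 1
```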
Fig. 14.5. Distance measure after pose ordering applied to the points of the mean manifold in the
aWalk aSpace. Maxima (i.e., the key-frames) also correspond to important changes of direction of the
manifold.
Once the key-frame set K is found, the final human action model is repre-
sented as a parametric manifold f (p), called p–action, which is built by interpola-
tion between the peaks of the distance function defined in Eq. (14.13). We refer
the reader to2 for additional details. Fig. 14.6 shows the final aWalk model Γ,
defined as the combination of the aWalk aSpace Ω, the key-frames K and the
p–action f :
Γ = (Ω, K, f ). (14.15)
Fig. 14.6. Prototypical performance manifold, or p–action, in the aWalk aSpace. Depicted human
postures correspond to the key-frame set.
that is, the sets of human postures exhibited during several aWalk performances for a
male and a female agent, respectively.
Next, we project the human postures of HWM and HWF in the aWalk aSpace,
as shown in Fig. 14.7. The cyclic nature of the aWalk action explains the resulting
circular clouds of projections. Also, note that both performances do not intersect,
that is, they do not exhibit the same set of human postures. This is due to the
high variability inherent in human performances. Consequently, we can identify a
posture as belonging to a male or female walker.
However, the scope of this paper is not centered on determining a discrimina-
tive procedure between generic male and female walkers. Instead, we look for a
(a) (b)
Fig. 14.7. Male and female postures projected in the aWalk aSpace, by considering two (a) and three
(b) eigenvectors for the aSpace representation.
where f WM and f WF refer to the male and female p–actions, respectively. These
manifolds have been obtained by interpolation between the key-frames of their
respective key-frame set, i.e., KWM and KWF . Fig. 14.8 shows the resulting
p–action representations in the aWalk aSpace Ω.
In order to compare the human posture variation for both performances, we sample
both p–actions to describe each manifold as a sequence of projections:
Fig. 14.8. Male and female performance representations in the aWalk aSpace.
Once the male and female p–actions are synchronized, the angle variation for
different limbs of the human body model can be analysed. Fig. 14.9.(a), (b), (c),
and (d) show the evolution of the elevation angle for four limbs of the human body
model, namely the shoulder, torso, left arm, and right thigh, respectively.
(a) (b)
(c) (d)
Fig. 14.9. The elevation variation for the shoulder (a), torso (b), left arm (c), and right thigh (d) limbs
are depicted for a male and a female walker.
By comparing the depicted angle variation values of both walkers, several differences can be observed. The female walker moves her shoulder to a greater degree than the male walker; that is, the swing movement of the shoulder is more accentuated for the female. The female also bends her torso at a greater inclination, so the swing movements of the shoulder and torso are less pronounced for the male agent. The female walker also exhibits an emphasized swing movement in her left arm, whereas the male agent does not show a relevant swing movement of his left arm. As expected, when the left arm swings backward, the right thigh swings forward, and vice versa. When comparing the angle variation of the right thigh for both walkers, few dissimilarities can be derived. In fact, most differences between the male and the female performances have been found in the elevation values of the limbs corresponding to the upper part of the human body.
These results are also supported by other authors,26 who explicitly separate walking modeling into the upper and lower body. The lower body is usually devoted to locomotion, that is, to modeling the walking motion so that it is physically valid, which is usually achieved by applying inverse kinematics. The movement of the upper body, on the other hand, adds reality and naturalness to the human walking, and is mainly attained by means of interpolation techniques. Afterwards, the upper and lower body motions should be synchronized to create natural-looking walking.
Acknowledgements
This work has been supported by EC grants IST-027110 for the HERMES project
and IST-045547 for the VIDI-Video project, and by the Spanish MEC under
projects TIN2006-14606 and DPI-2004-5414. Jordi Gonzàlez also acknowledges
the support of a Juan de la Cierva Postdoctoral fellowship from the Spanish MEC.
References
1. J. Davis and A. Bobick. The representation and recognition of movement using tem-
poral templates. In Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition (CVPR’97), pp. 928–934, San Juan, Puerto Rico, (1997).
2. J. Gonzàlez, X. Varona, F. Roca, and J. Villanueva. aSpaces: Action spaces for recog-
nition and synthesis of human actions. In Proc. Second International Workshop on
Articulated Motion and Deformable Objects (AMDO 2002), pp. 189–200, Palma de
Mallorca, Spain, (2002).
3. F. Perales, A. Igelmo, J. Buades, P. Negre, and G.Bernat. Human motion analysis &
synthesis using computer vision and graphics techniques. Some applications. In IX
Spanish Symposium on Pattern Recognition and Image Analysis, vol. 1, pp. 271–277,
Benicassim, Spain (16-18 May, 2001).
4. M. Gleicher and N. Ferrier. Evaluating video-based motion capture. In Proceedings of
Computer Animation, pp. 75–80, Geneva, Switzerland (June, 2002).
5. R. Boulic, N. Magnenat-Thalmann, and D. Thalmann, A global human walking
model with real-time kinematics personification, The Visual Computer. 6(6), 344–358,
(1990).
6. D. Thalmann and J. Monzani. Behavioural animation of virtual humans : What kind
of law and rules? In ed. I. C. Press, Proc. Computer Animation 2002, pp. 154–163,
(2002).
7. M. Unuma, K. Anjyo, and R. Takeuchi. Fourier principles for emotion-based human
figure animation. In Proceedings of SIGGRAPH 95, pp. 91–96 (August, 1995).
8. R. Boulic, R. Mas, and D. Thalmann, A robust approach for the center of mass position
control with inverse kinetics, Journal of Computers and Graphics. 20(5), 693–701,
(1996).
9. M. Gleicher, Retargetting motion to new characters, Computer Graphics, Proceedings
of ACM SIGGRAPH 98, pp. 33–42, (1998).
10. M. Gleicher, Comparing constraint-based motion editing methods, Graphical Models.
pp. 107–134, (2001).
11. Z. Popović and A. Witkin. Physically based motion transformation. In Proceedings of
ACM SIGGRAPH 99, pp. 11–20 (August, 1999).
12. N. Badler, C. Phillips, and B. Webber, Simulating Humans. Computer Graphics Ani-
mation and Control. (Oxford University Press, 1993).
13. K. Perlin and A. Goldberg. Improv: a system for scripting interactive actors in virtual
worlds. In Proceedings of ACM SIGGRAPH 96, pp. 205–216, (1996).
14. A. Baumberg and D. Hogg, Generating spatio temporal models from examples, Image
and Vision Computing. 14, 525–532, (1996).
15. R. Bowden. Learning statistical models of human motion. In Proceedings of the IEEE
Workshop on Human Modeling, Analysis and Synthesis, pp. 10–17, (2000).
16. P. Glardon, R. Boulic, and D. Thalmann. Pca-based walking engine using motion cap-
ture data. In Computer Graphics International, pp. 292–298, Crete, Greece, (2004).
17. N. Troje, Decomposing biological motion: a framework for analysis and synthesis of
human gait patterns, Journal of Vision. 2, 371–387, (2002).
18. T. Heap and D. Hogg, Extending the point distribution model using polar coordinates,
Image and Vision Computing. 14, 589–599, (1996).
19. J. Cheng and M. Moura, Capture and representation of human walking in live video
sequences, IEEE Transactions on Multimedia. 1(2), 144–156, (1999).
20. D. Ballard and C. Brown, Computer Vision. (Prentice-Hall, Englewood Cliffs, NJ,
1982).
21. W. Press, B. Flannery, S. Teukolsky, and W. Vetterling, Numerical Recipes in C. (Cam-
bridge University Press, Cambridge, 1988).
22. H. Borotschnig, L. Paletta, M. Prantl, and A. Pinz, Appearance-based active object
recognition, Image and Vision Computing. 18, 715–727, (2000).
23. H. Murase and S. Nayar, Visual learning and recognition of 3-D objects from appear-
ance, International Journal of Computer Vision. 14, 5–24, (1995).
24. B. Guenter and R. Parent, Computing the arc length of parametric curves, IEEE Com-
puter Graphics and Applications. 10(3), 72–78 (May, 1990).
25. J. Gonzàlez, J. Varona, F. Roca, and J. Villanueva. A human action comparison frame-
work for motion understanding. In Artificial Intelligence Research and Developments.
Frontiers in Artificial Intelligence and Applications, vol. 100, pp. 168–177. IOS Press,
(2003).
26. K. Ashida, S. Lee, J. Allbeck, H. Sun, N. Badler, and D. Metaxas. Pedestrians: Creat-
ing agent behaviors through statistical analysis of observation data. In Proceedings of
Computer Animation, pp. 84–92, Seoul, Korea, (2001).
CHAPTER 15
We propose a two-step method for detecting human heads with their orientations.
In the first step, the method employs an ellipse as the contour model of human-
head appearances to deal with a wide variety of appearances. Our method then
evaluates the ellipse to detect possible human heads. In the second step, on the
other hand, our method focuses on features inside the ellipse, such as eyes, the
mouth or cheeks, to model facial components. The method evaluates not only
such components themselves but also their geometric configuration to eliminate
false positives in the first step and, at the same time, to estimate face orientations.
Our intensive experiments show that our method can correctly and stably detect
human heads with their orientations.
15.1. Introduction
Recently, a visual object detection framework was proposed and applied to face
detection.16,17 Though the framework is capable of processing images rapidly while achieving a high detection rate, it focuses on detecting human faces as rectangular regions and does not pay any attention to the contours of their appearances.
To build a fully automated system that recognizes human faces from images,
it is essential to develop robust and efficient algorithms to detect human heads
and, at the same time, to identify face orientations. Given a single image or a
sequence of images, the goal of automatic human-face recognition is to detect
human heads/faces and estimate their orientations regardless of not only their po-
sitions, scales, orientations, poses, but also individuals, background changes and
lighting conditions.
This paper proposes a two-step method for detecting human heads and, at the
same time, for estimating face orientations by a monocular camera. In both steps, we employ models of the human-head contour and face orientations to enhance robustness and stability in detection. We also introduce model evaluation
with only image-features robust against lighting conditions, i.e., the gradient of
intensity and texture.
In the first step, our method employs an ellipse as the contour model of human-
head appearances to deal with a wide variety of appearances. The ellipse is gener-
ated from one ellipsoid based on the camera position with its angle of depression
in the environment. Our method then evaluates the ellipse over a given image to
detect possible human heads. In evaluation of an ellipse, two other ellipses are
generated inside and outside of the ellipse, and the gradient of intensity along
the perimeter of the three ellipses is used for accurate detection of human-head
appearances.
In the second step, on the other hand, our method focuses on facial com-
ponents such as eyes, the mouth or cheeks to generate inner models for face-
orientation estimation. Based on the camera position with its angle of depression,
our method projects the facial components on the ellipsoid onto the ellipse to gen-
erate inner models of human-head appearances. Our method then evaluates not
only such components themselves but their geometric configuration to eliminate
false positives in the first step and, at the same time, to estimate face orienta-
tions. Here the Gabor-Wavelets filter, whose robustness and stability against changes in scale, orientation and illumination have been verified, is used for detecting features representing the facial components.
Consequently, our method can correctly and stably detect human heads and es-
timate face orientations even under environments such as illumination changes or
face-orientation changes. Our intensive experiments using a face-image database
Human beings have almost the same head contour in shape, and an ellipse approximates the appearance of this contour. These observations remain invariant
against changes in face orientation. We, therefore, model the contour of human-
head appearances by the ellipse.1,12,19
An ellipse has five parameters in the image (Fig. 15.1): the 2D coordinates
(x, y) of the ellipse center, the length a of the semiminor axis, the oblateness r,
and the slant ψ of the ellipse.
Fig. 15.1. The five ellipse parameters: the center (x, y), the semiminor axis a, the semimajor axis ar, and the slant ψ.
x^2 + y^2 + z^2/r^2 = 1, (15.1)
where r ≥ 1. We then derive an ellipse as the contour model of human-head
appearances depending on the angle of depression of the camera (Fig. 15.2).
When we set up a camera with any angle of depression, the ellipsoid (15.1) is
observed as an ellipse. The length of the semiminor axis of the ellipse is always
one. The length of the semimajor axis, on the other hand, is between one and r
depending on the angle of depression of the camera.
Now we determine the oblateness r′ (1 ≤ r′ ≤ r) of the ellipse observed by a camera with angle of depression ϕ, provided that the distance of the camera position from the ellipsoid is large enough. We consider the ellipse obtained through
the projection of (15.1) onto the xz−plane and its tangential line (Fig. 15.3).
We see that the ellipse, the projection of (15.1) onto the xz−plane, is repre-
sented by
x^2 + z^2/r^2 = 1. (15.2)
Let its tangential line with slant ϕ from the x−axis be
z = (sin ϕ) x + b, (15.3)
Fig. 15.3. The ellipse obtained by projecting the ellipsoid onto the xz-plane, its tangential line l with slant ϕ, and the projection screen.
For evaluating an ellipse, we construct two other ellipses (Fig. 15.5). One is a
smaller size ellipse with the identical center and the other is a larger size ellipse
with the identical center. In Fig. 15.5, the red ellipse is to be evaluated and the blue
ellipse is the smaller size one and the green is the larger size one. We denote by
orbit(i) the intensity of the intersection point of the (red) ellipse to be evaluated
and ray i whose end point is the ellipse center. We remark that we have N rays
with the same angle-interval and they are sorted by the angle from the horizontal
axis in the image. outer(i) and inner(i) are defined in the same way for the cases
of the larger size ellipse (green ellipse) and the smaller size ellipse (blue ellipse),
respectively.
f(p) = (k/N) Σ_{i=1}^{N} {G(i) − O(i) − I(i)}, (15.4)
Note that k is the constant making the value dimensionless.1 Equations (15.5), (15.6), and (15.7) evaluate the gradient magnitude of intensity at the ellipse perimeter, inten-
sity changes along the ellipse perimeter and intensity changes from just inside the
ellipse, respectively. Ellipses having a small value of (15.4) are then regarded as
applicants of human-head appearances. We remark that our ellipse evaluation is
effective even if a face region is darker than the surrounding background. This is
because our evaluation is based on not intensity itself but the gradient magnitude
of intensity.
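As an illustration of this evaluation, the sketch below samples the three ellipses along N rays and combines the samples as in Eq. (15.4); since Eqs. (15.5)–(15.7) are not reproduced in this excerpt, G(i), O(i) and I(i) are approximated by simple intensity differences, and the inner/outer scale factors are arbitrary choices.

```python
import numpy as np

def sample_ellipse(img, cx, cy, a, r, psi, scale=1.0, N=64):
    # img: grayscale image as a float array; returns intensities on N rays
    t = np.linspace(0.0, 2.0 * np.pi, N, endpoint=False)
    ex, ey = scale * a * np.cos(t), scale * a * r * np.sin(t)   # ellipse in its own frame
    x = cx + ex * np.cos(psi) - ey * np.sin(psi)                # rotate by the slant psi
    y = cy + ex * np.sin(psi) + ey * np.cos(psi)
    return img[y.astype(int).clip(0, img.shape[0] - 1),
               x.astype(int).clip(0, img.shape[1] - 1)]

def evaluate_ellipse(img, p, N=64, k=1.0):
    # p = (cx, cy, a, r, psi); score of Eq. (15.4) with assumed G, O, I terms
    orbit = sample_ellipse(img, *p, scale=1.0, N=N)
    inner = sample_ellipse(img, *p, scale=0.9, N=N)
    outer = sample_ellipse(img, *p, scale=1.1, N=N)
    G = np.abs(outer - inner)               # gradient magnitude across the perimeter
    O = np.abs(orbit - np.roll(orbit, 1))   # intensity change along the perimeter
    I = np.abs(orbit - inner)               # intensity change from just inside
    return (k / N) * np.sum(G - O - I)
```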
Fig. 15.5. The evaluated ellipse with the smaller and larger ellipses; rays i−1, i, i+1 from the ellipse center define orbit(i), inner(i) and outer(i).
We investigate inside the ellipse in more detail to detect human heads and face
orientations, provided that applicants of the human heads are already detected
as ellipses. In detection, these pre-obtained applicants facilitate determination of
parameters such as scale or direction.
Eyebrows, eyes, the mouth, the nose and cheeks are the features inherent in the
human face. Here we focus on eyes, the mouth and cheeks, and characterize tex-
tures around such facial components. We remark that textures are robust against
illumination changes.
In oriental countries, we observe around eyes (1) a dark area due to eyebrows,
(2) a bright area due to eyelids, and (3) a dark area due to the pupil (see Fig. 15.6
(a)). These are observations along the vertical direction of the human face and
these characterize the texture of an eye area. We also observe that the eye area is
symmetrical with respect to the pupil. As for an area around the mouth, on the
other hand, we observe (1) a bright area due to the upper lip, (2) a dark area due to
the mouth, and (3) a bright area due to the lower lip (see Fig. 15.6 (b)). In addition,
the mouth area is also symmetrical with respect to the vertical center of the face.
These observations characterize the texture of a mouth area. We see no complex
textures in a cheek area. These observations are almost invariant and stable under
changes in illumination, in face-orientation and in scale.
The geometric configuration of the facial components, i.e., the relative posi-
tion between eyes, the mouth and cheeks is also invariant. Combining the char-
acteristic of textures of the facial components with their geometric configuration
(Figure: surface plot of the real part of the Gabor-Wavelets filter kernel, gabor_real.dat.)
We characterized in Section 15.3.1 textures around eyes and the mouth along
the vertical direction of the human face. To detect these textures we only have
to select the parameters in the Gabor-Wavelets filter so that the filter detects the
textures along the semimajor axis of the ellipse. Points with maximal values in
the response ellipse-region can be eyes and those with minimal values can be a
mouth. The area with no singularity, on the other hand, can be cheeks.
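A small sketch of this filtering step, using a hand-rolled real Gabor kernel oriented along the face's vertical direction; the kernel size, wavelength and bandwidth below are illustrative values, not the parameters used by the authors.

```python
import numpy as np
from scipy.ndimage import convolve

def gabor_real(size=21, wavelength=8.0, sigma=4.0, theta=np.pi / 2):
    # real part of a Gabor kernel tuned along direction theta
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # coordinate along the filter direction
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

# response = convolve(ellipse_region.astype(float), gabor_real())
# Maxima of the response suggest eye areas, minima a mouth area, and
# regions without strong extrema the cheeks.
```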
k = cos ϕ cos θ,  l = cos ϕ sin θ,  m = sin ϕ.
(Figure: the camera viewing direction D = (k, l, m) with depression angle ϕ, a point P = (xp, yp, zp) on the ellipsoid S, and its projection p = (xs, ys) on the image plane.)
In this way, when depression angle ϕ and rotation angle θ are specified, we
can project a facial area of the ellipsoid onto the image plane to obtain an inner
model of the human-head appearance that represents the facial components with
their geometric configuration.
Figure 15.9 shows the inner models of human-head appearances with ϕ = 0
(upper: θ = 0o , 30o , 60o , 90o , lower: θ = 180o , 270o , 300o, 330o ). R1 and R2
denote the eye areas. R3 denotes the mouth area, and R4 and R5 denote the cheek
areas.
To the response ellipse-region of the Gabor-Wavelets filter, we apply the inner
model matching to detect human-head appearances and face orientations. To be
more concrete, if we find eyes, a mouth and cheeks in a response ellipse, we then
identify that the ellipse is a human-head appearance and that the orientation of
Fig. 15.9. Inner models of human-head appearances with the facial components.
15.4. Algorithm
Based on the discussion above, we describe here the algorithm for detecting
human-head appearances with face orientations.
To reduce the computational cost in generating applicants of human-head ap-
pearances, we introduce the coarse-to-fine sampling of the parameters represent-
ing ellipses. Namely, we first coarsely sample points in the parameter space for
the ellipse and then minutely sample the area around the points that are selected
based on plausibility of the human-head appearance. Moreover, in the coarse sam-
pling, we fix the parameters depending only on the pose of a human head to enhance
position identification of the human head. In the fine sampling, we sample all the
parameters. The following algorithm effectively detects human heads and, at the
same time, estimates their orientations.
2.3: (Fine sampling): more minutely sample points in the area around each
entry of {pi∗ } (more specifically, more minutely sample parameters
(x, y, a, r, ψ) around each entry of (xi∗ , yi∗ , ai∗ ) where (xi∗ , yi∗ , ai∗ ) is
the position parameters of {pi∗ }); let {p∗j } be the sampled set. (Note that
{p∗j } is applicants of human-head appearances.)
Step 3: To each entry of {p∗j }, generate inner models of human-head appear-
ances.
Step 4: Apply the Gabor-Wavelets filter to each entry of {p∗j } to detect facial
feature points.
Step 5: To each p∗j , apply the matching with the corresponding inner models.
Step 6: If p∗j matches one of its corresponding inner models with a high score,
then recognize p∗j as a human-head appearance and the face orientation as that
of the matched inner-model. If p∗j does not match any of its corresponding
inner models with a high score, then eliminate p∗j .
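The coarse-to-fine sampling described above can be sketched as follows; the parameter ranges, step sizes and the number of retained hypotheses are illustrative assumptions, and evaluate() stands for the ellipse score of Eq. (15.4), with small values treated as better.

```python
from itertools import product

def coarse_to_fine(img, evaluate, coarse_step=8, fine_step=2, keep=20):
    h, w = img.shape
    # coarse stage: vary only the position parameters (x, y, a), fix r and psi
    coarse = [(x, y, a, 1.2, 0.0)
              for x, y, a in product(range(0, w, coarse_step),
                                     range(0, h, coarse_step),
                                     range(8, 40, 8))]
    best = sorted(coarse, key=lambda p: evaluate(img, p))[:keep]
    # fine stage: resample all parameters (x, y, a, r, psi) around each survivor
    fine = []
    for (x, y, a, r, psi) in best:
        for dx, dy, da in product((-fine_step, 0, fine_step), repeat=3):
            for rr, pp in product((1.1, 1.2, 1.3), (-0.2, 0.0, 0.2)):
                fine.append((x + dx, y + dy, a + da, rr, pp))
    return fine   # candidate human-head appearances {p*_j}
```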
We remark that iterating the steps above enables us to track human heads
with their orientations. Though introducing a transition model of motion to our
detection algorithm leads to more effective tracking, it is beyond the scope of this
paper.
We first evaluated our algorithm using a face-image database. The database con-
tains face images of 300 persons with the ages ranging uniformly from 15 to
65 years old, including men and women. Face images of each person are taken from different directions as shown in Fig. 15.10. To each face image in the database, the ground truth of the direction from which the image was taken is attached.
We used 9600 (= 32 × 300) face images in the database where 32 directions
are used in taking images of each person: the angles of depression of the camera
were ϕ = 0o , 15o , 30o , 45o and the rotation angles with respect to the horizon-
tal axis, i.e., face orientations, were 0o , 30o , 60o , 90o , 180o , 270o , 300o, 330o .
Fig. 15.11 shows samples of the face images of one person in the database.
We applied our algorithm to the 9600 images to detect face orientations. Table
15.1 shows the recognition rates of the estimated face-orientations.
Table 15.1 shows that face orientations are in general recognized with high
scores. We see low accuracy in orientations with 90o and 270o . This is because
one eye and one cheek do not appear in the face with such orientations and thus
the inner model matching becomes unstable. We also see that accuracy becomes
higher as the angle of depression of the camera becomes smaller. The small angle
of depression of the camera means that the face is captured from the horizontal
direction of the face and that the facial components clearly appear in the image.
It is understood that clearly appearing facial components improves the estimation
accuracy of face orientations. A large angle of depression, on the other hand,
causes great changes not only in human-head appearance but also in face appear-
ance. Handling such great changes with our models has limitations. This is because
we generate a contour model and inner models of human-head appearances from
Table 15.1. Recognition rates (%) of the estimated face orientations.

                            face orientations
angle of depression    0°    30°    60°    90°   180°   270°   300°   330°
        0°            84.7   86.3   83.7   31.0   97.0   34.7   80.0   79.3
       15°            64.7   86.3   75.3   27.7   97.7   21.0   71.7   71.7
       30°            23.7   75.3   70.3   14.0   99.0   10.0   51.0   51.7
       45°            17.7   61.7   51.0   16.0   94.7    8.3   27.0   27.0
only one ellipsoid. On the other hand, we see that face images from the orienta-
tion with 180o , back images of human heads, are recognized stably and accurately
independent of the change in angle of depression. This is due to stableness of the
Gabor-Wavelets filter in face-feature detection.
an ellipse onto the head appearance in each image to obtain the true ellipse as the
reference. We then computed the distance (the position error) between the center
of the detected ellipse and that of the true ellipse. We also computed the ratio (the
size error) of the semiminor length of the detected ellipse to that of the true ellipse.
These results are shown in Fig. 15.13. The average and the standard deviation of the position errors were 8.85 pixels and 22.4 pixels, respectively. Those of the size errors were 0.0701 and 0.0485, respectively. Note that the difference of the size errors from 1.0 was employed in this computation.
We see that our method for detecting human heads and face orientations is
practical overall in the real situation.
Fig. 15.13. Position errors (in pixels, left) and size errors (right) of the detected ellipses.
magnitude of intensity at the ellipse perimeter. For the position error and the
difference of the size error from 1.0, the average and standard deviation over the
image sequence were calculated, which is shown in Table 15.2.
Figures 15.15, 15.16 and Table 15.2 show the effectiveness of our method. Su-
periority of our method to the simple-evaluation method indicates that introducing
the smaller- and larger-size ellipses to ellipse evaluation improves the accuracy in
detecting the positions of human-head appearances.
Fig. 15.15. Position errors in human-head detection (a: our method, b: simple evaluation): estimated head trajectories against the ground truth, and the position error in pixels per frame.
Fig. 15.16. Size errors in human-head detection (a: our method, b: simple evaluation).
15.6. Conclusion
We proposed a two-step method for detecting human heads and estimating face
orientations by a monocular camera. In both steps, we employ models of the human-head contour and face orientations to enhance robustness and stability in
detection. We also introduced model evaluation with only image-features robust
against lighting conditions.
The first step employs an ellipse as the contour model of human-head appear-
ances to deal with a wide variety of appearances. The ellipse was generated from
one ellipsoid based on a camera position with its angle of depression in the en-
vironment. We then evaluated the ellipse over a given image to detect possible
human-head appearances where we generated two other ellipses inside and out-
side of the ellipse to improve accuracy in detection of human-head appearances.
The second step, on the other hand, focuses on facial components such as
eyes, the mouth or cheeks to generate inner models for face-orientation estima-
tion. We evaluated not only such components themselves but also their geometric
configuration to eliminate false positives in the first step and, at the same time, to estimate face orientations.
Acknowledgements
The authors are thankful to Naoya Ohnishi for helping to perform an experiment
in the real situation. The facial data in this paper are used with permission of Softopia
Japan, Research and Development Division, HOIP Laboratory. It is strictly pro-
hibited to copy, use, or distribute the facial data without permission. This work is
in part supported by Grant-in-Aid for Scientific Research of the Ministry of Ed-
ucation, Culture, Sports, Science and Technology of Japan under the contract of
13224051.
References
1. S. Birchfield: Elliptical Head Tracking Using Intensity Gradients and Color His-
tograms, Proc. of CVPR, pp. 232–237, 1998.
2. T. J. Cham and J. M. Rehg: A Multiple Hypothesis Approach to Figure Tracking,
Proc. of CVPR, Vol. 2, pp. 239–245, 1999.
3. R. Chellappa, C. L. Wilson and S. Sirohey: Human and Machine Recognition of Faces,
A Survey, Proc. of IEEE, 83 (1995), pp. 705–740.
4. Y. Cui, S. Samarasekera, Q. Huang and M. Greiffenhagen: Indoor Monitoring via the
Collaboration between a Peripheral Sensor and a Foveal Sensor, Proc. of the IEEE
Workshop on Visual Surveillance, pp. 2–9, 1998.
5. G. Donato, M. S. Bartlett, J. C. Hager, P. Ekman and T. J. Sejnowski: Classifying
Facial Actions, IEEE Trans. on PAMI, 21 (1999), 10, pp. 974–989.
6. L. Davis, S. Fejes, D. Harwood, Y. Yacoob, I. Hariatoglu and M. J. Black: Visual
Surveillance of Human Activity, Proc. of the 3rd ACCV, Vol.2, pp. 267–274, 1998.
7. D. M. Gavrila: The Visual Analysis of Human Movement: A Survey, Computer Vision
and Image Understanding, 73 (1999), 1, pp. 82–98.
8. I. Haritaoglu, D. Harwood and L. S. Davis: W4S: A Real-Time System for Detecting and Tracking People in 2½D, Proc. of the 5th ECCV, Vol. 1, pp. 877–892, 1998.
9. I. Haritaoglu, D. Harwood and L. S. Davis: An Appearance-based Body Model for
Multiple People Tracking, Proc. of the 15th ICPR, Vol. 4, pp. 184–187, 2000.
10. T. -K. Kim, H. Kim, W. Hwang, S. -C. Kee and J. Kittler: Independent Component
Analysis in a Facial Local Residue Space, Proc. of CVPR, 2003.
11. T. S. Lee: Image Representation Using 2D Gabor Wavelets, IEEE Trans. on PAMI, 18
(1996), 10, pp. 959–971.
12. A. Sugimoto, K. Yachi and T. Matsuyama: Tracking Human Heads Based on Inter-
action between Hypotheses with Certainty, Proc. of the 13th Scandinavian Conf. on
Image Analysis, (J. Bigun and T. Gustavsson eds: Image Analysis, Lecture Notes in
Computer Science, Vol. 2749, Springer), pp. 617–624, 2003.
13. M. Swain and D. Ballard: Color Indexing, Int. J. of Computer Vision, 7 (1991), 1, pp.
11–32.
14. Y. Tian, T. Kanade and J. F. Cohn: Recognizing Facial Actions by Combining Geo-
metric Features and Regional Appearance Patterns, CMU-RI-TR-01-0, Robotics In-
stitute, CMU, 2001.
15. Y. Tian, T. Kanade and J. F. Cohn: Evaluation of Gabor-Wavelets Based Facial Action
Unit Recognition in Image Sequences of Increasing Complexity, Proc. of the 5th Int.
Conf. on Automatic Face and Gesture Recognition, pp. 229–234, 2002.
16. P. Viola and M. Jones: Rapid Object Detection using a Boosted Cascade of Simple
Features, Proc. of CVPR, Vol. I, pp. 511-518, 2001.
17. P. Viola and M. Jones: Robust Real-Time Face Detection, Int. J. of Computer Vision,
Vol. 57, No. 2, pp. 137–154, 2004.
18. Y. Wu and K. Toyama: Wide-Range, Person- and Illumination-Insensitive Head Ori-
entation Estimation, Proc. of the 4th IEEE Int. Conf. on Automatic Face and Gesture
Recognition, pp. 183–188, 2000.
19. K. Yachi, T. Wada and T. Matsuyama: Human Head Tracking using Adaptive Appear-
ance Models with a Fixed-Viewpoint Pan-Tilt-Zoom Camera, Proc. of the 4th IEEE
Int. Conf. on Automatic Face and Gesture Recognition, pp. 150–155, 2000.
20. Z. Zeng and S. Ma: Head Tracking by Active Particle Filtering, Proc. of the 5th IEEE
Int. Conf. on Automatic Face and Gesture Recognition, pp. 89–94, 2002.
21. L. Zhang and D. Samaras: Face Recognition under Variable Lighting using Harmonic
Image Exemplars, Proc. of CVPR, 2003.
22. Z. Zhang, M. Lyons, M. Schuster and S. Akamatsu: Comparison between Geometry-
based and Gabor-Wavelets-based Facial Expression Recognition using Multi-layer
Perceptron, Proc. Int. Workshop on Automatic Face and Gesture Recognition, pp.
454–459, 1998.
23. W. Y. Zhao, R. Chellappa, A. Rosenfeld and P. J. Phillips: Face Recognition: A Liter-
ature Survey, CAR-TR-984, UMD, 2000.
24. S. Zhou, V. Krueger and R. Chellappa: Face Recognition from Video: A Conden-
sation Approach, Proc. of the 5th IEEE Int. Conf. on Automatic Face and Gesture
Recognition, pp. 221–226, 2002.
CHAPTER 16
This paper presents a new approach for human walking modeling from monoc-
ular image sequences. A kinematics model and a walking motion model are
introduced in order to exploit prior knowledge. The proposed technique consists
of two steps. Initially, an efficient feature point selection and tracking approach
is used to compute feature points’ trajectories. Peaks and valleys of these tra-
jectories are used to detect key frames—frames where both legs are in contact
with the floor. Secondly, motion models associated with each joint are locally
tuned by using those key frames. Differently from previous approaches, this tuning process is not performed at every frame, reducing CPU time. In addition, the movement's frequency is defined by the elapsed time between two consecutive key frames, which allows handling walking displacement at different speeds.
Experimental results with different video sequences are presented.
16.1. Introduction
poor quality images. This approach incorporates physical forces to each rigid part
of a kinematics 3D human body model consisting of truncated cones. These forces
guide each 3D model’s part towards a convergence with the body posture in the
image. The model’s projections are compared with the silhouettes extracted from
the image by means of a novel approach, which combines the Maxwell’s demons
algorithm with the classical ICP algorithm. Although stereoscopic systems pro-
vide us with more information for the scanned scenes, 3D human motion systems
with only one camera-view available is the most frequent case.
Motion modeling using monocular image sequences constitutes a complex and
challenging problem. Similarly to approach,3 but in a 2D space and assuming a
segmented video sequence is given as an input,4 proposes a system that fits a
projected body model with the contour of a segmented image. This boundary
matching technique consists of an error minimization between the pose of the
projected model and the pose of the real body—all in a 2D space. The main
disadvantage of this technique is that it needs to find the correspondence between
the projected body parts and the silhouette contour, before starting the matching
approach. This means that it looks for the point of the silhouette contour that
corresponds to a given projected body part, assuming that the model posture is not
initialized. This problem is still more difficult to handle in those frames where
self-occlusions appear or edges cannot be properly computed.
Differently than the previous approaches, the aspect ratio of the bounding box
of the moving silhouette has been used in.5 This approach is able to cope with both
lateral and frontal views. In this case the contour is studied as a whole and body
parts do not need to be detected. The aspect ratio is used to encode the pedestrian’s
walking way. However, although shapes are one of the most important semantic
attributes of an image, problems appear in those cases where the pedestrian wears
loose clothes or carries objects such as a suitcase, handbag or backpack.
Carried objects distort the human body silhouette and therefore the aspect ratio of
the corresponding bounding box.
In order to be able to tackle some of the problems mentioned above, some
authors propose simplifying assumptions. In6 for example, tight-fitting clothes
with sleeves of contrasting colors have been used. Thus, the right arm is depicted
with a different color than the left arm and edge detection is simplified especially
in case of self-occlusions.7 proposes an approach where the user selects some
points on the image, which mainly correspond to the joints of the human body.
Points of interest are also marked in8 using infrared diode markers. The authors
present a physics-based framework for 3D shape and non-rigid motion estimation
based on the use of a non-contact 3D motion digitizing system. Unfortunately,
when a 2D video sequence is given, it is not likely to affect its content afterwards
We may safely assume that the center of gravity of a walking person moves with
approximately constant velocity. However, the speed of other parts of the body
fluctuates. There is one instant per walking cycle (without considering the start-
ing and ending positions) in which both feet are in contact with the floor, in other
words with null velocity. This happens when the pedestrian changes from one
pivot foot to the other. At that moment the articulated structure (Fig. 16.1) reaches
the maximum hip angles. Frames containing these configurations will be called
key frames and can be easily detected by extracting static points (i.e., pixels defin-
ing the boundary of the body shape contained in the segmented frames that remain
static at least in three consecutive frames) through the given video sequence. In-
formation provided by these frames is used to tune motion model parameters. In
addition, the elapsed time between two consecutive key frames defines the du-
ration of a half walking period—indirectly the speed of the movement. Motion
models are tuned and used to perform the movement from the current key frame
to the next one. This iterative process is applied until all key frames are covered
by the walking model (input video sequence).
Human motion modeling based on tracking of point features has recently been
used in.9 Human motion is modeled by the joint probability density function of
the position and velocity of a collection of body parts, while no information about
kinematics or dynamics of the human body structure is considered. This technique
has been tested only on video sequences containing pedestrians walking on a plane
orthogonal to the camera’s viewing direction.
Assuming that a segmented video sequence is given as an input (in the cur-
rent implementation some segmented images were provided by the authors of10
and others were computed by using the algorithm presented in11 ) the proposed
technique consists of two stages. In the first stage feature points are selected and
tracked throughout the whole video sequence in order to find key frames’ posi-
tions. In the second stage a generic motion model is locally tuned by using kine-
matics information extracted from the key frames. The main advantage comparing
with previous approaches is that matching between the projection of the 3D model
and the body silhouette image features is not performed at every frame (e.g., hip
tuning is performed twice per walking cycle). The algorithm’s stages are fully
described below together with a brief description of the 3D representation used to
model the human body.
Modeling the human body implies firstly the definition of a 3D articulated struc-
ture, which represents the body’s biomechanical features; and secondly the defi-
nition of a motion model, which governs the movement of that structure.
Several 3D articulated representations have been proposed in the literature.
Generally, a human body model is represented as a chain of rigid bodies, called
links, interconnected to one another by joints. Links are generally represented
by means of sticks,7 polyhedron,12 generalized cylinders13 or superquadrics.6 A
joint interconnects two links by means of rotational motions about the axes. The
number of independent rotation parameters defines the DOFs associated with that
joint.
Considering that a human body has about six hundred muscles, forty just for
a hand, the development of a highly realistic model is a computational expensive
task, involving a high dimensionality problem. In computer vision, where mod-
els with only medium precision are required, articulated structures with less than
thirty degrees of freedom are generally adequate. (e.g.3,6 ). In this work, an artic-
ulated structure defined by 16 links is initially considered. This model consists
of 22 DOF, without modeling the palm of the hand or the foot and using a rigid
head-torso approximation (four for each arm and leg and six for the torso, which
are three for orientation and three for position). However, in order to reduce the
complexity, a simplified model of 12 DOF has been finally chosen. This sim-
plification assumes that in walking, legs’ and arms’ movements are contained in
parallel planes (see illustration in Fig. 16.1). In addition, the body orientation is
always orthogonal to the floor, thus the orientation is reduced to only one DOF.
Hence, the final model is defined by two DOF for each arm and leg and four for
the torso (three for the position plus one for the orientation).
The simplest 3D articulated structure is a stick representation with no associated volume.
Fig. 16.1. Simplified articulated structure defined by 12 DOFs, arms and legs rotations are contained
in planes parallel to the walking’s direction.
x(θ, φ) = [ α1 cos^{ε1}(θ) cos^{ε2}(φ),  α2 cos^{ε1}(θ) sin^{ε2}(φ),  α3 sin^{ε1}(θ) ]^T (16.1)
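For reference, the superquadric surface of Eq. (16.1) can be sampled as in the sketch below; the signed-power convention for the exponents ε1, ε2 is an assumption here, since the surrounding derivation is not reproduced in this excerpt.

```python
import numpy as np

def superquadric(a1, a2, a3, e1, e2, n=32):
    # sample the superquadric surface x(theta, phi) of Eq. (16.1)
    spow = lambda v, e: np.sign(v) * np.abs(v)**e   # signed power keeps the surface closed
    theta = np.linspace(-np.pi / 2, np.pi / 2, n)
    phi = np.linspace(-np.pi, np.pi, n)
    T, P = np.meshgrid(theta, phi)
    x = a1 * spow(np.cos(T), e1) * spow(np.cos(P), e2)
    y = a2 * spow(np.cos(T), e1) * spow(np.sin(P), e2)
    z = a3 * spow(np.sin(T), e1)
    return x, y, z
```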
Feature point selection and tracking approaches were chosen because they allow
capturing the motion’s parameters by using as prior knowledge the kinematics of
the body structure. In addition, point-based approaches seem to be more robust
in comparison with silhouette based approaches. Next, a brief description of the
techniques used is given.
In this work, the feature points are used to capture human body movements and
are selected by using a corner detector algorithm. Let I(x, y) be the first frame
of a given video sequence. Then, a pixel (x, y) is a corner feature if, over a window WS around (x, y), the smallest singular value of G is bigger than a
predefined σ; in the current implementation WS was set to 5×5 and σ = 0.05. G
is defined as:
G = [ ΣIx^2  ΣIx Iy ; ΣIx Iy  ΣIy^2 ] (16.3)
and (Ix , Iy ) are the gradients obtained by convolving the image I with the deriva-
tives of a pair of Gaussian filters. More details about corner detection can be found
in.16 Assuming that at the beginning there is no information about the pedestrian’s
position in the given frame, and in order to enforce a homogeneous feature sam-
pling, input frames are partitioned into 4 regular tiles (2×2 regions of 240×360
pixels each in the illustration presented in Fig. 16.3).
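A sketch of this selection criterion in NumPy/SciPy follows; for a symmetric positive semi-definite G the smallest singular value equals the smallest eigenvalue, which is computed in closed form below. The window size and threshold mirror the values quoted in the text, but the gradient filters and intensity scaling are illustrative choices.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def corner_features(I, window=5, sigma=1.0, thresh=0.05):
    # image gradients (Ix, Iy) via Gaussian derivative filters
    Ix = gaussian_filter(I, sigma, order=(0, 1))
    Iy = gaussian_filter(I, sigma, order=(1, 0))
    # windowed sums building G = [[sum Ix^2, sum IxIy], [sum IxIy, sum Iy^2]], Eq. (16.3)
    Sxx = uniform_filter(Ix * Ix, window) * window**2
    Syy = uniform_filter(Iy * Iy, window) * window**2
    Sxy = uniform_filter(Ix * Iy, window) * window**2
    # smallest eigenvalue of the 2x2 symmetric matrix G at every pixel
    tr, det = Sxx + Syy, Sxx * Syy - Sxy**2
    lam_min = 0.5 * (tr - np.sqrt(np.maximum(tr**2 - 4.0 * det, 0.0)))
    ys, xs = np.where(lam_min > thresh)
    return list(zip(xs, ys))                     # candidate corner features (x, y)
```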
After selecting a set of feature points and setting a tracking window WT (3×3
in the current implementation) an iterative feature tracking algorithm has been
used.16 Assuming a small interframe motion, feature points are tracked by mini-
mizing the sum of squared differences between two consecutive frames.
Points lying on the head or shoulders are the best candidates to satisfy the aforementioned assumption. Most of the other points (e.g., points over the legs, arms or hands) are missed after a couple of frames. Fig. 16.3(top) illustrates
feature points detected in the first frame of the video sequence used in.17 Fig.
16.3(bottom − lef t) depicts the trajectories of the feature points when all frames
are considered. On the contrary, Fig. 16.3(bottom − right) shows the trajecto-
ries after removing static points. In the current implementation we only use one
feature point’s trajectory. Further improvements could be to merge feature points’
trajectories in order to generate a more robust approach.
The outcome of the previous stage is the trajectory of a feature point (Fig.
16.4(top)) consisting of peaks and valleys. Firstly, the first-order derivative of the
curve is computed to find peaks’ and valleys’ positions by seeking its zero-crossing points (positive-to-negative for peaks, negative-to-positive for valleys).
Fig. 16.3. (top) Feature points from the first frame of the video sequence used in.17 (bottom−lef t)
Feature points’ trajectories. (bottom−right) Feature points’ trajectories after removing static points.
Peaks correspond to those frames where the pedestrian reaches the maximum height, which happens at that moment of the
half walking cycle when the hip angles are minimum. On the contrary, the valleys
correspond to those frames where the two legs are in contact with the floor and
then, the hip angles are maximum. So, the valleys are used to find key frames,
while the peaks are used for footprint detection. The frames corresponding to
each valley of Fig. 16.4(top) are presented in Fig. 16.4(middle) and (bottom).
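The peak/valley extraction can be sketched as a zero-crossing search on the discrete first-order derivative of the (possibly smoothed) trajectory; the smoothing and the exact trajectory coordinate used are left open here.

```python
import numpy as np

def peaks_and_valleys(y):
    # y: feature-point trajectory value per frame (smooth it beforehand if noisy)
    s = np.sign(np.diff(y))                                 # sign of the first-order derivative
    peaks = np.where((s[:-1] > 0) & (s[1:] < 0))[0] + 1     # positive-to-negative crossings
    valleys = np.where((s[:-1] < 0) & (s[1:] > 0))[0] + 1   # negative-to-positive crossings
    return peaks, valleys   # valleys give key frames, peaks are used for footprints
```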
An interesting point of the proposed approach is that in this video sequence, in
spite of the fact that the pedestrian is carrying a folder, key frames are correctly
detected and thus, the 3D human body configuration can be computed. On the
contrary, with an approach such as,4 this would be difficult since the matching error would be minimized over the whole shape (including the folder).
After detecting key frames, which correspond to the valleys of the trajectory,
it is necessary to define also the footprints of the pedestrian throughout the se-
quence. In order to achieve this, body silhouettes were computed using an image
segmentation algorithm.11 For some video sequences the segmented images were
Fig. 16.4. (top) A single feature point’s trajectory. (middle and bottom) Key frames associated
with the valleys of a feature point’s trajectory.
The result of the previous stage is a set of static points distributed along the
pedestrian’s path. Now, the problem is to cluster those points belonging to the
same foot. Static points defining a single footprint are easily clustered by studying
the peaks’ positions in the feature point’s trajectory. All those static points in a
neighborhood of F ± 3 from the frame corresponding to a peak position (F ) will
be clustered together and will define the same footprint (f pi ). Fig. 16.5(right)
shows an illustration of static points detected after processing consecutive frames.
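A sketch of this clustering rule, assuming each static point is stored together with the frame in which it was detected:

```python
def footprints(static_points, peak_frames, radius=3):
    # static_points: list of (frame, x, y); peak_frames: frames F of trajectory peaks
    clusters = []
    for F in peak_frames:
        fp = [(x, y) for (frame, x, y) in static_points if abs(frame - F) <= radius]
        if fp:
            clusters.append(fp)        # static points of one footprint fp_i
    return clusters
```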
Fig. 16.5. (lef t) Five consecutive frames used to detect static points. (right) Footprints computed
after clustering static points generated by the same foot (peaks in a feature point’s trajectory (Fig.
16.4(top)).
Fig. 16.6. Motion curves of the joints at the shoulder, elbow, hip and knee (computed from2 ).
As it was introduced above, key frames are defined as those frames where both
feet are in contact with the floor. At every key frame, the articulated human body
structure reaches a posture with maximum hip angles. In the current implemen-
tation, hip angles are defined by the legs and the vertical axis containing the hip
joints. This maximum value, together with the maximum value of the hip motion
model (Fig. 16.6) are used to compute a scale factor κ. This factor is utilized to
adjust the hip motion model to the current pedestrian’s walking. Actually, it is
Fig. 16.7. Half walking cycle executed by using scale factors (κ1 , κ2 ) over the hip motion curve
presented in Fig. 16.6 (knee motion curve is not tuned at this stage). Spatial positions of points (D, H,
C and B) are computed by using angles from the motion curves and trigonometric relationships.
used for half the walking cycle, which does not start from the current key frame
but from a quarter of the walking cycle before the current key frame until halfway
to the next one. The maximum hip angle in the next key frame is used to update
this scale factor.
This local tuning, within a half walking cycle, is illustrated with the 2D artic-
ulated structure shown in Fig. 16.7, from Posture 1 to Posture 3. A 2D articulated
structure was chosen in order to make the understanding easier, however the tun-
ing process is carried out in the 3D space. The two footprints of the first key frame
are represented by the points A and B, while the footprints of the next key frame
are the corresponding points A” and B”. During this half walking cycle one foot
is always in contact with the floor (so points A = A’ = A”), while the other leg is
moving from point B to point B”. In halfway to B”, the moving leg crosses the
other one (null hip angle values). Points C, C’, C” and D, D’, D” represent the
left and right knee, while the points H, H’, H” represent the hip joints.
Given the first key frame, the scale factor κ1 is computed and used to perform
the motion (βi1(t) ) through the first quarter of the walking cycle. The second key
frame (A”, B”) is used to compute the scale factor κ2 . At each iteration of this
half walking cycle, the spatial positions of the points B, C, D and H are calculated
using the position of point A, which remains static, the hip angles of Fig. 16.6
scaled by the corresponding factor κi and the knee angles of Fig. 16.6. The
number of frames in between the two key frames defines the sampling rate of
the motion curves presented on Fig. 16.6. This allows handling variations in the
walking speed.
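A minimal sketch of this local tuning for the hip angle: the generic motion curve of Fig. 16.6 is scaled by κ and resampled over the number of frames available until the next key frame (linear resampling is used here for brevity).

```python
import numpy as np

def tuned_hip_angles(hip_curve, max_hip_key, n_frames):
    # hip_curve:   generic hip motion curve over a half walking cycle (Fig. 16.6)
    # max_hip_key: maximum hip angle measured at the key frame
    kappa = max_hip_key / np.max(hip_curve)              # scale factor kappa_i
    p = np.linspace(0.0, 1.0, n_frames)                  # sampling rate from frame count
    base = np.interp(p, np.linspace(0.0, 1.0, len(hip_curve)), hip_curve)
    return kappa * base                                  # tuned hip angle per frame
```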
Fig. 16.8. (top) Three different frames of the video sequence used in.17 (bottom) The correspond-
ing 3D walking models.
As aforementioned, the computed factors κi are used to scale the hip an-
gles. The difference in walking between people implies that all the motion curves
should be modified by using an appropriate scale factor for each one. In order
to estimate these factors an error measurement (registration quality index: RQI)
is introduced. The proposed RQI measures the quality of the matching between
the projected 3D model and the corresponding human silhouette. It is defined
as: RQI = overlappedArea/totalArea, where total area consists of the surface
of the projected 3D model plus the surface of the walking human figure less the
overlapped area, while the overlapped area is defined by the overlap of these two
surfaces. Firstly, the algorithm computes the knee scale factor that maximizes
the RQI values. In every iteration, an average RQI is computed for all the se-
quence. In order to speed up the process the number of frames was subsampled.
Fig. 16.10. 3D models corresponding to the frames presented in Fig. 16.9 (top) and (bottom)
respectively.
Afterwards, the elbow and shoulder scale factors are estimated similarly. They are
computed simultaneously using an efficient search method.
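The RQI and the scale-factor search can be sketched as below; project() is a hypothetical helper that renders the projected 3D model as a binary mask for a given scale factor and frame, and the candidate set and frame subsampling are left to the caller.

```python
import numpy as np

def rqi(model_mask, silhouette_mask):
    # RQI = overlappedArea / totalArea for two binary masks
    overlap = np.logical_and(model_mask, silhouette_mask).sum()
    total = model_mask.sum() + silhouette_mask.sum() - overlap
    return overlap / total if total else 0.0

def best_scale_factor(candidates, project, silhouettes):
    # pick the factor maximizing the average RQI over the (subsampled) frames
    return max(candidates,
               key=lambda k: np.mean([rqi(project(k, f), s)
                                      for f, s in enumerate(silhouettes)]))
```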
The proposed technique has been tested with video sequences used in17 and,10
together with our own video sequences. Although the current approach has been developed to handle sequences with a pedestrian walking over a planar surface, in a plane orthogonal to the camera direction, the technique has also been tested with an oblique walking direction (see Fig. 16.11), showing encouraging results.
The video sequence used as an illustration throughout this work consists of 85
frames of 480×720 pixels each, which have been segmented using the technique
presented in.11 Some of the computed 3D walking models are presented in Fig.
16.8(bottom), while the original frames together with the projected boundaries
are presented in Fig. 16.8(top).
Fig. 16.9(top) presents a few frames of a video sequence defined by 103
frames (240×320 pixels each), while Fig. 16.9(bottom) corresponds to a video
sequence defined by 70 frames (240×320 pixels each). Although the speed and
walking style is considerably different, the proposed technique can handle both
situations. The corresponding 3D models are presented in Fig. 16.10(top) and
(bottom) respectively.
Finally, the proposed algorithm was also tested on a video sequence, consist-
ing of 70 frames of 240×320 pixels each, containing a diagonal walking displace-
ment (Fig. 16.11). The segmented input frames have been provided by the authors
of.10 Although the trajectory was not on a plane orthogonal to the camera direc-
tion, feature point information was enough to capture the pedestrian’s attitude.
Fig. 16.11. (top − lef t) Feature points of the first frame. (top − right) Feature points’ trajectories.
(bottom) Some frames illustrating the final result (segmented input has been provided by10 ).
A new approach towards human motion modeling has been presented. It ex-
ploits prior knowledge regarding a person’s movement as well as human body
kinematics constraints. In this paper only walking has been modeled. Although
constraints about walking direction and planar surfaces have been imposed, we
expect to extend this technique in order to include frontal and oblique walking
directions.18 A preliminary result has been presented in Fig. 16.11.
Modeling other kinds of human body cyclic movements (such as running or
going up/down stairs) using this technique constitutes a possible extension and
will be studied. In addition, the use of a similar approach to model the displace-
ment of other articulated beings (animals in general19 ) will be studied. Animal
motion (i.e., cyclic movement) can be understood as an open articulated structure,
however, when more than one extremity is in contact with the floor, that struc-
ture becomes a closed kinematics chain with a reduced set of DOFs. Therefore a
motion model could be computed by exploiting these particular features.
Further work will also include the tuning of not only motion model’s parame-
ters but also geometric model’s parameters in order to find a better fitting. In this
way, external objects attached to the body (like a handbag or backpack) could be
added to the body and considered as a part of it.
Acknowledgements
This work was supported in part by the Government of Spain under MEC Re-
search Project TRA2007-62526/AUT and Research Program Consolider Inge-
nio 2010: Multimodal Interaction in Pattern Recognition and Computer Vision
(CSD2007-00018).
References
human actions from multiple views. In Proc. IEEE Int. Conf. on Computer Vision and
Pattern Recognition, Santa Barbara, CA, USA (June, 1998).
13. I. Cohen, G. Medioni, and H. Gu. Inference of 3D human body posture from mul-
tiple cameras for vision-based user interface. In World Multiconference on Systemic,
Cybernetics and Informatics, Orlando, Florida, USA (July, 2001).
14. F. Solina and R. Bajcsy, Recovery of parametric models from range images: The case
for superquadrics with global deformations, IEEE Transactions on Pattern Analysis
and Machine Intelligence. 12(2), 131–147 (February, 1990).
15. A. Sappa, N. Aifanti, S. Malassiotis, and M. Strintzis. Monocular 3D human body
reconstruction towards depth augmentation of television sequences. In Proc. IEEE
Int. Conf. on Image Processing, Barcelona, Spain (September, 2000).
16. Y. Ma, S. Soatto, J. Kosecká, and S. Sastry, An Invitation to 3-D Vision: From Images
to Geometric Models. (Springer-Verlag, New York, 2004).
17. P. Phillips, S. Sarkar, I. Robledo, P. Grother, and K. Bowyer. Baseline results for the
challenge problem of human id using gait analysis. In Proc. IEEE Int. Conf. on Auto-
matic Face and Gesture Recognition, Washington, USA (May, 2002).
18. L. Wang, T. Tan, H. Ning, and W. Hu, Silhouette analysis-based gait recognition for
human identification, IEEE Transactions on Pattern Analysis and Machine Intelli-
gence. 25(12), 1505–1518 (December, 2003).
19. P. Schneider and J. Wilhelms. Hybrid anatomically based modeling of animals. In
IEEE Computer Animation’98, Philadelphia, USA (June, 1998).
Table 17.3. Performance Factor obtained for 1LPF, PRPF and SSPF for two motion sequences
Sequence            Measure                 1LPF        PRPF        SSPF        PRPF/1LPF   SSPF/1LPF
Jump                Pf Knee Angle           5.9×10^-5   1.4×10^-4   1.1×10^-4   2.4         1.9
                    Pf Hip Angle            9.2×10^-5   1.4×10^-4   1.7×10^-4   1.5         1.9
Frontal Movement    Pf Right Elbow Angle    1.4×10^-5   5.0×10^-5   1.2×10^-5   3.5         8.7
                    Pf Left Elbow Angle     5.5×10^-6   3.7×10^-5   1.1×10^-4   6.7         20.8
shown in both figures. Right knee (left) and hip (right) angle estimation using
PRPF, SSPF, 1LPF and manual digitizing curves are shown in Figure 17.12.
17.7. Conclusion
The main contribution of this work is the application of the Path Relinking Parti-
cle Filter (PRPF) and the Scatter Search Particle Filter (SSPF) algorithms to the
model-based human motion tracking. Both algorithms were originally developed
for general dynamic optimization and complicated sequential estimation prob-
lems. Experimental results have shown that PRPF and SSPF frameworks can be
very efficiently applied to the 2D human pose estimation problem. We have es-
timated a performance factor taking into account the number of particles and the
MSE of the corresponding methods against the manual digitizing. By means of
this factor we observe that the SSPF algorithm achieves the best performance in
terms of MSE and computational load. The proposed geometrical human model
is flexible and easily adaptable to the different analyzed human motion activities.
However, it depends on the view-point and it is only suitable for planar move-
ments. In this way, quite energetic planar activities such as running and jumping
in different environments have been effectively tracked.
References
Fig. 17.12. Right hip (left) and knee (right) angle estimation in the jump sequence shown in Figure 17.10 using PRPF (N_part/frame = 2838), SSPF (N_part/frame = 1626), 1LPF (N_part/frame = 4000) and manual digitizing.
Table 17.2. MSE/frame values with respect to manual digitizing and N_part/frame of one-layered PF, PRPF and SSPF for two motion sequences
Figure 17.11 shows a runner tracked with a ten-limb body model using the PRPF and SSPF algorithms. Both sequences demonstrate an accurate model adjustment. The right arm is not included in the geometrical model because it remains completely occluded during most of the video sequence. Figure 17.10 shows the same
countermovement jump sequence tracked by PRPF and SSPF. A full-body model
formed by only five limbs is employed. Selected non-consecutive frames are
Fig. 17.11. Visual model adjustment for a running man using (a) PRPF and (b) SSPF
To analyze the performance of the proposed model-based PRPF and SSPF algo-
rithms, people performing different activities were recorded in several scenarios.
These algorithms were implemented using MATLAB 6.1. Figure 17.8 shows the
model adjustment for a subject performing planar movements. The upper-body model consists of eight limbs. A visual comparison shows very good agreement between the PRPF and SSPF results. Right elbow angle estimations using PRPF and SSPF are compared against the One-Layered Particle Filter (1LPF) and manual digitizing curves in Figure 17.9. The One-Layered Particle Filter algorithm is an improved version of the classical Particle Filter; a description of this algorithm can be found in Ref. 12.
Table 17.2 shows the mean values of several angles from frontal (Figure 17.8)
and jump (Figure 17.10) sequences. In order to give a measure of the performance
CHAPTER 18
Ear Biometrics Based on Geometrical Feature Extraction
Michał Choraś
Institute of Telecommunications, University of Technology & LS
85-796, Bydgoszcz, Poland,
E-mail: [email protected]
1. Introduction
Personal identification has lately become a very important issue in a still
evolving network society. Most of the traditional identification methods,
which are widespread in the commercial systems, have very many
disadvantages. Well known methods like entering Personal Identification
Number (PIN), typing logins and passwords, displaying identification
cards or using specific keys require users to take active part in the
process of identification. Moreover, those traditional methods are unreliable because it is hard to remember all the PINs and passwords, and it is fairly easy to lose ID cards and keys. Another drawback is the lack of security, as cards and keys are often stolen, and passwords
can be cracked.
Biometrics methods easily deal with those problems since users are
identified by who they are, not by something they have to remember or
carry with them. The passive methods of biometrics do not require any
action from users and can take place even without their knowledge.
There are many known methods of human identification based on
image analysis. In general, those biometrics methods can be divided into behavioural and physiological, with regard to the source of data, and into passive and invasive biometrics, with regard to the way the data is acquired (Figure 1).
The first class is based on the behavioural features of human actions
and it identifies people by how they perform something. The most
popular of such methods is voice verification. Other methods are based mainly on the dynamics of specific actions like signing, typing on the keyboard and simply moving or walking. Those methods are not very natural and they require users to take part in the process of identification by repeating specific actions every time they are examined.
Physiological (anatomical) biometrics methods are based on the
physiological features of humans; thus they measure and compare features of specific parts of the human body in the process of identification. So far the main interest has been in the head and the hand, with face, eye and fingerprint features being the most important discriminants of human identity.
2. Ear Biometrics
Human ears have been used as a major feature in forensic science for many years. Recently, so-called earprints found at crime scenes have been used as evidence in several hundred cases in the Netherlands and the United States [12]. However, an automated ear recognition system has still not been implemented, even though there are many advantages of using the ear as a source of data for person identification.
Firstly, the ear does not change considerably during human life, while the face changes more significantly with age than any other part of the human body. The face can also change due to cosmetics, facial hair and hair styling. Secondly, the face changes due to emotions and expresses different states of mind like sadness, happiness, fear or surprise. In contrast, ear features are relatively fixed and unchangeable [14].
Moreover, the colour distribution is more uniform in the ear than in the human face, iris or retina. Thanks to that fact, not much information is lost when working with greyscale or binarized images, as we do in our method.
Figure 2 presents two more aspects of ear identification. Firstly, the ear is one of our sensors, therefore it is usually visible (not hidden underneath anything) to enable good hearing. The ear is also smaller than the face, which means that it is possible to work faster and more efficiently with images of lower resolution.
Figure 3. Ear visibility can be easily achieved in applications allowing interaction with
the user (for example access control systems).
g(i, j) = { 1 if S(i, j) ≥ T(i, j);  0 if S(i, j) < T(i, j) }    (5)
As a result we obtain the binary image g(i, j) with the detected contours. Moreover, the constant k makes it possible to adjust the sensitivity of the edge detection algorithm [9].
An example of the edge detection algorithm is shown in Figure 5.
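For illustration, Eq. (5) can be read as a locally adaptive binarization of the edge-strength image S. The sketch below assumes, purely for the sake of example, that the threshold T(i, j) is k times a local mean of S; the actual threshold definition used here is the one of [9].

```python
import numpy as np
from scipy.ndimage import uniform_filter

def binarize_edges(S, k=0.95, window=15):
    """Binarization of Eq. (5): g(i,j) = 1 where S(i,j) >= T(i,j), else 0.
    T is assumed here to be k times a local mean of S over a square window
    (an illustrative choice; see Ref. [9] for the actual definition)."""
    T = k * uniform_filter(S.astype(float), size=window)
    return (S >= T).astype(np.uint8)
```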
3.2. Normalization
Given the binary image g(i, j), we search for the centroid, which later becomes the reference point for feature extraction. The centroid is obtained as:
I = ( Σ_i Σ_j i·g(i, j) ) / ( Σ_i Σ_j g(i, j) ),   J = ( Σ_i Σ_j j·g(i, j) ) / ( Σ_i Σ_j g(i, j) )    (6)

σ_i = ( Σ_i Σ_j i²·g(i, j) ) / ( Σ_i Σ_j g(i, j) ) − I²,   σ_j = ( Σ_i Σ_j j²·g(i, j) ) / ( Σ_i Σ_j g(i, j) ) − J²    (8)
Figure 6. Binary ear images with the extracted edges (2 values of k) and with the centroid
marked with a cross. Circles represent the radius values for calculation of number of
pixels intersecting each circle. The table below shows the centroid values for each binary
image.
The general rule for forming the first vector is presented below:
V = { [r_min, l_r_min, Σd_r_min], …, [r_max, l_r_max, Σd_r_max] }    (9)
where:
r – radius length,
l_r – number of intersection points for each radius,
Σd – sum of all the distances between the intersection points for the considered radius.
N_c8(g₀) = Σ_{k∈S} ( ḡ_k − ḡ_k · ḡ_{k+1} · ḡ_{k+2} ),    (10)
where S = (1, 3, 5, 7) and ḡ_k = 1 − g_k.
Figure 9. The symbolic representation of the second step of feature extraction algorithm.
The general rule for forming the second vector for each contour is
presented below. First we store the number of endings, bifurcations and
intersection points, and then we store all the coordinates of those points
for all the extracted and traced contours. For C contours in a given image
we obtain:
F = { [ (N_E, N_B, N_I)(e₁, …, e_N_E, b₁, …, b_N_B, i₁, …, i_N_I) ]₁, …,
      [ (N_E, N_B, N_I)(e₁, …, e_N_E, b₁, …, b_N_B, i₁, …, i_N_I) ]_C }    (11)
where:
N_E – number of endings in each contour,
N_B – number of bifurcations in each contour,
N_I – number of points intersecting the circles,
e – coordinates of endings,
b – coordinates of bifurcations,
i – coordinates of intersections in each contour.
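For illustration, Eq. (10) for a single contour pixel could be computed as in the sketch below. The counter-clockwise ordering of the neighbours g₁..g₈ and the complement ḡ_k = 1 − g_k are assumptions, since the excerpt does not spell them out.

```python
# Neighbour offsets g1..g8, assumed counter-clockwise starting to the right of g0.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def connectivity_number(g, i, j):
    """Eq. (10): N_c8(g0) = sum over k in S of (g~_k - g~_k g~_{k+1} g~_{k+2}),
    with S = (1, 3, 5, 7), g~_k = 1 - g_k and indices taken modulo 8.
    On a thinned contour this typically yields 1 at an ending and 3 at a
    bifurcation."""
    nb = [int(g[i + di, j + dj]) for di, dj in OFFSETS]
    cb = [1 - v for v in nb]                         # complemented neighbours g~_k
    total = 0
    for k in (0, 2, 4, 6):                           # 0-based positions of g1, g3, g5, g7
        total += cb[k] - cb[k] * cb[(k + 1) % 8] * cb[(k + 2) % 8]
    return total
```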
4. Classification
For each image stored in the database we have two vectors Fref and Vref .
For each input ear, we acquire many images under different angles to the
camera.
The algorithm for the recognition of an input image is as follows:
2. for each radius, we search the database feature vectors V_ref that have the same number of intersections l_r for the corresponding radii
3. the vectors with the number of intersections (l_r ± δ) are also accepted, allowing a difference of δ pixels on each circle
4. in the next step we check if the difference within the distance sum Σd for all the extracted vectors is less than a certain threshold value
5. if none of the vectors V_ref are found for the input image, the input image is rejected
6. if the number of intersecting points l_r is accepted and the difference within the distance sum Σd is less than a certain threshold value, we check the contour-topology vector F
7. we first search for the same triples (N_E, N_B, N_I) of the input contour-topology vector F within the reference contour vectors F_ref
8. then, for the images with the same triples (N_E, N_B, N_I), we check if the coordinates of the stored points are the same
9. if the corresponding coordinates of those vectors refer to the same points, the algorithm finds the winner of classification.
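Steps 2-5, which compare the radius-based vectors of an input image against a reference, could be sketched as follows; δ and the distance-sum threshold are illustrative values, and the (r, l_r, Σd) layout of the vectors follows the sketch given earlier.

```python
def radius_vectors_match(V_in, V_ref, delta=1, dist_thresh=5.0):
    """Steps 2-5 of the classification algorithm: accept a reference vector
    only if, for every radius, the intersection counts agree within +/- delta
    and the distance sums differ by less than a threshold."""
    for (r, l_in, d_in), (_, l_ref, d_ref) in zip(V_in, V_ref):
        if abs(l_in - l_ref) > delta:        # steps 2-3: intersection counts too different
            return False
        if abs(d_in - d_ref) > dist_thresh:  # step 4: distance sums too different
            return False
    return True                              # steps 6-9 then compare the contour-topology vectors F
```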
without illumination changes (Figure 10). For such “easy” images from
our database we obtained error-free recognition.
Figure 10. Some examples of “easy” ear images from our database.
Figure 11. Some examples of “difficult” ear images from our database.
We also think that the feature vectors should be enriched with more
geometrical features in order to better distinguish ear identity. We are
testing some new geometrical parameters describing shapes of ear
contours and we compare their effectiveness in ear identification.
Moreover, we search for features other than geometrical ones to describe ear images, such as energy and shape parameters. We try to discover which features are the most significant in determining ear identity, so
6. Conclusions
In this article we proposed a human identification method based on human ear images. We proposed an invariant geometrical method in order to extract the features needed for classification. First we perform a contour detection algorithm, then coordinate normalization. Thanks to placing the centre of the new coordinate system at the centroid, our method is invariant to rotation, translation and scaling, which will allow RST queries. The centroid is also a key reference point in the feature
extraction algorithm, which is divided into 2 steps. In the first step, we create circles centred at the centroid and we count the number of intersection points for each radius and the sum of all the distances between those points. All those values are stored in the first feature vector corresponding to the radii. In the second step, we use the created circles, but this time we count the intersection points for each contour line. Moreover, while tracing the contour lines, we detect the characteristic points like endings and bifurcations. Together with the intersection points for each contour, we store them in the second feature vector corresponding to contour topology. Then we perform classification, based on a simple comparison between the input image feature vectors and all the vectors from the database. So far we have obtained very good results; however, we still continue our research in order to improve our method and add more parameters to the feature vectors.
We believe that the human ear is a perfect source of data for passive person identification in many applications. Given the growing need for security in various public places, ear biometrics seems to be a good solution, since ears are visible and their images can be easily taken, even without the knowledge of the examined person. A robust feature extraction method can then be used to determine the identity of individuals, for instance terrorists at airports and stations. Access control to various buildings and crowd surveillance are among other possible applications.
References
1. Ashbourn J., Biometrics - Advanced Identity Verification, Springer-Verlag 2000.
2. Beveridge J.R., She R., Draper B.A., Givens G.H., Parametric and Nonparametric
Methods for the Statistical Evaluation of Human Id Algorithms, Workshop on
Evaluation Methods in Computer Vision, 2001.
3. Bowman E., Everything You Need to Know about Biometrics, Technical Report,
Identix Corporation, 2000.
4. Burge M., Burger W., Ear Biometrics, Johannes Kepler University, Linz, Austria
1999.
5. Burge M., Burger W., Ear Biometrics for Machine Vision, 21 Workshop of the
Austrian Association for Pattern Recognition, Hallstatt, 1997.
6. Canny J., A Computational Approach to Edge Detection, IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 8, no. 6, 679-698, 1986.
7. Choraś M., Human Identification Based on Image Analysis – New Trends, Proc.
Int. IEEE Workshop Signal Processing'03, pp. 111-116, Poznan 2003.
8. Choraś M., Human Ear Identification Based on Image Analysis, in L. Rutkowski et al. (Eds): Artificial Intelligence and Soft Computing, ICAISC 2004, 688-693, LNAI
3070, Springer-Verlag 2004.
9. Choraś M., Feature Extraction Based on Contour Processing in Ear Biometrics,
IEEE Workshop on Multimedia Communications and Services, MCS’04, 15-19,
Cracow.
10. Danielsson P. E., Ye Q. Z., Rotation-Invariant Operators Applied to Enhancement
of Fingerprints, Proc. 8th ICPR, Rome 1988.
11. Hong L, Jain A.K., Pankanti S., Can Multibiometrics Improve Performance?, Proc.
of AutoID’99, 59-64, 1999.
12. Hoogstrate A.J., Heuvel van den H., Huyben E., Ear Identification Based on
Surveillance Camera’s Images, Netherlands Forensic Institute, 2000.
13. Hurley D.J., Nixon M.S., Carter J.N., Force Field Energy Functionals for Image
Feature Extraction, Image and Vision Computing Journal, vol. 20, no. 5-6, 311-318,
2002.
14. Iannarelli A., Ear Identification, Forensic Identification Series, Paramont Publishing
Company, California 1989.
15. Jain A., Bolle R., Pankanti S., Biometrics: Personal Identification in Networked
Society, Kluwer Academic Publishers, 1999.
16. Jain A.K., Ross A., Multibiometric Systems, Comm. ACM, Special Issue on
Multimodal Interfaces, vol. 47, no. 1, 34-40, 2004.
17. Jain L. C., Halici U., Hayashi I., Lee S. B., Tsutsui S., Intelligent Biometric
Techniques in Fingerprint and Face Recognition, CRC Press International Series on
Computational Intelligence, 1999.
18. Kouzani A.Z., He F., Sammut K., Towards Invariant Face Recognition, Journal of
Information Sciences 123, Elsevier 2000.
19. Lai K., Chin R., Deformable Contours: Modeling and Extraction, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 17, no. 11, 1084-1090, 1995.
20. Moreno B., Sanchez A., Velez J.F., On the Use of Outer Ear Images for Personal
Identification in Security Applications, IEEE Conf. On Security Technology, 469-
476, 1999.
21. Safar M., Shahabi C., Sun X., Image Retrieval By Shape: A Comparative Study,
University of Southern California, November 1999.
22. Victor B., Bowyer K.W., Sarkar S., An Evaluation of Face and Ear Biometrics,
Proc. of Intl. Conf. on Pattern Recognition, I: 429-432, 2002.
23. Zhang D., Automated Biometrics – Technologies and Systems, Kluwer Academic
Publishers, 2000.
CHAPTER 19
Improvement of Modal Matching Image Objects in Dynamic Pedobarography
1. Introduction
In several areas of Computational Vision, one of the main problems
consists in the determination of correspondences between objects
1.1. Background
2. Dynamic Pedobarography
Pedobarography refers to measuring and visualizing the distribution of
pressure under the foot sole. The recording of pedobarographic data
along the duration of a step, in normal walking conditions, permits the
dynamic analysis of the foot behavior. This introduction of the time
dimension augments the potential of this type of clinical examination as
an auxiliary tool for diagnostics and therapy planning12.
The basic pedobarography system consists of a transparent plate
trans-illuminated through its borders in such a way that the light is
internally reflected. The plate is covered on its top by a single or dual
thin layer of soft porous plastic material where the pressure is applied
(see Fig. 1).
Figure 1. Basic (pedo)barography principle.
Figure 2. Basic setup of a pedobarography system.
3. Object Models
In the initial stages of the work, the object contours in each image were
extracted and the matching process was oriented to the contours’
pixels2,13. A practical difficulty arising from this approach is the possible
existence of more than one contour for the object represented in each
image (i. e. see Fig. 7). To find the correspondence between each
contours pair along the image sequence, two possible solutions were
considered: i) use of a Kalman filtering (see13, for example) approach to
estimate and track the location of the contours’ centroids along the image
sequence; ii) use of a measure of the deformation energy necessary to
align each contour pairs, selecting the lower energy pairs.
However, an additional problem is still present: the possibility that
along the image sequence various contours will merge or split. In order
to accommodate this possibility, a new model has been developed,
similar to the one used in various applications working with controlled
environment, such as in face analysis and recognition14, 15: The brightness
level of each pixel is considered as the third coordinate of a 3D surface
point. The resulting single surface model solves the two aforementioned
problems.
The use of the surface model also simplifies the consideration of
isobaric contours, which are important in pedobarographic analysis,
either for matching contours of equal pressure along the time sequence or
for matching contours of different pressure in a single image12, 13.
The following sections describe the object models used in this work
and briefly describe their construction. Each model has its own
advantages and shortcomings; for every particular problem, the best
choice must be made12, 13.
Figure 4. Modeling a contour by using a set ei of axial finite elements.
Figure 5. Image (negated) where contours must be found.
Figure 6. Result image after edge enhancement.
Figure 8. Sampled contours obtained from the original image sequence of Fig. 3.
Figure 10. Image (negated) after noise removal and Gaussian filtering.
Figure 11. Object sampling.
The surfaces obtained from the original images presented in Fig. 3 are
visible in Fig. 13. The original images with identification (ID) 0 (zero)
and 1 (one) were not considered.
Figure 13. Surfaces obtained from the last eleven images of the example sequence.
Figure 14. Ten isobaric contours extracted from the surface of Fig. 12.
4. Matching Methodology
Fig. 15 displays a diagram of the adopted physical matching
methodology. The locations of the objects data points in each image,
X = [ X 1 … X m ] , are used as the nodes of a finite elements model made of
an elastic material. Next, the eigenmodes {φ }i of the model are
computed, providing an orthogonal description of the object and its
natural deformations, ordered by frequency. Using a matrix based
notation, the eigenvectors matrix [ Φ ] and the eigenvalues diagonal
matrix [ Ω ] can be written as in equation (1) for 2D objects and as in
equation (2) for 3D objects.
The eigenvectors, also called shape vectors1, 2, 16, 17, describe how each vibration mode deforms the object by changing the original data point locations: X_deformed = X + a{φ}_i. The first three (in 2D) or six (in 3D) vibration modes are the rigid body modes of translation and rotation; the
[Φ] = [ {φ}_1 ⋯ {φ}_3m ] = [ {u}_1^T; …; {u}_m^T; {v}_1^T; …; {v}_m^T; {w}_1^T; …; {w}_m^T ]   and   [Ω] = diag( ω_1², …, ω_3m² ).    (2)
5. Results
The presented matching methodology was integrated into a generic software platform for deformable objects2, 23, previously developed using Microsoft Visual C++, the Newmat24 library for matrix computation, and VTK - The Visualization Toolkit25, 26 for 3D visualization, mesh triangulation, simplification and smoothing, and for isobaric contour extraction.
This section presents some experimental results obtained on objects
extracted from dynamic pedobarography images, using the adopted
matching methodology. First, results are presented considering contour models, then surface models, and finally isocontour models. In all cases, the object pairs considered were selected just for example purposes.
All the results presented in this section were obtained using Sclaroff's isoparametric finite element in the objects' modeling, with rubber as the virtual elastic material, and using 25% of the models' vibration modes in the matching phase.
Considering the pairs of contours with ID 2/3, 3/4 and 10/11, previously
presented in Fig. 8, using the matching methodology adopted, we obtain
the matches shown in Fig. 16, 17, and 18, respectively. In these figures,
the contours' data points, and also the matched pairs, are connected for
better visualization.
Figure 16. Matches obtained between contours with ID 2 (64 nodes) and 3 (64 nodes).
Figure 17. Matches obtained between contours with ID 3 (64 nodes) and 4 (67 nodes).
The matches found between the contours extracted from the last eleven images (ID 2 to 12) of the example sequence are presented in Fig. 19. The number and the percentage of matches obtained during the same
sequence are indicated in Table 1.
The results obtained using the local search strategy were in the range of 50.2% to 82.8%, and all of good quality; in contrast, the optimization search procedure always achieved 100% matching success, and in general the matches found also have good quality, with the excess nodes reasonably matched.
Figure 18. Matches obtained between contours with ID 10 (48 nodes) and 11 (48 nodes).
Figure 19. Matches obtained between all eleven contours (from ID 2 to 12).
Table 1. Number (Nº Match) and percentage of matches (% Match) obtained between the
contours extracted from the last eleven images (ID 2 to 12) of the example sequence.
Contours ID   Nº Nodes   Local               Optimization         Excess Nodes
                         Nº Match  % Match   Nº Match  % Match
2, 3          64/64      53        82.8%     64        100%       –
3, 4          64/67      44        68.8%     64        100%       67
4, 5          67/67      55        82.1%     67        100%       –
5, 6          67/64      47        73.4%     64        100%       67
6, 7          64/58      46        79.3%     58        100%       64
7, 8          58/51      30        58.8%     51        100%       58
8, 9          51/50      33        66.0%     50        100%       51
9, 10         50/48      26        54.2%     48        100%       50
10, 11        48/48      24        50.0%     48        100%       –
11, 12        48/46      25        54.3%     46        100%       48
The matches found between the surface models with ID 2/3, 4/5 and 11/12, each one built from the corresponding image of the example sequence, using the adopted matching methodology, are shown in
Fig. 20, 21, and 22, respectively. In these figures, the matched nodes are
connected for better viewing and two different views are presented.
The number and the percentage of matches obtained considering the surface models built from the last eleven images of the example sequence are indicated in Table 2.
The results of the local search strategy were in the range of 33.3% to 87.8% and, considering the high deformation of the objects, we can consider that the matches found have good quality. On the other hand, the search strategy based on optimization techniques always achieved 100%, and generally the matches also present good quality.
As in the contour case, all the parameters of the matching
methodology were considered constant along the sequence.
Local matching: 48
Figure 20. Matches obtained between surfaces with ID 2 (116 nodes) and 3 (131 nodes).
Local matching: 64
Figure 21. Matches obtained between surfaces with ID 4 (131 nodes) and 5 (125 nodes).
the surface case, in these figures the matched nodes are connected and
two different views are presented.
Local matching: 32
Optimization matching: 85
Figure 22. Matches obtained between surfaces with ID 11 (99 nodes) and 12 (85 nodes).
Table 2. Number (Nº Match) and percentage of matches (% Match) obtained with the
surfaces build from the last eleven images of the example sequence.
Surfaces ID   Nº Nodes   Local               Optimization
                         Nº Match  % Match   Nº Match  % Match
2, 3          116/131    48        41.3%     116       100%
3, 4          131/131    115       87.8%     131       100%
4, 5          131/125    64        51.2%     125       100%
5, 6          125/109    67        61.5%     109       100%
6, 7          109/107    51        47.7%     107       100%
7, 8          107/96     32        33.3%     96        100%
8, 9          96/98      52        54.1%     96        100%
9, 10         98/95      56        58.9%     95        100%
10, 11        95/99      52        54.7%     95        100%
11, 12        99/85      32        37.4%     85        100%
Local matching: 52
Optimization matching: 76
Figure 23. Matches obtained between isocontours with ID 1 (76 nodes) and 2 (76 nodes).
Local matching: 40
Optimization matching: 70
Figure 24. Matches obtained between isocontours with ID 4 (76 nodes) and 5 (70 nodes).
Local matching: 24
Optimization matching: 46
Figure 25. Matches obtained between isocontours with ID 6 (54 nodes) and 7 (46 nodes).
Using the local search strategy, the results obtained were in the range of 50.0% to 94.1%, and the matches found can be considered of good quality. In contrast, the search based on optimization techniques always achieved 100% matching success, and in general the matches found also have good quality, with the excess nodes reasonably matched.
As in the contour and surface cases, the matching results obtained
could be improved if the parameters of the adopted methodology were
adjusted for each isocontour pair.
6. Conclusions
The several experimental tests carried out in this work, some reported in this paper, allow the presentation of some observations and conclusions.
The physical methodology proposed for the determination of correspondences between two objects, using optimization techniques in the matching phase, always obtained an equal or higher number of satisfactory matches when compared with the one previously developed that considers local search. It was also verified that the number of matches found is independent of the optimization algorithm considered.
Figure 26. Matches obtained between the eleven isocontours extracted from the
image with ID 6 of the example sequence.
Table 3. Number (Nº Match) and percentage of matches (% Match) obtained with the
eleven isocontours extracted from the image with ID 6 of the example sequence.
Isocontours ID   Nº Nodes   Local               Optimization         Excess Nodes
                            Nº Match  % Match   Nº Match  % Match
1, 2             76/76      52        68.4%     76        100%       –
2, 3             76/74      58        78.4%     74        100%       76
3, 4             74/76      50        67.6%     74        100%       76
4, 5             76/70      40        57.1%     70        100%       76
5, 6             70/54      32        59.3%     54        100%       70
6, 7             54/46      24        52.2%     46        100%       54
7, 8             46/38      22        57.9%     38        100%       46
8, 9             38/34      17        50.0%     34        100%       38
9, 10            34/34      32        94.1%     34        100%       –
10, 11           34/34      17        50.0%     34        100%       –
Acknowledgments
This work was partially done in the scope of project “Segmentation,
Tracking and Motion Analysis of Deformable (2D/3D) Objects using
Physical Principles”, with reference POSC/EEA-SRI/55386/2004,
financially supported by FCT – Fundação para a Ciência e a Tecnologia
from Portugal.
References
1. S. E. Sclaroff and A. Pentland, “Modal Matching for Correspondence and
Recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 17, pp. 545-561, 1995.
2. J. M. R. S. Tavares, PhD Thesis, “Análise de Movimento de Corpos Deformáveis
usando Visão Computacional”, in Faculdade de Engenharia da Universidade do
Porto, Portugal, 2000.
3. Y. Ohta and T. Kanade, “Stereo by Intra- and Inter-Scanline Search using Dynamic
Programming”, IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 7, pp. 139-154, 1985.
4. S. Roy and I. J. Cox, “A Maximum-Flow Formulation of the N-camera Stereo
Correspondence Problem”, presented at International Conference on Computer
Vision (ICCV'98), Bombay, India, 1998.
CHAPTER 20
In video surveillance and sports analysis applications, object trajectories offer the
possibility of extracting rich information on the underlying behavior of the mov-
ing targets. To this end we introduce an extension of Point Distribution Models
(PDM) to analyze the object motion in their spatial, temporal and spatiotemporal
dimensions. These trajectory models represent object paths as an average trajec-
tory and a set of deformation modes, in the spatial, temporal and spatiotemporal
domains. Thus any given motion can be expressed in terms of its modes, which
in turn can be ascribed to a particular behavior.
The proposed analysis tool has been tested on motion data extracted from a
vision system that was tracking radio-guided cars running inside a circuit. This
affords an easier interpretation of results, because the shortest lap provides a ref-
erence behavior. Besides showing an actual analysis, we discuss how to normalize trajectories so that the analysis is meaningful.
20.1. Introduction
S = (1/(K−1)) Σ_{i=1}^{K} (τ_i − τ̄)(τ_i − τ̄)^T = P · Λ · P^{−1}    (20.4)
and K is the number of trajectories available in the training set and r =
min(3N, K) − 1 is the number of degrees of freedom in the set.
The computation of matrix P from a set of representative trajectories is known as Principal Component Analysis (PCA)9 or Karhunen-Loève Transform (KLT). It
that indicates the contribution of each deformation mode pi toward the actual
shape. For a given trajectory τk this contribution Bk is computed as
B_k = P^{−1} (τ_k − τ̄).    (20.6)
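A minimal NumPy sketch of the model training and of Eq. (20.6); each trajectory is assumed to be stored as one row of a (K × 3N) array, and the function names are illustrative.

```python
import numpy as np

def train_trajectory_pdm(T):
    """T has one trajectory tau_i per row.  Returns the mean trajectory,
    the deformation modes P (eigenvectors of the covariance S of
    Eq. (20.4)) and the corresponding eigenvalues, sorted by decreasing
    energy."""
    tau_mean = T.mean(axis=0)
    S = np.cov(T, rowvar=False)          # (1/(K-1)) sum_i (tau_i - mean)(tau_i - mean)^T
    lam, P = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    return tau_mean, P[:, order], lam[order]

def deformation_coefficients(tau_k, tau_mean, P):
    """Eq. (20.6): B_k = P^-1 (tau_k - mean); P is orthogonal, so its
    inverse is its transpose."""
    return P.T @ (tau_k - tau_mean)
```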
However, if a given trajectory is very different from the ones in the training set,
it will require a large amount of deformation to fit the resulting model. Therefore
the deformation coefficients Bk can be used to detect outlier trajectories by using
Hotelling’s T 2 statistic,10 which is a multivariate analogue of the t-distribution.
To use this statistic, the deformation modes have to be normalized so that they all
have the same unit variance. Mathematically it means that we define a new set of
normalized modes and coefficients
P̃ = P Λ^{−1/2}    (20.7)
B̃_k = Λ^{−1/2} B_k.    (20.8)
where Λ is the diagonal matrix that contains the eigenvalues of the covariance
matrix. In this normalized space we can define for each trajectory a scalar value
T_k² = B̃_k^T B̃_k = Σ_{j=1}^{M} (b̃_kj)²    (20.9)
where M is the number of principal modes retained. This scalar T_k² is the Mahalanobis distance of the trajectory, and it can be interpreted as the normalized deformation energy of the related trajectory. If the deformation coefficients b_kj in the
training set were normally distributed, then (1 − α)% of times the deformation
energy would be bounded by the scalar
T²_{α,M,K} = ( M·(K − 1) / (K − M) ) · F_{M,K−M;α}    (20.10)
where F_{M,K−M;α} stands for the Fisher distribution with M and K − M degrees of freedom and a (1 − α)% confidence interval. Therefore a trajectory can be defined as statistically conforming to the set if
T_k² ≤ T²_{α,M,K}.    (20.11)
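A sketch of this conformance test, assuming SciPy's F-distribution quantile is used to evaluate Eq. (20.10).

```python
import numpy as np
from scipy.stats import f as fisher_f

def is_conforming(B_k, lam, K, M, alpha=0.1):
    """Eqs. (20.7)-(20.11): normalize the first M deformation coefficients,
    compute the Mahalanobis distance T_k^2 and compare it with the
    (1 - alpha) bound T^2_{alpha,M,K}."""
    b_tilde = B_k[:M] / np.sqrt(lam[:M])                                  # Eq. (20.8)
    T2_k = float(np.sum(b_tilde ** 2))                                    # Eq. (20.9)
    T2_lim = M * (K - 1) / (K - M) * fisher_f.ppf(1 - alpha, M, K - M)    # Eq. (20.10)
    return T2_k <= T2_lim                                                 # Eq. (20.11)
```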
20.3. Experiments
Fig. 20.1. On the left, the circuit used for the radio-controlled cars, with the cross markings used
for camera calibration purposes. On the right the spatial profile of a set of trajectories for one player
(9 trajectories). The average trajectory is indicated with a thicker line.
3 minutes, which implied typically 8-9 laps, which will be the trajectories under
study (cf. figure 20.1(b)).
Fig. 20.2. On the left the problem of correspondence between the points in two trajectories is shown.
On the right, the selected resampling technique is shown: trajectories are resampled along orthogonal
lines to the points in a reference trajectory.
To solve this problem the trajectories are fitted to cubic splines and resampled
along orthogonal positions to a reference trajectory (cf. figure 20.2(b)). The ref-
erence trajectory is chosen so that it is smooth enough to afford the resampling of
the maximum number of trajectories. Indeed, if the trajectory has many bends, the
intersections with the orthogonal lines might not respect the original order or they
might not even exist. In any case, sensitivity analysis has shown that, for this data
set, the choice of the reference trajectory does not have a noticeable impact on the
subsequent modes. Once the trajectories are resampled they can be superposed
and an average trajectory and modes can be computed.
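A rough sketch of this resampling step is given below. Instead of intersecting lines orthogonal to the reference trajectory, which requires more machinery, it simply keeps, for every reference point, the closest point on a densely sampled cubic spline; this is a simplification of the procedure described above.

```python
import numpy as np
from scipy.interpolate import splprep, splev

def resample_against_reference(traj, ref, n_dense=2000):
    """Fit a parametric cubic spline to `traj` (an (n, 2) array of x, y
    points) and return, for every point of the reference trajectory `ref`,
    the closest point on the spline."""
    tck, _ = splprep([traj[:, 0], traj[:, 1]], s=0, k=3)
    dense = np.column_stack(splev(np.linspace(0.0, 1.0, n_dense), tck))
    out = []
    for p in ref:
        idx = np.argmin(np.sum((dense - p) ** 2, axis=1))   # nearest spline sample
        out.append(dense[idx])
    return np.asarray(out)
```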
To perform a purely spatial analysis we deal only with the x_i^k and y_i^k coordinates of the trajectory, removing the temporal values t_i^k. Equation (20.2) becomes
τ_k = [ x_1^k … x_N^k  y_1^k … y_N^k ]^T    (20.15)
and the analysis is performed as in the original PDM formulation,8 providing in-
formation about the spatial shape of the trajectories.
Fig. 20.3. On the left side each individual lap is plotted together with the average trajectory τ̄. The right image shows the synthetic trajectory corresponding to each mode p_j with its greatest contribution max_k(b_kj) found in the set.
Figure 20.3(a) shows the individual laps together with the average trajectory τ .
Figure 20.3(b) represents the different modes of the spatial analysis. Each mode
is plotted with a coefficient corresponding to the greatest contribution max_k(b_kj) found in the set. The modes are ordered by decreasing amount of shape variation
(energy). The cumulative energy is plotted in figure 20.4(a), where it can be seen
that the first 4 modes contain 85% of the energy in the training set. The last mode
can be neglected.
The spatial representation of each mode can be directly linked to the tra-
jectories. Figure 20.4(b) shows mode 2 with different amounts of deformation
by generating synthetic trajectories τ = τ̄ + B · P with a deformation vector
B = [0 d 0 0 0 0 0 0] where only the second component is nonzero. It can be
interpreted that this 2nd mode encodes the variability in the way the bottom curves
are negotiated, and most particularly the bottom-right curve. Indeed lap 2, which
has a variation in that same curve, shows the greatest contribution from mode 2,
as seen in figure 20.5(a). A similar relationship can be seen between mode 3 and
lap 9. However, the most occurring source of variation, mode 1, represents the
global variation among trajectories that follow the inner wall or the outer wall of
the circuit.
Fig. 20.4. On the left, the cumulative energy of the deformation modes Ce(j) = Σ_{i=1}^{j} λ_i / Σ_{k=1}^{r} λ_k. It can be seen that 85% of the total energy is contained in the first 4 modes. The right image shows some synthetic trajectories corresponding to mode 2: τ = τ̄ + B · P with B = [0 d 0 0 0 0 0 0] and d = −100, −50, 0, 50, 100. It can be interpreted that this 2nd mode seems to encode mainly the variations in the way the bottom-right curve is negotiated.
The bound on acceptable deformation with a 90% confidence level is given by T²_{0.1,2,9} = 7.45. This is plotted as a circle of radius √7.45 = 2.73, because in that case equations (20.9) and (20.11) become
T_k² = (b̃_k1)² + (b̃_k2)² < T²_{0.1,2,9} = 7.45    (20.16)
It can be seen from figure 20.5(b) that none of the trajectories are statistically
outliers, although trajectories 2 and 3 do stand out. Indeed, trajectory 2 happens
to be the slowest one.
Fig. 20.5. On the left figure, the contribution of each mode to the different trajectories is plotted. A great contribution from a mode to a trajectory implies that the shape of the mode can be found in the shape of the trajectory. On the right figure the normalized deformation vector B̃_k = [b̃_k1 b̃_k2] for the two first modes is plotted on a circle of radius √(T²_{0.1,2,9}) = 2.73 (cf. equation 20.10). In this case, no outlier has been found. Indeed the original variability is so high that none of the trajectories can be considered as an outlier.
Fig. 20.6. The left figure shows the trajectories in spatiotemporal dimension superposed together
with the mean trajectory. The right figure plots the shape of the maximum contribution of the first
mode, showing that it produces a more jumpy, slower trajectory.
In their spatiotemporal domain, the trajectories that have been analyzed seem
to have a smaller number of deformation modes. Indeed, the first mode, pictured
above, accounts for 80% of the total energy in the set (see figure 20.7(a)). From
figure 20.7(b) it can be seen that mode 1 contributes the most, with a positive
coefficient, to trajectory or lap 2. So much so that the corresponding coefficient b̃₁² is greater than the expected threshold √(T²_{0.1,1,9}), as per eq. (20.10), indicating
that the second trajectory is a statistical outlier. It is indeed this trajectory that
happens to be the slowest in the set.
Fig. 20.7. The left figure shows the cumulative energy in the spatiotemporal analysis of the trajectory
set. The first deformation mode accounts for 80% of the total energy. The right figure shows the
normalized deformation coefficients b̃k1 corresponding to the first mode in the spatiotemporal analysis
of the trajectory set.
20.4. Conclusion
20.4.1. Perspectives
To better correlate the analyzed trajectories with the observed behaviors, we have
started a collaborative project with the Swarm-Intelligent Systems group (SWIS),
a research team at the EPFL focusing on collective embedded systems involving
multi-robot platforms. The point is that mobile robots provide an experimental platform whose behavior can be programmed and which still shows natural variability in its trajectories. From the roboticists' point of view, trajectory analysis
tools such as the one described in this paper, provide a means of quantifying the
robot’s behavior and hence predict their performance (in terms of time, energy,
work done) for tasks where trajectories play a role.
The first work to be done is to recreate the circuit experiment with mobile
robots and to generate a huge number of trajectories, in order to classify the dif-
ferent behaviors. This experiment could demonstrate the validity of these first
results.
Acknowledgments
The continuation of this project is supported by the Swiss National Science Foun-
dation grant “Trajectory Analysis and Behavioral Identification in Multi-Robot
Systems”, No. 200021–105565.
References
1. N. Johnson and D. Hogg, Learning the distribution of object trajectories for event
recognition, Image and Vision Computing. 14(8), 609–615, (1996).
2. N. N. Gehrig, V. Lepetit, and P. Fua. Golf club visual tracking for enhanced swing anal-
ysis tools. In British Machine Vision Conference, Norwich, UK (September, 2003).
3. A. F. Bobick and J. W. Davis, The recognition of human movement using temporal
templates, IEEE Transactions on Pattern Analysis and Machine Intelligence. 23(3),
257–267, (2001). URL citeseer.nj.nec.com/bobick01recognition.html.
4. R. Vaughan, N. Sumpter, J. Henderson, A. Frost, and S. Cameron, Experiments in
automatic flock control, Robotics and Autonomous Systems. 31, 109–117, (2000).
5. R. Jeanson, S. Blanco, R. Fournier, J. Deneubourg, V. Fourcassie, and G. Theraulaz, A
model of animal movements in a bounded space, Journal of Theoretical Biology. 225,
443–451 (Dec, 2003).
6. Z. Khan, T. Balch, and F. Dellaert. Efficient Particle Filter-Based Tracking of Multiple
Interacting Targets Using an MRF-based Motion Model. In Proceedings of the 2003
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’03),
(2003).
7. T. Moeslund and E. Granum, A survey of computer vision-based human motion cap-
ture, Computer Vision and Image Understanding. 81(3), 231–268, (2001).
8. T. Cootes, C. Taylor, and D. Cooper, Active shape-models - their training and applica-
tions, Computer Vision and Image Understanding. pp. 38–59, (1995).
9. J. Jackson, Principal components and factor analysis: part I, Journal of Quality Tech-
nology. 12(4), 201–213 (October, 1980).
10. D. Montgomery, Introduction to statistical quality control. (Wiley, New York, 2001),
4th edition.
11. M. Unser, Splines: A perfect fit for signal and image processing, IEEE Signal Pro-
cessing Magazine. 16(6), 22–38 (November, 1999).
CHAPTER 21
Area and Volume Restoration in Elastically Deformable Solids
1. Introduction
Deformable objects seem to have gained increasing interest during recent
years. Part of this success comes from a desire to interact with objects
that resemble those in real life, which all seem to be deformable at some
level. The next step in interactive applications, such as computer games,
is a more expansive integration of complex physical objects such as
deformable objects. Because CPUs and GPUs today are both advanced
and powerful, it is possible to simulate and animate deformable objects
interactively.
This paper builds on work done by Terzopoulos et al. in 1987 [16],
which focused on a generic model for simulating elastically deformable
objects. The model applies mainly to objects of a very soft nature, due to the elastic properties of the constraint structure. In this model, problems with maintaining integrity arise when simulating deformable solids. We will explain the origins of the instabilities that cause the solids to collapse. We introduce area and volume restoration to the model to deal with the integrity issues. The result is an improved model that is suitable for a satisfactory simulation of solids.
1.1. Background
1.2. Motivation
1.3. Overview
G_ij(r(a)) = ∂r/∂a_i · ∂r/∂a_j,   1 ≤ i, j ≤ 3,    (2)
which is a symmetric tensor. The diagonal of the tensor represents length
measurements along the coordinate directions from the particle in
question. The off-diagonal elements represent angle measurements
between the coordinate directions. When measuring deformation energy
in a solid, we are interested in looking at the change of the shape, with
respect to the natural rest shape, which is described by G_ij^0. The energy
of deformation, ε (r ) , can be described by the weighted Hilbert-Schmidt
matrix norm of the difference between the metric tensors in the deformed
and rest states,
ε(r) = ∫_Ω S(r(a, t)) da₁ da₂ da₃,   where   S(r) = Σ_{i,j=1}^{3} η_ij (G_ij − G_ij^0)²,    (3)
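A minimal sketch of Eqs. (2) and (3) on a discrete particle grid, assuming central differences approximate the partial derivatives with respect to the material coordinates and that the weights η_ij are given as a 3×3 array; the function names are illustrative.

```python
import numpy as np

def metric_tensor(r, h=(1.0, 1.0, 1.0)):
    """Discrete metric tensor of Eq. (2), G_ij = dr/da_i . dr/da_j, for a
    particle grid r of shape (L, M, N, 3) with grid spacings h."""
    d = [np.gradient(r, h[i], axis=i) for i in range(3)]     # partial derivatives of r
    G = np.empty(r.shape[:3] + (3, 3))
    for i in range(3):
        for j in range(3):
            G[..., i, j] = np.sum(d[i] * d[j], axis=-1)      # dot product over x, y, z
    return G

def energy_density(G, G0, eta):
    """Energy density S(r) of Eq. (3): sum_ij eta_ij (G_ij - G0_ij)^2."""
    return np.sum(eta * (G - G0) ** 2, axis=(-2, -1))
```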
2.2. Discretization
where
p[l, m, n] = α_ij[l, m, n] D_j^+(r)[l, m, n],
To solve the equations for all particles at the same time, the values in
the positional grid, r , and the energy grid, e , can be unwrapped into
LMN -dimensional vectors, r and e . With these vectors, the entire
system of equations can be written as
e = K(r) r,    (7)
where K ( r ) is an LMN × LMN sized stiffness matrix, which has
desirable computational properties such as sparseness and bandedness.
We introduce the diagonal LMN × LMN mass matrix M , and damping
matrix C , assembled from the corresponding discrete values of
μ[l , m, n] and γ [l , m, n] , respectively. The equations of the elastically
deformable objects (1) can now be expressed in grid vector form, by the
coupled system of second-order ordinary differential equations,
M ∂²r/∂t² + C ∂r/∂t + K(r) r = f.    (8)
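For illustration, one time step of Eq. (8) under a generic semi-implicit Euler scheme (implicit in the velocities, explicit in the positions) could look like the sketch below; this is only a stand-in and not the exact update rule of [16].

```python
import numpy as np

def semi_implicit_step(r, v, M, C, K, f, dt):
    """Advance Eq. (8), M r'' + C r' + K(r) r = f, by one step dt.
    K is assumed to be assembled at the current positions r; M, C and K
    are dense arrays here for simplicity."""
    A = M / dt + C                      # system matrix of the velocity update
    b = (M / dt) @ v + f - K @ r        # right-hand side from the current state
    v_new = np.linalg.solve(A, b)       # implicit velocity update
    r_new = r + dt * v_new              # explicit position update
    return r_new, v_new
```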
With these equations it is possible to implement real-time dynamic
simulations of deformable solids. To evolve the solid through time we
use the semi-implicit integration method described in [16]. The time
Figure 1. (a) The surface patch will collapse to a curve, when a particle crosses the opposite diagonal. (b) Missing spatial diagonal constraint.
3. Instabilities
Notions from differential geometry are used as a tool to measure
deformation of an elastic object. For solids the 3 × 3 metric tensors are
sufficient to distinguish between the shapes of two objects. However, the
metric tensor of a solid is not sufficient to compute the complex particle
movements of a deformed solid, seeking towards its resting shape. The
discrete off-diagonal components of (2) are the cosine to the angle
between directions through the dot product,
v · w = ‖v‖ ‖w‖ cos θ,   0 ≤ θ ≤ π.    (9)
The angle between two vectors is not dependent on their mutual
orientation, as verified by the domain of θ in (9). This leads to problems
with area restoration on the sides of grid cubes. Figure 1(a) illustrates
this instability. The bold lines and the angle between them form the
natural condition. If particle A is moved towards particle B , it will only
be forced back towards its resting position when 0 < θ < π , as depicted
4. Improvements
To handle the integrity instabilities of the discrete grid cubes we extend
the elasticity constraints in order to improve their ability to prevent
collapsing. Basically, the extension will be done by both replacing and
adding new constraints. The metric tensor is redesigned to stabilize the
area restoration while we introduce a new spatial diagonal metric to
handle volume restoration.
Figure 2. (a) The angular constraints are replaced by (b) two new diagonal constraints
that will define the angular constraints implicitly.
where
p[m, n] = η_ii[m, n] D_i^+(r)[m, n],
p₁[m, n] = η₁₂[m, n] D_d1^+(r)[m, n],    (12)
p₂[m, n] = η₂₁[m, n] D_d2^+(r)[m, n].
Notice that new difference operators arise with the new directions.
These operators work exactly as the operators in the original directions.
E.g. the new first order forward difference operators on the positional
field r become
Area restoration can keep grid cube patches from collapsing. However,
this is not always enough to keep the cubes from collapsing. If a particle
is being forced along its spatial diagonal, the result of the area restoration
will normally push the particle back to its relative point of origin. Yet, if
the force is strong enough to push the particle beyond the center of the
cube, the area restoration will still succeed, but the restoring of the grid
patches will now push the particle further along the diagonal. This is an
analogy to the instability problem discussed in section 3.
To implement volume restoration, we introduce the spatial diagonal
metric, V , which is a 2 × 2 tensor. The four elements of V represent
length constraints that will be spatially diagonal, meaning they will span
grid cubes volumetrically, as depicted in Figure 3(a),
Figure 3. Spatial length constraints for solids. (a) The four constraints reach out from the
center. (b) The constraint contribution from four particles on a single cube patch renders
symmetric behavior.
V ≡ [ D_v1^+ · D_v1^+    D_v2^+ · D_v2^+
      D_v3^+ · D_v3^+    D_v4^+ · D_v4^+ ],    (14)
where D_v1^+(u), …, D_v4^+(u) are the four new first order forward difference operators along the new spatial diagonal directions. The spatial diagonal
constraints can be chosen to favor any directions, as long as the
contributions from the four particles on a grid cube patch will end up
covering the cube symmetrically, as depicted in Figure 3(b).
The difference operators are designed similarly to the two
dimensional case in (13). To implement volume restoration into the
model, the discrete elastic force e[l , m, n] must be extended to contain
the contributions provided by the spatial diagonal metric. This can
likewise be shown to be as straightforward as the addition of the extended metric tensor.
5. Implosions
With the improved area and volume restorations we can restore the shape
of the discrete grid cubes after deformation. This is an important
improvement towards keeping the integrity of a deformable solid intact.
Another integrity issue still exists since a simulated solid is still unable to
prevent implosions. We define an implosion as when grid cubes enter
Figure 4. (a) Grid cube implosion is avoided using (b) Central differences that bind
adjacent grid cubes together.
P[l, m, n] = diag( D₁²(r), D₂²(r), D₃²(r) ),    (15)
where
D₁(u)[l, m, n] = (2h₁)⁻¹ ( u[l+1, m, n] − u[l−1, m, n] ),
D₂(u)[l, m, n] = (2h₂)⁻¹ ( u[l, m+1, n] − u[l, m−1, n] ),    (16)
D₃(u)[l, m, n] = (2h₃)⁻¹ ( u[l, m, n+1] − u[l, m, n−1] ).
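A sketch of the central difference operators of Eq. (16) on the particle grid; the edge-replication boundary handling is an assumption, since the excerpt does not specify it.

```python
import numpy as np

def central_difference(u, axis, h):
    """Central difference of Eq. (16) along one grid axis of u, e.g.
    D1(u)[l, m, n] = (2 h1)^-1 (u[l+1, m, n] - u[l-1, m, n])."""
    pad = [(1, 1) if a == axis else (0, 0) for a in range(u.ndim)]
    up = np.pad(u, pad, mode="edge")                   # replicate the boundary values
    n = u.shape[axis]
    fwd = np.take(up, np.arange(2, n + 2), axis=axis)  # u shifted by +1 along `axis`
    bwd = np.take(up, np.arange(0, n), axis=axis)      # u shifted by -1 along `axis`
    return (fwd - bwd) / (2.0 * h)
```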
6. Results
We have implemented the original model from [16] with our
improvements of area and volume restoration and with the simple
prevention of implosion, as described in sections 4 and 5, respectively.
The implementation is publicly available from [4]. Experiments have
revealed that the effects of the spatial diagonal metric do not always
succeed satisfactorily in moving particles back to their natural location.
In Figure 11, a large water lily is deformed while resting on pearls. The improved model does a great job of keeping the water lily fluffy.
7. Conclusion
The original model presented in [16] for simulating elastically
deformable solids turned out to be insufficient for achieving realism.
Even extremely modest external forces applied to the solids would ruin
their integrity. In this paper we have shown how replacements to the
metric tensor can be implemented to improve area restoration, and how
to implement the missing volume restoration. Furthermore we have
shown how to handle internal self-intersection using the framework from
the original model. Even though the original model is dated back to 1987
it is still competitive in the field of physically-based simulation. Visual
comparisons have revealed that our improvements to the model provide
deformable solids with the ability to keep their integrity, and thus the
ability to handle large deformations in real-time without collapsing. Our
improved model is a viable alternative to other methods for simulating
deformable solids.
Interesting challenges for future work include using unstructured
meshes instead of the regular 3D grid. However, this will complicate the
use of finite difference operators when approximating derivatives.
Working with solids gives the occasion to use tetrahedral meshes and the
finite element method. The problem of generating tetrahedral meshes
from closed 2D manifolds can be solved using the approach described in
[12]. The advantage of the regular grid approach, taken by this model,
compared to using a tetrahedral mesh is that fewer elements are needed
to represent the deformable solid. It is also possible that other integration
procedures can perform better in terms of numerical stability and thus an
analysis of this field might be beneficial.
Figure 5. A small box is influenced by gravity and collides with a plane. (a) The three
stills illustrate the original model, and (b) the frames from the improved model are
shown.
Figure 6. Rubber balls. (a) Illustrates the situation from the original model, where the ball
is unable to maintain its integrity, (b) the same situation is depicted, but simulated using
the improved model.
Figure 9. Constraint strength is increased interactively and yields the effect of inflation.
References
1. D. Baraff and A. Witkin. “Large Steps in Cloth Simulation”. Proceedings of the
Annual ACM SIGGRAPH'98 Conference, ACM Press, Vol. 33, pp. 43-54, 1998.
2. R. Bridson, R. Fedkiw, and J. Anderson. “Robust Treatment of Collisions, Contact
and Friction for Cloth Animation”. In Proceedings of ACM SIGGRAPH 2002, pp.
594-603, 2002.
3. D. Eberly. Derivative Approximation by Finite Differences. Magic Software, Inc.,
January 21, 2003.
4. K. Erleben, H. Dohlmann, J. Sporring, and K. Henriksen. The OpenTissue project.
Department of Computer Science, University of Copenhagen, DIKU, November
2003, http://www.opentissue.org
5. B. Heidelberger, M. Teschner, and M. Gross. “Detection of Collisions and Self-
collisions Using Image-space Techniques”. In Proceeding of WSCG'04, University
of West Bohemia, Czech Republic, pp. 145-152, 2004.
6. G. Irving, J. Teran, and R. Fedkiw. “Invertible Finite Elements for Robust
Simulation of Large Deformation”. ACM SIGGRAPH/Eurographics Symposium on
Computer Animation (SCA), pp. 131-140, 2004.
7. T. Jakobsen. “Advanced Character Physics”. In Proceedings of Game Developer’s
Conference 2001, 2001.
8. D. L. James and D. K. Pai. “BD-Tree: Output-Sensitive Collision Detection for
Reduced Deformable Models”. In Proceedings ACM SIGGRAPH 2004, pp. 393-
398, 2004.
9. T. Larsson and T. Akenine-Möller. “Collision Detection for Continuously
Deforming Bodies”. In Eurographics 2001, pp. 325-333, 2001.
10. M. Müller, J. Dorsey, L. McMillan, R. Jagnow, and B. Cutler. “Stable Real-Time
Deformations”. In Proceedings of ACM SIGGRAPH 2002, pp 49-54, 2002.
11. M. Müller and M. Gross. “Interactive Virtual Materials”. In Proceedings of
Graphics Interface (GI 2004), pp 239-246, 2004.
12. M. Müller and M. Teschner. “Volumetric Meshes for Real-Time Medical
Simulations”. In Proceedings of BVM, pp. 279-283, 2003.
13. X. Provot. “Deformation constraints in a mass-spring model to describe rigid cloth
behavior”. In Graphics Interface 1995, pp. 147–154, 1995.
14. J. R. Shewchuk. An Introduction to the Conjugate Gradient Method Without the
Agonizing Pain. Carnegie Mellon University, 1994.
15. D. Terzopoulos and K. Fleischer. “Modeling Inelastic Deformation: Viscoelasticity,
Plasticity, Fracture”. In Computer Graphics, Volume 22, Number 4, August 1988,
pp. 269-278, 1988.
16. D. Terzopoulos, J. C. Platt, A. H. Barr, and K. Fleischer. “Elastically Deformable
Models”. Computer Graphics, volume 21, Number 4, July 1987, pp 205-214, 1987.
CHAPTER 22
The proposed work is part of a project that aims for the control of a videogame
based on hand gesture recognition. This goal implies the restriction of real-time
response and unconstrained environments. In this paper we present a new algo-
rithm to track and recognise hand gestures for interacting with videogames. This
algorithm is based on three main steps: hand segmentation, hand tracking and
gesture recognition from hand features. For the hand segmentation step we use
the colour cue due to the characteristic colour values of human skin, its invariant
properties and its computational simplicity. To prevent errors from hand segmen-
tation we add a second step, hand tracking. Tracking is performed assuming a
constant velocity model and using a pixel labeling approach. From the tracking
process we extract several hand features that are fed to a finite state classifier
which identifies the hand configuration. The hand can be classified into one of
the four gesture classes or one of the four different movement directions. Finally,
the system’s performance evaluation results are used to show the usability of the
algorithm in a videogame environment.
22.1. Introduction
time instant.
To facilitate this process many gesture recognition applications resort to the
use of uniquely coloured gloves or markers on hands or fingers.2 In addition,
using a controlled background makes it possible to locate the hand efficiently,
even in real-time.3 These two conditions impose restrictions on the user and on
the interface setup. We have specifically avoided solutions that require coloured
gloves or markers and a controlled background because of the initial requirements
of our application. It must work for different people, without any complement on
them and also for unpredictable backgrounds.
Our application uses images from a low-cost web camera placed in front of
the work area, where the recognised gestures act as the input for a computer 3D
videogame. The players, rather than pressing buttons, must use different hand
gestures that our application should recognise. This fact increases the complexity,
since the response time must be very fast: users should not perceive a significant
delay between the instant they perform a gesture or motion and the instant the
computer responds. Therefore, the algorithm must provide real-time performance
on a conventional processor. Most of the known hand tracking and recognition
algorithms do not meet this requirement and are inappropriate for a visual interface.
For instance, particle filtering-based algorithms can maintain multiple hypotheses
at the same time to robustly track the hands, but they have high computational
demands.4
filters have been presented, for example, using a deterministic process to help the
random search.5 Also in Bretzner et al.,6 we can see a multi-scale colour feature
for representing hand shape and particle filtering that combines shape and colour
cues in a hierarchical model. The system has been fully tested and seems robust
and stable. To our knowledge the system runs at about 10 frames/second and
does not consider several hand states. However, these algorithms only work in
real time for a reduced-size hand, whereas in our application the hand fills most of
the image. In Ogawara et al.,7 shape reconstruction is quite precise, a high DOF
model is considered, and in order to avoid self-occlusions infrared orthogonal
cameras are used. The authors propose to apply this technique using a colour skin
segmentation algorithm.
In this paper we propose a real-time non-invasive hand tracking and gesture
recognition system. In the next sections we explain our method, which is divided
into three main steps. The first step is hand segmentation: the image region that
contains the hand has to be located. In this process the use of shape cues is
possible, but hand shapes vary greatly during natural hand motion.8 Therefore, we
choose skin colour as the hand feature. Skin colour is a distinctive cue of
hands and it is invariant to scale and rotation. The next step is to track the position
and orientation of the hand to prevent errors in the segmentation phase. We use a
pixel-based tracking for the temporal update of the hand state. In the last step we
use the estimated hand state to extract several hand features to define a determin-
istic process of gesture recognition. Finally, we present the system’s performance
evaluation results that prove that our method works well in unconstrained envi-
ronments and for several users.
The hand must be located in the image and segmented from the background before
recognition. Colour is the selected cue because of its computational simplicity, its
invariant properties with respect to hand shape configurations, and the characteristic
values of human skin colour. Also, the assumption that colour can be
used as a cue to detect faces and hands has been proved useful in several publi-
cations.9,10 For our application, the hand segmentation has been carried out using
a low computational cost method that performs well in real time. The method is
based on a probabilistic model of the skin-colour pixels distribution. Then, it is
necessary to model the skin-colour of the user’s hand. The user places part of
his hand in a learning square as shown in Fig. 22.1. The pixels restricted in this
area will be used for the model learning. Next, the selected pixels are transformed
from the RGB-space to the HSL-space and the chroma information is taken: hue
and saturation.
We have encountered two problems in this step that have been solved in a
pre-processing phase. The first one is that human skin hue values are very near
to red colour, that is, their value is very close to 2π radians, so it is difficult to
learn the distribution due to the hue angular nature that can produce samples on
both limits. To solve this inconvenience the hue values are rotated π radians. The
second problem in using HSL-space appears when the saturation values are close
to 0, because then the hue is unstable and can cause false detections. This can be
avoided by discarding saturation values near 0.
Once the pre-processing phase has finished, the hue, h, and saturation, s, val-
ues for each selected pixel are used to infer the model, that is, X = (x1 , . . . , xn ),
where n is the number of samples and a sample is xi = (hi , si ). A Gaussian
model is chosen to represent the skin-colour probability density function. The
values for the parameters of the Gaussian model (mean, μ, and covariance matrix,
Σ) are computed from the sample set using standard maximum likelihood meth-
ods.11 Once they are found, the probability that a new pixel, x, is skin can be
calculated as
$P(x) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)\,\Sigma^{-1}(x-\mu)^T\right).$  (22.1)
Fig. 22.2. Hand contours for different backgrounds (1st row) and different light conditions (2nd row).
USB cameras are known for the low quality images they produce. This fact can
cause errors in the hand segmentation process. In order to make the application
robust to these segmentation errors we add a tracking algorithm. This algorithm
tries to maintain and propagate the hand state over time.
We represent the hand state in time t, st , by means of a vector, st =
(pt , wt , αt ), where p = (px , py ) is the hand position in the 2D image, the
hand size is represented by w = (w, h), where w is the hand width and h is
the hand height in pixels, and, finally, α is the hand’s angle in the 2D image
plane. First, from the hand state at time t we build a hypothesis of the hand state,
h = (pt+1 , wt , αt ), for time t + 1 by applying a simple second-order autoregressive
process to the position component
pt+1 − pt = pt − pt−1 (22.2)
Equation (22.2) expresses a dynamical model of constant velocity. Next, if we
assume that at time t, M blobs have been detected, B = {b1 , . . . , bj , . . . , bM },
where each blob bj corresponds to a set of connected skin-colour pixels, the track-
ing process has to set the relation between the hand hypothesis, h, and the obser-
vations, bj , over time.
In order to cope with this problem, we define an approximation to the distance
from the image pixel, x = (x, y), to the hypothesis h. First, we normalize the
image pixel coordinates
n = R · (x − pt+1 ), (22.3)
where R is a standard 2D rotation matrix about the origin, α is the rotation angle,
and n = (nx , ny ) are the normalized pixel coordinates. Then, we can find the
crossing point, c = (cx , cy ), between the hand hypothesis ellipse and the normal-
ized image pixel as follows
cx = w · cos(θ),   cy = h · sin(θ),  (22.4)
where θ is the angle between the normalized image pixel and the hand hypothesis.
Finally, the distance from an image pixel to the hand hypothesis is
$d(x, h) = \|n\| - \|c\|.$  (22.5)
This distance can be seen as an approximation of the distance from a point
in the 2D space to a normalized ellipse (normalized means centred at the origin and
not rotated). From the distance definition of Eq. (22.5) it turns out that its value
is less than or equal to 0 if x is inside the hypothesis h, and greater than 0 if it is outside.
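The following is a small sketch of Eqs. (22.3)-(22.5); the assumption that w and h act as the ellipse semi-axes and the rotation matrix convention are ours.

import numpy as np

def ellipse_distance(x, p, w, h, alpha):
    """Approximate signed distance from image pixel x to the hand hypothesis
    ellipse centred at p, with axes (w, h) and orientation alpha.
    Negative or zero inside the hypothesis, positive outside."""
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, s], [-s, c]])                         # rotate into the ellipse frame
    n = R @ (np.asarray(x, float) - np.asarray(p, float))   # Eq. (22.3)
    theta = np.arctan2(n[1], n[0])                          # angle of the normalized pixel
    cx, cy = w * np.cos(theta), h * np.sin(theta)           # Eq. (22.4)
    return np.hypot(n[0], n[1]) - np.hypot(cx, cy)          # Eq. (22.5): ||n|| - ||c||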
Our gesture alphabet consists of four hand gestures and four hand directions in
order to fulfil the application’s requirements. The hand gestures correspond to a
fully opened hand (with separated fingers), an opened hand with fingers together, a
fist and the last gesture appears when the hand is not visible, in part or completely,
in the camera’s field of view. These gestures are defined as Start, Move, Stop and
the No-Hand gesture respectively. Also, when the user is in the Move gesture,
he can carry out Left, Right, Front and Back movements. For the Left and Right
movements, the user will rotate his wrist to the left or right. For the Front and Back
movements, the hand will get closer to or further from the camera. Finally, the
valid hand gesture transitions that the user can carry out are defined in Fig. 22.3.
The process of gesture recognition starts when the user’s hand is placed in
front of the camera’s field of view and the hand is in the Start gesture, that is, the
hand is fully opened with separated fingers. In order to avoid unintended fast hand
gesture changes, every change must be held for a predefined number of frames;
otherwise, the hand gesture does not change from the previously recognised gesture.
To achieve this gesture recognition, we use the hand state estimated in the
tracking process, that is, s = (p, w, α). This state can be viewed as an ellipse ap-
proximation of the hand where p = (px , py ) is the ellipse centre and w = (w, h)
is the size of the ellipse in pixels. To facilitate the process we define the major
axis length as M and the minor axis length as m. In addition, we compute the
hand’s blob contour and its corresponding convex hull using standard computer
vision techniques. From the hand’s contour and the hand’s convex hull we can
calculate a sequence of contour points between two consecutive convex hull ver-
tices. This sequence forms the so-called convexity defect (i.e., a finger concavity)
and it is possible to compute the depth of the ith-convexity defect, di . From these
depths it is possible to compute the depth average, d, as a global hand feature, see
Eq. 22.6, where n is the total number of convexity defects in the hand’s contour,
see Fig. 22.4.
Fig. 22.4. Extracted features for the hand gesture recognition. In the right image, u and v indicate
the start and the end points of the convexity defect, the depth, d, is the distance from the farthermost
point of the convexity defect to the convex hull segment.
$\bar{d} = \frac{1}{n}\sum_{i=0}^{n} d_i.$  (22.6)
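Since the system is implemented with the OpenCV libraries (Sec. 22.5), a possible sketch of the depth-average feature of Eq. (22.6) using the modern Python bindings (OpenCV 4.x assumed) is shown below; the contour selection policy is an assumption of the sketch.

import cv2
import numpy as np

def mean_defect_depth(mask):
    """Average convexity-defect depth of Eq. (22.6) from a binary hand mask."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)        # assume largest blob is the hand
    hull = cv2.convexHull(contour, returnPoints=False)  # hull as contour indices
    defects = cv2.convexityDefects(contour, hull)
    if defects is None:
        return 0.0
    depths = defects[:, 0, 3] / 256.0                   # fixed-point depths to pixels
    return float(depths.mean())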
The first step of the gesture recognition process is to model the Start gesture.
The average of the depths of the convexity defects of an opened hand with sep-
arated fingers is larger than in an open hand with no separated fingers or in a
fist. This feature is used for differentiating the next hand gesture transitions: from
Stop to Start; from Start to Move; and from No-Hand to Start. However, first it is
necessary to compute the Start gesture feature, Tstart . Once the user is correctly
placed in the camera’s field of view with the hand widely opened the skin-colour
learning process is initiated. The system also computes the Start gesture feature
for the n first frames,
$T_{start} = \frac{1}{2n}\sum_{t=0}^{n} \bar{d}(t).$  (22.7)
Once the Start gesture is identified, the most probable valid gesture change
is the Move gesture. Therefore, if the current hand depth is less than Tstart the
system goes to the Move hand gesture. If the current hand gesture is Move the
hand directions will be enabled: Front, Back, Left and Right.
If the user does not want to move in any direction, he should set his hand in
the Move state. The first time that the Move gesture appears, the system computes
the Move gesture feature, Tmove , that is an average of the approximated area of
the hand for n consecutive frames,
$T_{move} = \frac{1}{n}\sum_{t=0}^{n} M(t)\cdot m(t).$  (22.8)
In order to recognise the Left and Right directions, the calculated angle of
the fitted ellipse is used. To prevent non desired jitter effects in orientation, we
introduce a predefined constant Tjitter . Then, if the angle of the ellipse that cir-
cumscribes the hand, α, satisfies α > Tjitter , Left orientation will be set. If the
angle of the ellipse that circumscribes the hand, α, satisfies α < −Tjitter , Right
orientation will be set.
In order to control the Front and Back orientations and to return to the Move
gesture, the hand must not be rotated, and the Move gesture feature is used to
differentiate these movements. If Tmove · Cfront < M · m holds, the hand orientation
will be Front. The Back orientation is set if Cback > m/M.
The Stop gesture is recognised using the ellipse’s axes. When the hand is
a fist, the fitted ellipse is almost a circle and m and M are practically the
same, that is, Cstop > M − m.
Cf ront , Cback and Cstop are predefined constants established during the algo-
rithm performance evaluation. Finally, the No-Hand state will appear when the
system does not detect the hand, the size of the detected hand is not large enough
or when the hand is in the limits of the camera’s field of view. The next possi-
ble hand state will be the Start gesture and it will be detected using the transition
procedure from Stop to Start explained earlier on.
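A compact sketch of the resulting finite-state classifier is given below; the ordering of the tests and the omission of the frame-debouncing step are assumptions of this sketch, while the thresholds Tstart, Tmove, Tjitter and the constants Cfront, Cback, Cstop are those defined above.

def classify_gesture(prev_gesture, d_mean, M, m, alpha,
                     T_start, T_move, T_jitter, C_front, C_back, C_stop):
    """One classification step from the hand features: d_mean is the depth
    average of Eq. (22.6), M and m the major/minor axes of the fitted ellipse,
    alpha its orientation. Returns the new gesture or direction label."""
    if prev_gesture in ('Stop', 'No-Hand', 'Start') and d_mean > T_start:
        return 'Start'                       # widely opened hand with separated fingers
    if prev_gesture == 'Start' and d_mean < T_start:
        return 'Move'                        # fingers together: enable directions
    if prev_gesture in ('Move', 'Left', 'Right', 'Front', 'Back'):
        if C_stop > M - m:                   # ellipse almost a circle: fist
            return 'Stop'
        if alpha > T_jitter:
            return 'Left'
        if alpha < -T_jitter:
            return 'Right'
        if T_move * C_front < M * m:         # hand closer to the camera
            return 'Front'
        if C_back > m / M:                   # hand further from the camera
            return 'Back'
        return 'Move'
    return prev_gesture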
Some examples of gesture transitions and the recognised gesture results can
be seen in Fig. 22.5. These examples are chosen to show the algorithm robustness
for different lighting conditions, hand configurations and users. We note that a
correct learning of the skin colour is very important; otherwise, problems with
the detection and the gesture recognition can be encountered. One of the main
difficulties in using the application is hand control: keeping the hand in the
camera’s field of view without touching the limits of the capture area.
This problem has been shown to disappear with user training.
Fig. 22.5. Gesture recognition examples for different lighting conditions, users and hand configura-
tions.
In this section we describe the accuracy of our hand tracking and gesture recog-
nition algorithm. The application has been implemented in Visual C++ using the
OpenCV libraries.13 The application has been tested on a Pentium IV running at
1.8 GHz. The images have been captured using a Logitech Messenger WebCam
with USB connection. The camera provides 320x240 images at a capture and
processing rate of 30 frames per second.
For the performance evaluation of the hand tracking and gesture recognition,
the system has been tested on a set of 40 users. Each user has performed a pre-
defined set of 40 gestures and therefore we have 1600 gestures to evaluate the
application results. The system’s accuracy is naturally measured by evaluating the
performance of the user movements required to manage the videogame; the test
sequence included all the application’s possible states and transitions. Figure 22.6
shows the performance evaluation results. These results are represented using a
bidimensional matrix with the application states as columns and the number of
appearances of each gesture as rows. The columns are paired for each gesture:
the first column is the number of tests in which the gesture has been correctly
identified; the second column is the total number of times the gesture has been
carried out. As can be seen in Fig. 22.6, the hand gesture recognition works
correctly in 98% of the cases.
22.6. Conclusions
In this paper we have presented a real-time algorithm to track and recognise hand
gestures for human-computer interaction within the context of videogames. We
have proposed an algorithm based on skin colour hand segmentation and tracking
for gesture recognition from extracted hand morphological features. The system’s
performance evaluation results have shown that the users can substitute traditional
interaction metaphors with this low-cost interface.
The experiments have confirmed that continuous training of the users results
in higher skills and, thus, better performance. The system has also been tested in
an indoor laboratory with a changing background and low light conditions. In
these cases the system ran well, with the logical exception of skin-like background
situations or several hands intersecting in the same space and time. The system
should be improved to discard bad classifications due to the segmentation
procedure; in such cases, the user can restart the system simply by going
to the Start hand state.
Acknowledgements
References
1. V.I. Pavlovic, R. Sharma, T.S. Huang. Visual interpretation of hand gestures for
human-computer interaction: a review, IEEE Pattern Analysis and Machine Intelli-
gence, 19(7), 677–695, (1997).
2. R. Bowden, D. Windridge, T. Kadir, A. Zisserman, M. Brady. A Linguistic Feature
Vector for the Visual Interpretation of Sign Language. In: Proc. European Conference
on Computer Vision (ECCV04), vol. 1, pp. 391–401, LNCS3022, Springer-Verlag,
(2004).
3. J. Segen, S. Kumar. Shadow gestures: 3D hand pose estimation using a single camera.
In: Proc. of the Computer Vision and Pattern Recognition Conference (CVPR99), vol.
1, 485, (1999).
4. M. Isard, A. Blake. ICONDENSATION: Unifying low-level and high-level track-
ing in a stochastic framework. In: Proc. European Conference on Computer Vision
(ECCV98), pp. 893–908, (1998).
5. C. Shan, Y. Wei, T. Tan, F. Ojardias. Real time hand tracking by combining particle
filtering and mean shift. In: Proc. Sixth IEEE Automatic Face and Gesture Recognition
(FG04), pp. 229–674, (2004).
6. L. Bretzner, I. Laptev, T. Lindeberg. Hand Gesture Recognition using Multi-Scale
Colour Features, Hierarchical Models and Particle filtering. In: Proc. Fifth IEEE In-
ternational Conference on Automatic Face and Gesture Recognition (FRG02), (2002).
7. K. Ogawara, K. Hashimoto, J. Takamtsu, K. Ikeuchi. Grasp Recognition using a 3D
Articulated Model and Infrared Images. In: Proc. Intelligent Robots and Systems
(IROS03), vol. 2, pp. 1590–1595, (2003).
8. T. Heap, D. Hogg. Wormholes in shape space: tracking through discontinuous changes
in shape. In: Proc. Sixth International Conference on Computer Vision (ICCV98), pp.
344–349, (1998).
9. G.R. Bradski. Computer video face tracking for use in a perceptual user interface. Intel
Technology Journal, Q2’98, (1998).
10. D. Comaniciu, V. Ramesh. Robust detection and tracking of human faces with an
active camera. In: Proc. of the Third IEEE International Workshop on Visual Surveil-
lance, pp. 11–18, (2000).
11. C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, (1995).
12. J. Varona, J.M. Buades, F.J. Perales. Hands and face tracking for VR applications.
Computers & Graphics, 29(2), 179–187, (2005).
13. G.R. Bradski, V. Pisarevsky. Intel’s Computer Vision Library. In: Proc of IEEE Con-
ference on Computer Vision and Pattern Recognition (CVPR00), vol. 2, pp. 796–797,
(2000).
CHAPTER 23

A Novel Approach to Sparse Histogram Image Lossless Compression Using JPEG2000
23.1. Introduction
The JPEG20001–8 started its standard formalization in 1997 and became an ISO
standard in late 2000, confirming itself as the new reference point for researches
in the field of still image compression. Among several innovations, the use of
wavelets instead of the DCT (Discrete Cosine Transform, first introduced in,9 based
on Fourier analysis, and used by the JPEG standard10,11) allows multiresolu-
tion processing, preservation of spatial locality information and adaptivity on the
image content. JPEG2000 obtains better results in terms of compression ratios,
image quality and flexibility according to user demands. The JPEG2000 cod-
ing scheme is quite similar in philosophy to the EZW12 and SPIHT13 algorithms,
even if it uses different data structures. Furthermore, the architectural design of
the JPEG2000 allows several degrees of freedom aimed at tailoring the processing
toward specific needs.
In the literature, there are several proposed approaches that modify only the
coder (thus preserving the standard compatibility14–26 ) or both the coder and the
decoder. Our approach belongs to the second group and the gain over the standard
will be motivated from a theoretical point of view (by definition of a gain function,
see Section 23.4.2) and by the experimental results. A very brief and preliminary
version of our algorithms has been described in;27 here we give a full review of new
experiments to validate our approach, and we add new considerations about the
comparison to JPEG2000, PNG, and JPEG-LS. In fact we present concepts, theory
and results about two novel algorithms which can be used to enhance performance
(in terms of compression ratio) of lossless compression of images28–32 coded with
JPEG2000 standard. In particular, the proposed algorithms modify the compres-
sion chain in the bit-plane encoding step, allowing adaptations particularly suited
for sparse histogram images, for which the gain is at its best. For completeness’
sake, a brief introduction to JPEG2000 is given in Section 23.2; Section 23.3
gives the basis to understand the theory of the work through an original definition
of the histogram sparsity concept. Section 23.4 describes the two algorithms. Ex-
perimental results are fully reported in Sections 23.4.1.1 and 23.4.1.3 for the first
proposal, and in Section 23.4.2.2 for the second one. Conclusions (Section 23.5)
and suggestions for further works end the paper.
Fig. 23.1. JPEG2000 compression chain at a system level: in the proposed algorithms, modifications
occur in bit-plane encoding block.
23.2.1. Preprocessing
By referring to Figure 23.1, the first stage of JPEG2000 compression chain in-
volves a resampling of pixel values to obtain symmetry around zero: if si is the
ith sample of an input image (of size D = V ∗ H, where V and H are the vertical
and horizontal dimensions, respectively), and bps is the bit per sample rate, this is
done through relation in Equation 23.1:
$\forall i \in [0, D-1] \qquad s'_i = s_i - 2^{bps-1}$  (23.1)
so that $\forall i \in [0, D-1]$ the old samples $s_i$ are in the range $[0, 2^{bps}-1]$ while
the new ones $s'_i$ are in $[-2^{bps-1}, 2^{bps-1}-1]$. This operation enhances the decorre-
lation among samples and helps in making statistical assumptions on the sample
distribution.
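A one-line sketch of the level shift of Equation 23.1 (NumPy assumed):

import numpy as np

def level_shift(samples, bps=8):
    """DC level shift of Eq. (23.1): map samples from [0, 2**bps - 1]
    to the symmetric range [-2**(bps-1), 2**(bps-1) - 1]."""
    return np.asarray(samples, dtype=np.int32) - 2 ** (bps - 1)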
The two wavelet transforms used in JPEG2000 coding chain are LeGall 5/3
and Daubechies 9/7.38 The characteristics of the former (good approximation
property and the shortest biorthogonal filters available) lead it to be used for loss-
less compression, while the latter (good orthogonality, despite a non-optimal
smoothness) is more suitable for the lossy one. In order to reduce computational
load, instead of using the original Mallat multiresolution scheme39 the wavelet
decomposition is carried out using a lifting scheme.40 Starting from the origi-
nal image I, the coefficients produced by the multiresolution computation of the
wavelet transform are organized in bands (conventionally referred to as LL, HL,
LH and HH) and levels, depicted in Figure 23.2.
The coefficient coding is composed into several functional units, described in the
current section (Figure 23.3). The first one, called Tier-1, processes wavelet co-
efficients and generates a bitstream; the second one, called Tier-2, organizes the
bitstream according to user specifications. In Tier-1, after the wavelet coefficients
have been computed, the whole subsequent processing is organized into non over-
lapping code-blocks of (typically 64 by 64) coefficients; each code block is passed
to the coding chain independently. This creates the input for the EBCOT41 algorithm.
The main idea behind this algorithm is to code coefficients “by difference” between
two sets, significant and non-significant ones; the main effort goes into localizing
the non-significant areas of blocks (which occur with higher probability):
in this way the position of significant coefficients is easily determined. This fun-
damental idea of EZW is implemented in JPEG2000 by bit-plane coding (Figure
23.4(a)) in which a particular scan order of coefficient inside each code block is
applied (Figure 23.4(b)).
The implementation is slightly different from the original EZW because
thresholding is abandoned and the significance of a coefficient is determined in
bit-plane scanning according to the bare value of the bit and the relationship with
neighborhood coefficients. The bit-plane scan unit is called stripe, which is four
bits high and as wide as the code block (Figure 23.4(b)). For simplicity, we call a
single column of a stripe a quadruplet: this term will be used further in the paper.
The bit-plane coding is implemented in a sequence of three passes called significance,
refinement and cleanup: each sequence is executed on every bit plane, starting from
the MSB to the LSB.

Fig. 23.4. (a) Wavelet coefficients decomposed into bit-planes, from bit-plane 3 (MSB) down to bit-plane 0 (LSB); (b) the stripe-oriented scan order inside a code block.
In this pass each stripe is scanned, bit by bit, to decide if a coefficient is significant
or not. For clarity sake, the symbols that will be used are presented in Table 23.1
(here, x and y identify the current position in the block). If the current coefficient
b(x, y) has not already been marked as significant (σ(x, y) = 0) and its neigh-
borhood (Figure 23.5) contains at least one significant coefficient, the value v is
conveyed to the MQ-coder (see Figure 23.3) and the corresponding σ(x, y) is up-
dated (one if v is equal to one, zero otherwise). If v is equal to one, the sign of the
coefficient will be conveyed too. If the coefficient b(x, y) has already been marked
Table 23.1.
Symbol                         Meaning
σ(x, y)                        The significance state matrix
b(x, y)                        The current coefficient being processed
bp                             The current bit-plane being processed
v = b(x, y) AND (1 << bp)      The current bit of coefficient b(x, y) on bit-plane bp
as significant (σ(x, y) = 1), the bit v will be processed in the refinement pass. If
the current coefficient has not already been marked as significant (σ(x, y) = 0)
and its neighborhood does not contain any significant coefficient, the cleanup pass
will take care of it.
This second pass is used to convey the magnitude of coefficients that have already
been found significant (σ(x, y) = 1) in previous passes, conveying v.
This pass takes care of processing all the coefficients discarded by the two previous
passes. It scans every quadruplet: if all four quadruplet coefficients have σ = 0,
their neighborhoods do not contain any significant coefficient, and they do not
become significant in the current bit-plane (all v are equal to zero), a zero bit is
conveyed together with the corresponding context (this configuration has been called
a non significant quadruplet). If all four quadruplet coefficients have σ = 0 and their
neighborhoods do not contain any significant coefficient, but at least one v value is
equal to one, a string is conveyed to identify the position of the first v = 1 and the
remaining coefficients are coded according to the standard significance pass policy.
This particular configuration will be referred to as a half quadruplet. If at least one
of the four quadruplet coefficients has σ = 1, the quadruplet cannot be run-length
coded and its coefficients are coded one by one following the standard significance
pass policy.
23.2.4. MQ-coder
Referring to Figure 23.3, this part takes the values produced by the significance,
refinement or cleanup passes and generates the stream according to a half-context,
half-arithmetic coder. Instead of using a purely arithmetic coder,42
some previously defined probability contexts are used to empower the codeword
production. The contexts have been defined for each pass according to the probability
model of the bitstream. For example, in the significance and refinement passes the
choice of the context depends upon the corresponding band of the coefficient and
the eight connected neighborhood configuration of the currently processed bit (a
reference regarding the approach of MQ-coder can be found in43 ).
The Tier-2 coder is the second part of the EBCOT algorithm. The bitstream produced
by each code block in the previous step is now organized into layers: this is done
by multiplexing and ordering the bitstreams associated with code blocks
and bit-planes. A layer is a collection of some consecutive bit-plane coding passes
from all code blocks in all subbands and in all components.1 The quality of the
image can be tuned in two ways: with the same levels of resolution but varying the
number of layers, one can perceive an image in its original size with different level
of degradation (typically in the form of blurring artifacts); with the same number
of layers, but varying the levels of resolution, the image is perceived always at the
same quality, but in different sizes, obviously smaller for less resolution levels.
The compressed data, now organized in layers, are then organized into pack-
ets, each of them composed by a header, containing important information about
how the packet has been built, and a body. This kind of organization of the final
codestream is done for a better control of the produced quality and for an easier
parsing that will be done by a decoder.
The first of the two proposed algorithms makes modifications on the significance
and refinement passes, while the second one works only on the cleanup pass;
before explaining them in detail, the concept of sparse histogram images is ad-
dressed in order to understand the approach of the work. The aim of this study
is to find an intuitive relation between the term “sparse histogram” and its visual
meaning; therefore, we propose a novel sparsity index definition. For clarity, the
index is composed by two terms, which are defined as follows. Let us take the set
T defined as:
$T \equiv \{\,t \mid t \in [0, 2^{bps}-1]\,\}$  (23.2)
where bps is the bit per sample rate of the image; T will also denote the cardinality
of this set. In this work, 256 gray-level images are taken into account, so bps will
be normally set to 8. We define the function H(t) as the histogram of a given
image, that is, for each image tone t, H(t) is equal to the number of occurrences
of the tone t in the image. Let’s define a threshold th as the mean value of a
normalized version of H(t) (the normalization of H(t) is done on the maximum
value of H(t), so that for every image the histogram values are in the range [0, 1]),
which visually is a discriminant between an image having some gray tones more
prevalent than other, or having all the gray tones playing more or less the same
role. Based upon the histogram function, we define a sparsity index as the sum
of two terms:
I = A+B (23.3)
The computation of A and B is explained in the following paragraphs.
Let $m_1 = \max(H(t))$  (23.6)
be the maximum value of the histogram, and $\bar{t}$ the tone at which it is attained,
$\bar{t} \mid H(\bar{t}) = m_1$  (23.7)
From 23.6 and 23.7, we modify H(t) to produce H' defined as:
$H'(t) = \begin{cases} H(t) & \forall t \neq \bar{t} \\ 0 & \text{if } t = \bar{t} \end{cases}$  (23.8)
The same operations are performed on H', so (Figure 23.6(b)):
$m_2 = \max(H'(t))$  (23.9)
and $\tilde{t}$
$\tilde{t} \mid H'(\tilde{t}) = m_2$  (23.10)
If there is an ambiguity in the choice of $\bar{t}$ and $\tilde{t}$, we consider the values of $\bar{t}$ and
$\tilde{t}$ that minimize the distance $|\bar{t} - \tilde{t}|$ (Figure 23.6(c)). So, the second term (B) is
defined as follows:
$B = \frac{|\bar{t} - \tilde{t}|}{T - 1}$  (23.11)
In order to better highlight the sparsity index boundaries, two synthetic images
have been created: in image 23.7(a) the number of pixels is equal to the number
of gray scale tones, in image 23.7(b) only the two most distant tones (e.g. black
and white) have been used.
When the sparsity index is calculated for the first image (23.7(a)), called
“Smooth”, its histogram is a constant function of value 1 over all the image tones:
so we have
$th = 1$ (the only histogram value)  (23.12)
and, from Equation 23.5,
$A = \frac{0}{T} = 0$ (no tone is below the threshold)  (23.13)
For the B term, we have (see Equation 23.11):
$B = \frac{1}{T-1}$ (each maximum is one tone away from its neighbor)  (23.14)
therefore, from Eqs. 23.3, 23.13, and 23.14 we have
$I = 0 + \frac{1}{T-1} = 0.003922$  (23.15)
Taking into account the second image (23.7(b)), called “Hard”, its histogram
consists of only two peaks at the edge of the graph (t = 0 and t = T − 1), both of
(a) The image histogram H(t) with m1 highlighted by a rounded dot. (b) The image histogram H'(t) with m2 highlighted by a rounded dot.
Fig. 23.6. The steps of the computation of the second term B as in Equation 23.11.
Fig. 23.7. Test images for sparsity index boundaries evaluation. (a) sparsity index is equal to 0 and
(b) is equal to 2.
value T /2 (this height of the peaks is valid only in this case for a “Hard” image
measuring 16 by 16 pixels; in the most general case, for a V*H image, the peak
height is (V*H)/2). For image ”Hard” we have (using the normalized version of
H(t), the two T /2 peaks normalize to 1):
$th = \frac{1}{T}(1 + 0 + \ldots + 0 + 1) = \frac{2}{T}$  (23.16)
$A = \frac{T-2}{T}$  (23.17)
(all the tones but the two at the very edge of the histogram are below the threshold,
which is small but strictly positive). For the B term, we have (Equation 23.11):
$B = \frac{|0 - (T-1)|}{T-1} = 1$ (the two tones are at the maximum distance)  (23.18)
$I = \frac{T-2}{T} + 1 = 1.992188$  (23.19)
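A minimal sketch of the sparsity index computation is given below (NumPy assumed); tie-breaking between equal maxima is simplified to taking the first one, which is enough to reproduce the “Smooth” and “Hard” values above.

import numpy as np

def sparsity_index(image, bps=8):
    """Sparsity index I = A + B (Eqs. 23.3-23.11) for a grayscale image."""
    T = 2 ** bps
    hist = np.bincount(np.asarray(image, dtype=np.int64).ravel(), minlength=T)[:T]
    hn = hist / hist.max()                 # normalized histogram in [0, 1]
    th = hn.mean()                         # threshold
    A = np.count_nonzero(hn < th) / T      # fraction of tones below the threshold
    t1 = int(np.argmax(hn))                # first maximum
    h2 = hn.copy()
    h2[t1] = 0.0                           # Eq. (23.8)
    t2 = int(np.argmax(h2))                # second maximum
    B = abs(t1 - t2) / (T - 1)             # Eq. (23.11)
    return A + B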
Table 23.2. Test images in increasing order according to the sparsity index (Equation 23.3).
The test images are: mountain, mandrill, mehead, peppers, camera, zelda, bride, barb, boat, lena, bird, frog, squares, montage, slope, fractalators, text, library-1, circles, crosses and fpf. Their identifiers, number of unique colors and sparsity index values, in increasing order of I, are:
Id              21   13    4    2   17    5    1   14   16    3   11   10   19   15   18    9   20   12    6    7    8
Unique Colors  187  226  256  145  230  247  221  256  110  224  230  102    4  251  248    6    2  221    4    2    2
I             0.47 0.52 0.55 0.61 0.65 0.66 0.66 0.72 0.72 0.75 0.96 0.97 1.18 1.61 1.67 1.69 1.69 1.69 1.83 1.84 1.84
For the sparsity index I we present, in Figure 23.8, three examples from the tested
images, showing also, with a solid line, the corresponding threshold th: the first
one is zelda (Figure 23.8(a)), the least sparse (I = 0.473), with its histogram; then
an intermediate case, lena (Figure 23.8(b)) (I = 0.955); and finally the most
sparse case, crosses (Figure 23.8(c)) (I = 1.835).
Fig. 23.8. Reading the figures columnwise, for each column the couple picture-histogram is pre-
sented.
The first algorithm proposed in this paper addresses the first part of Tier-1 en-
coder, the significance and the refinement passes: recalling that coefficient bits are
encoded in groups of four (the quadruplets), the aim is to reduce the chang-
ing context overhead by making quadruplets (and, consequently, stripes) longer;
therefore, we call it “stripe lengthening”. We expect that, by lengthening the
stripe, the number of context changes is reduced; this is particularly true for im-
ages with a higher sparseness, because their characteristics are likely to change
less often than other kind of images. For this reason, we expect better result for
images with a higher index I. The lengthening is the same in significance and
refinement pass (while the cleanup pass is performed with the original length 4):
in fact, it would not be coherent to group coefficients in two different ways in the
same encoding process. The reason why we choose to leave the cleanup pass with
its original stripe length of 4 bits is that we prefer to have the two modifications
distinguished; moreover, it does not make sense using longer stripe in the cleanup
pass when the second modification is active: the longer the stripe, the shorter the
quadruplet series.
In order to perform the modification of the algorithm to the encoding process,
two variables (pointers) have been used to keep track of the current quadruplets
and to identify the beginning of the next one and they are properly initialized ac-
cording to the new stripe length. The original JPEG2000 quadruplet length was
fixed to 4 (hence the name). In our algorithm, we test the encoder over various
quadruplet lengths; the range varies from 5 to the block height, but in the ex-
A Novel Approach to Sparse Histogram Image Lossless Compression using JPEG2000 425
perimental results here reported only significant values are considered, namely 8,
16 and 32. We have implemented the algorithm in a way that it is fully com-
patible with the source code Jasper44 in which the piece of code regarding each
quadruplet bit was repeated four times. As we have parameterized the length,
we embodied such code in a counter based loop which cycles until the quadru-
plet length counter becomes zero. The implementation of the stripe lengthening
requires a corresponding modification of the decoder structure, in order to cou-
ple correctly coder and decoder. Various combinations of resolution levels and
stripe lengths have been evaluated; first we present the results of the experiments
for single combinations (stripe length-resolution level - Figure 23.9(a)-23.11(b)),
then three overall graphs will be shown in order to give a general overview of the
combinations.
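A structural sketch of the parameterized scan is shown below; it only illustrates the visiting order induced by a configurable stripe length and leaves out the actual context modelling and MQ coding.

def iterate_stripes(block_width, block_height, stripe_len=4):
    """Visit a code block in horizontal stripes of `stripe_len` rows (4 in
    standard JPEG2000) and, within each stripe, column by column, replacing
    the four unrolled bit operations with a counter-based loop."""
    for stripe_top in range(0, block_height, stripe_len):
        rows = min(stripe_len, block_height - stripe_top)
        for col in range(block_width):
            for k in range(rows):              # counter-based quadruplet loop
                yield stripe_top + k, col      # (row, column) of the next bit

# Example: positions visited for an 8x8 block with 8-bit stripes.
# list(iterate_stripes(8, 8, stripe_len=8))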
Fig. 23.9. Single combination graphs for stripe length equal to 8 bits: for each image, the R value
(Equation 23.20) shows the gain of our coder compared to JPEG2000. Images are ordered increasingly
according to sparsity.
Fig. 23.10. Single combination graphs for stripe lengths equal to 16 bit: for each image, the R value
(Equation 23.20) shows the gain of our coder compared to JPEG2000. Images are ordered increasingly
according to sparsity.
Fig. 23.11. Single combination graphs for stripe length equal to 32 bit: for each image, the R value
(Equation 23.20) shows the gain of our coder compared to JPEG2000.
Fig. 23.12. Overall graph for stripe length equal to 8 bits: for each image, the R value (Equation
23.20) shows the gain of our coder compared to JPEG2000, for 3 to 6 levels of resolution together
with their mean. The best gain is about 8% of the file size coded with standard JPEG2000. The
images are sorted by their sparsity value.
From the several experiments one thing is clearly noticeable: almost all of the
image files with a high sparsity index are significantly smaller when processed by
our algorithm at a low level of resolution: the behavior does not follow the trend
of JPEG2000 encoder, which obtains better results with more resolution levels.
It is otherwise important to notice that in each graph the best performances are
Fig. 23.13. Overall graph for stripe length equal to 16 bits: for each image, the R value (Equation
23.20) shows the gain of our coder compared to JPEG2000, for 3 to 6 levels of resolution together
with their mean. The best gain is about 8% of the file size coded with standard JPEG2000. The
images are sorted by their sparsity value.
Fig. 23.14. Overall graph for stripe length equal to 32 bits: for each image, the R value (Equation
23.20) shows the gain of our coder compared to JPEG2000, for 3 to 6 levels of resolution together
with their mean. The best gain is about 6% of the file size coded with standard JPEG2000. The
images are sorted by their sparsity value.
Several articles in the literature6,45 show that JPEG2000 is now one of the most
promising and best performing formats for compression of still images; even if
some formats perform better in some cases, JPEG2000’s characteristics and versa-
tility justify our choice of having JPEG2000 as our main comparison term. More-
over, as we consider lossless compression only, we consider also JPEG-LS46 as
the “optimum” comparison term. However, we want to give an idea of possible
comparisons to other approaches by taking also into consideration another com-
pression algorithm: the choice has been directed to the PNG47 format, because it is
one of the most recent image formats that has achieved popularity on the Web. In
this section, a comparison table (Table 23.3) among the obtained compression ra-
tios of Standard JPEG2000, stripe lengthening, JPEG-LS, and the PNG format is
given. From the experimental results (summarized in Table 23.3), we point out
that the performance of our algorithm, when applied to sparse histogram images
(from text to fpf), is between standard JPEG2000 and JPEG-LS. It is interesting
to study the effect of multiresolution and stripe length choice over the compres-
sion ratio; so, in order to find the best stripe length and level of resolution, we
define an improving factor F (which is a function of the stripe length and the level
of resolution) computed over the entire set of 21 images:
$F(stripe, level) = \sum_{i=1}^{21} \frac{size_{standard} - size_{modified}^{\,stripe,level}}{size_{uncompressed}}$  (23.21)
From the computation of all the reasonable values of F , we obtained the best
result for 32 bit stripe length and 5 levels of resolution. We underline that the
factor F is a objective measure only, and does not take into account whatever
image characteristic. F computation mediates between highly positive and highly
negative results in each image; therefore, there is no contradiction between this
result and the ones presented in Figures 23.12-23.14.
Fig. 23.15. On top, the mean of the R value is computed over the different stripe lengths of 8, 16,
and 32 bits with 3-6 levels of resolution; on the bottom, the mean is computed over the 3-6 resolution
levels with the different stripe lengths of 8, 16, and 32 bits.
(a) Time comparison. The computation time is always lower (at most equal) than the one obtained with
the standard encoder, and this behavior is more prominent in the high sparsity region.
(b) Size comparison. Our algorithm works better on images with a high sparsity index.
Fig. 23.16. Comparison between 5-level standard JPEG2000 and a 2-level (stripe 8) modified
algorithm.
The second proposal focuses on the last part of Tier-1 coder, which is the cleanup
pass. Recalling the concepts exposed in section 23.2.3.3, pointing at the fact that
each completely non significant quadruplet is conveyed with one bit, a possible
improvement is the following: instead of coding quadruplets with one bit each,
we use a binary word for every set of consecutive non significant quadruplets. The
binary word length is fixed throughout the whole encoding process. The gain is
higher as the number of consecutive non significant quadruplets increases. At the
end of the cleanup pass, it is possible to identify on the current bit-plane different
series of non-significant quadruplets, each with a different length.
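The following sketch illustrates the idea on a sequence of quadruplet flags; the fixed word length W, the splitting of runs longer than 2**W − 1, and the textual markers for quadruplets coded as in the standard pass are illustrative assumptions, not the actual bitstream syntax.

def encode_runs(nonsignificant_flags, word_bits):
    """Convey consecutive non-significant quadruplets (flag True) as
    fixed-width run lengths instead of one bit each. `word_bits` is the fixed
    binary word length W; runs longer than k_max = 2**word_bits - 1 are split."""
    k_max = 2 ** word_bits - 1
    out, run = [], 0
    for flag in list(nonsignificant_flags) + [False]:   # sentinel flushes the last run
        if flag:
            run += 1
            continue
        while run > 0:
            emitted = min(run, k_max)
            out.append(format(emitted, '0{}b'.format(word_bits)))
            run -= emitted
        out.append('quadruplet coded as in the standard cleanup pass')
    return out[:-1]                                      # drop the entry for the sentinel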
Fig. 23.17. Different groups of quadruplets. In this block, eight single quadruplets, s1 through s8,
can be identified. Here we have S3 with cardinality equal to 2, and S2 with cardinality equal to 1.
Sj ∀j). The following equations describe the process in detail referring to how
S is modified. Using the remainder function
$r(x, y) = x - \left\lfloor \frac{x}{y} \right\rfloor y$  (23.22)
Applying the S computation to the Lena image, Figure 23.18 presents the resulting
histogram. The explanation of the high value in the case j = 31 is quite
straightforward: the last bin of the histogram counts all non-significant sequences
composed of 31 quadruplets. Moreover, all sequences composed of more than 31
(in this case) quadruplets will score one point for the bin in position r, and q points
for the bin in the 31st position. For this reason, the last bin takes a value higher
than the others.
Figure 23.20(a) shows the same histogram computation for image Crosses.
The histogram S refers to a particular choice of kMAX, depending on W. Its compu-
tation ends the implementation of the modified cleanup pass. However, as we are
interested in an evaluation of the actual gain (referring to the standard JPEG2000
cleanup pass), we can study the behavior of S when varying the upper limit of l(nscq).
Therefore, we have recalculated the S values for k varying from 1 to kMAX. To do
So now S contains the probability distribution as if jMAX (that is, the maximum
of l(nscq)) could assume all the values in the range 1, . . . , kMAX. Knowing from
Equation 23.24 the entire probability distribution, it is interesting to give an esti-
mation of the acquired gain versus the original coder. We define a gain function
G as
$G(k, k_{MAX}) = \frac{k}{1 + \log_2(k_{MAX} + 1)}$  (23.25)
in which the numerator represents the bits used by the original coder and the
denominator the bits used by the modified one. Function G (Equation 23.25) is
used to weight the quadruplet probability distribution, leading to (assuming that
kMAX is fixed):
$S'(k) = G(k)\,S(k) - 1$  (23.26)
Figure 23.19 presents the final graph showing S' (Equation 23.26). This function
is an evaluation of S(k) weighted by the gain function G, thus giving an estimation
of the gain over the standard JPEG2000 coder for each admissible length of a
consecutive quadruplet series. We point out that in Equation 23.26 the term −1
introduces a shift; therefore, values of S' (Figure 23.19) greater than zero mean a
positive gain. Apart from an initial irregular shape, the function in Figure 23.19
shows a stable gain for a maximum l(nscq) greater than 16. From this point on,
our algorithm performs better. This figure refers to
Fig. 23.19. The weighted histogram S'(k) (Equation 23.26) shows the gain in terms of number of bits
with respect to the original JPEG2000 coder as a function of the quadruplet series length (image Lena).
Lena, but the same behavior is confirmed (or even better) for the other images:
as an example, in Figure 23.20(b) the function in Equation 23.26 is plotted for
image “crosses”. For completeness, we have computed the gain for all the other
test images. The final measure of the convenience of using the SuperRLC is
given by:
$M = \sum_{i=1}^{k_{MAX}} S'(i)$  (23.27)
Equation 23.27 computes in fact the integral function of the weighted histogram
(Equation 23.26), measuring the positive gain area. M values are reported in Table
23.4. It is important to note that this is not an overall gain; it accounts only for
the bits sent by the cleanup pass to the final stage of Tier-1, the MQ-coder.
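A small sketch of Equations 23.25-23.27 is given below (NumPy assumed); whether S holds counts or probabilities is left as in the text, since M is only used as a relative measure.

import numpy as np

def gain(k, k_max):
    """Gain function G of Eq. (23.25): bits used by the original coder (k) over
    bits used by the modified coder for a run of k non-significant quadruplets."""
    return k / (1.0 + np.log2(k_max + 1))

def overall_measure(S, k_max):
    """Weighted histogram S'(k) = G(k) S(k) - 1 (Eq. 23.26) and its sum M
    (Eq. 23.27). `S` is the run-length distribution, S[0] corresponding to k = 1."""
    k = np.arange(1, k_max + 1)
    S_prime = gain(k, k_max) * np.asarray(S[:k_max], dtype=float) - 1.0
    return S_prime, float(S_prime.sum())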
Fig. 23.20. Cardinality computation and weighted S'(k) for image Crosses.
following, with few exceptions, the sparsity order. More generally, for all the im-
ages, there is always a value k̄ for which our algorithm outperforms JPEG2000
for all k > k̄.
Table 23.4. M values (Equation 23.27) for the test images (fractalators, montage, mehead, squares, crosses, camera, circles, slope, barb, bird, fpf, lena, boat, peppers, zelda, library-1, mandrill, mountain, text, bridge and frog), listed by image identifier in decreasing order of M.
Image ID   7      6      19     9      18     8      15     5      14     2      1
M          10.23  9.45   8.78   7.78   6.72   5.08   2.63   0.93   0.51   0.49   0.18
Image ID   11     3      17     21     12     13     16     20     4      10
M          -0.37  -0.84  -0.89  -2.32  -2.89  -3.28  -3.32  -3.84  -4.05  -5.08
shown. For each graph (Figures 23.21(a) and 23.21(b)), the X-axis reports the
usual image identifier (refer to Table 23.2) and the Y-axis reports the quantity
R' defined as:
$R' = \frac{|BP - BO|}{D}$  (23.28)
where BP are the bits sent by the Tier-1 encoder to the MQ-coder in the RLC
method, BO refers to the original JPEG2000 encoder and D is the size of the
original uncompressed image file expressed in bits. As can be seen, the gain
goes from a minimum of 2% to a maximum of 15%.
For these values of W , the experiments always show a positive gain, which is
more impressive for sparse histogram images. The only really relevant exceptions
are for images 12 and 20 (text and library-1, respectively), which do not follow the
behavior dictated by their sparseness; as can be clearly seen (Figures 23.22(a)
and 23.22(b)), these two images have nothing in common with the other sparse
histogram images; their structure is so significantly different (a text
document and a compound image) that it justifies their anomalous behavior. We
underline the fact that also image 8 is a text image, but it performs quite well.
Moreover, the simple fact that an image represents text is not discriminant. The
reason is that image 8 is really a sparse image with a constant white background
and a foreground (the text) which is spatially distributed in a very sparse way. We
mean that in the case of image 8, “sparsity” is not only in the grey level distribution,
but also in the geometrical disposition inside the image. This does not hold for the
second text image (12), where the background is not constant and the text is an au-
tomated list. By carefully observing the plots of Figure 23.21, a further comment
about image 10 (frog) is mandatory. Image 10, despite its index I, has a value of
Fig. 23.21. Ratio between the bits sent by the original and modified encoder.
R' (Equation 23.28) significantly smaller than images 11 and 19; we explain
this fact by observing that the “frog” image is a highly textured one: the
subject (the animal) is so close to the camera that the spatial disposition of pixels
is more similar to a texture than to a typical object on a background.
We prefer to consider R' (Equation 23.28) as a performance index, rather than
comparing the bare file sizes at the end of the entire compression chain, because: a)
we are interested in exploring the positive effects of our modifications on the
significance, refinement, and cleanup group (where they effectively occur); b) we
want to exclude any possible effects of the arithmetic coder, which follows in the
chain.
Fig. 23.22. Figures for which the behavior does not follow the general trend.
The project has been developed on a Celeron 1.5 GHz with 512 MB RAM, using Mi-
crosoft Visual C49 for the JPEG2000 codec and Matlab50 for presenting the results in
a convenient way. The codec is Jasper, versions 1.400 to 1.600, for which a
guide can be found in the Jasper User Manual;44 it implements a JPEG2000 coder
fully reviewed by the creator of Jasper, M.D. Adams.5
23.5. Conclusions
In this paper, we have presented two algorithms for modifying all three passes
of the JPEG2000 Tier-1 coder, that is, significance, refinement, and cleanup. The-
oretical studies have been reported in order to classify the images according to
the concept of sparsity and to compute the saving of conveyed bits. The best re-
sults are obtained for high sparsity images: assuming the spatial preservation of
wavelet coefficients, the idea was that a longer stripe would better group the
(scarce) significant zones in this kind of images. The first algorithm, stripe length-
ening, shows a significant gain in the overall file size at low levels of resolution
with respect to the standard. The second algorithm, SuperRLC, has the advantage of
being independent of the levels of resolution; experiments confirm that it gives a
relevant gain in bits conveyed to the MQ-coder with respect to the standard JPEG2000.
Future work will regard a modification of the arithmetic coder in order to adapt
the probability context to the statistics of the new conveyed symbols. We would
like to use the same approach as51 to extend the arithmetic coder contexts, in order
to evaluate whether it is possible to capture the efficiency of the new bitstream
generated by our algorithm.
References
9. N. Ahmed, T. Natarajan, and K. R. Rao, Discrete cosine transform, IEEE Trans. Com-
puters. C-23, 90–93 (Jan., 1974).
10. G. K. Wallace, The JPEG still picture compression standard, Communications of the
ACM. 34(4), 30–44, (1991).
11. ISO/ITU. ISO 10918 information technology – digital compression and coding of
continuous-tone still images: Requirements and guidelines, (1994).
12. J. M. Shapiro, Embedded image coding using zerotree of wavelet coefficients, IEEE
Trans. Signal Processing. 41, 3445–3462 (Dec., 1993).
13. A. Said and W. A. Pearlman, A new fast and efficient image codec based on set par-
titioning in hierarchical trees, IEEE Transactions on Circuits and Systems for Video
Technology. 6, 243–250 (June, 1996).
14. E. Ardizzone, M. La Cascia, and F. Testa. A new algorithm for bit rate allocation in
JPEG 2000 tile encoding. In Proc. IEEE International Conference on Image Analysis
and Processing (ICIAP’03), pp. 529–534, Mantua, Italy (Sept., 2003).
15. T.-H. Chang, L.-L. Chen, C.-J. Lian, H.-H. Chen, and L.-G. Chen. Computation re-
duction technique for lossy JPEG 2000 encoding through ebcot tier-2 feedback pro-
cessing. In Proc. IEEE International Conference on Image Processing (ICIP’02), pp.
85–88, Rochester, NY (June, 2002).
16. S. Battiato, A. Buemi, G. Impoco, and M. Mancuso. Content - dependent optimiza-
tion of JPEG 2000 compressed images. In Proc. IEEE International Conference on
Consumer Electronics (ICCE’02), pp. 46–47, Los Angeles, CA (June, 2002).
17. K. C. B. Tan and T. Arslan. An embedded extension algorithm for the lifting based
discrete wavelet transform in JPEG 2000. In Proc. IEEE Internation Conference on
Acoustic, Speech, and Signal Processing (ICASSP’02), pp. 3513–3516, Orlando, FL
(May, 2002).
18. M. Kurosaki, K. Munadi, and H. Kiya. Error concealment using layer structure for
JPEG 2000 images. In Proc. IEEE Asia-Pacific Conference on Circuit and Systems
(APCCS’02), pp. 529–534 (Oct., 2002).
19. L. Aztori, A. Corona, and D. Giusto. Error recovery in JPEG 2000 image transmis-
sion. In Proc. IEEE International Conference on Acoustic, Speech, and Signal Pro-
cessing (ICASSP’01), pp. 364–367, Salt Lake City, UT (May, 2001).
20. Y. M. Yeung, O. C. Au, and A. Chang. Successive bit - plane allocation technique
for JPEG2000 image coding. In Proc. IEEE International Conference on Acoustic
Speech, pp. 261–264, Hong Kong, China (Apr., 2003).
21. G. Pastuszak. A novel architecture of arithmetic coder in JPEG2000 based on parallel
symbol encoding. In Proceedings of the IEEE International Conference on Parallel
Computing in Electrical Engineering, pp. 303–308 (Sept., 2004).
22. T. Tillo and G. Olmo, A novel multiple description coding scheme compatible with the
JPEG2000 decoder, IEEE Signal Processing Letters. 11(11), 908–911 (Nov., 2004).
23. W. Du, J. Sun, and Q. Ni, Fast and efficient rate control approach for JPEG2000, IEEE
Transactions on Consumer Electronics. 50(4), 1218 – 1221 (Nov., 2004).
24. K. Varma and A. Bell. Improving JPEG2000’s perceptual performance with weights
based on both contrast sensitivity and standard deviation. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp.
17–21 (May, 2004).
25. T. Kim, H. M. Kim, P.-S. Tsai, and T. Acharya, Memory efficient progressive rate-
distortion algorithm for JPEG 2000, IEEE Transaction on Circuits and Systems for
Video Technology. 15(1), 181–187 (Jan., 2005).
26. K. Vikram, V. Vasudevan, and S. Srinivasan, Rate-distortion estimation for fast
JPEG2000 compression at low bit-rates, IEE Electronics Letters. 41(1), 16–18 (Jan.,
2005).
27. M. Aguzzi. Working with JPEG 2000: two proposals for sparse histogram images. In
Proc. IEEE International Conference on Image Analysis and Processing (ICIAP’03),
pp. 408–411, Mantua, Italy (Sept., 2003).
28. P. J. Ferreira and A. J. Pinho, Why does histogram packing improve lossless compres-
sion rates?, IEEE Signal Processing Letters. 9(8), 259–261 (Aug., 2002).
29. A. J. Pinho, An online preprocessing technique for improving the lossless compression
of images with sparse histograms, IEEE Signal Processing Letters. 9(1), 5–7 (Jan.,
2002).
30. A. J. Pinho. On the impact of histogram sparseness on some lossless image com-
pression techniques. In Proceedings of the IEEE International Conference on Image
Processing, pp. 442–445, (2001).
31. A. J. Pinho. A comparison of methods for improving the lossless compression of im-
ages with sparse histogram. In Proceedings of IEEE International Conference on Im-
age Processing, pp. 673–676, (2002).
32. A. J. Pinho, An online preprocessing technique for improving the lossless compression
of images with sparse histograms, IEEE Signal Processing Letters. 9, 5–7, (2002).
33. C. K. Chui, A. K. Chan, and C. S. Liu, An Introduction To Wavelets. (Academic Press,
San Diego, CA, 1991).
34. M. Vetterli and J. Kovačević, Wavelets And Subband Coding. (Prentice Hall, 1995).
35. G. Strang and T. Nguyen, Wavelets and filters banks. (Wellsley-Cambridge Press,
Wellsley, MA, 1997).
36. I. Daubechies, Ten lectures on wavelets. Number 61 in CBMS Lecture, (SIAM, 1992).
37. P. N. Topiwala, Wavelet Image and Video Compression. (Kluwer Academic Publisher,
Norwell, MA, 1998).
38. M. Unser and T. Blu, Mathematical properties of the JPEG2000 wavelet filters, IEEE
Trans. Image Processing. 12, 1080–1090 (Sept., 2003).
39. S. G. Mallat, A theory for multiresolution signal decomposition: the wavelet repre-
sentation, IEEE Transaction on Pattern Analysis and Machine Intelligence. 11(7),
674–693 (July, 1989).
40. K. Andra, C. Chakrabarti, and T. Acharya. Efficient implementation of a set of lifting
based wavelet filters. In Proc. IEEE International Conference on Acoustic, Speech,
and Signal Processing (ICASSP’01), pp. 1101–1104, Salt Lake City, UT (May, 2001).
41. D. Taubman, High performance scalable image compression with EBCOT, IEEE
Trans. Image Processing. 9, 1158–1170 (July, 2000).
42. I. H. Witten, R. M. Neal, and J. G. Cleary, Arithmetic coding for data compression,
Communication of the ACM. 30(6), 520–540 (June, 1987).
43. M. J. Slattery and J. L. Mitchell, The qx-coder, IBM Journal of Research and Devel-
opment, Data compression technology in ASIC cores. 42(6), (1998).
44. M. D. Adams. Jasper software reference manual (Oct., 2002). URL http://www.
ece.uvic.ca/˜mdadams/jasper/jasper.pdf.
45. D. Santa-Cruz and T. Ebrahimi. An analytical study of JPEG2000 functionalities. In
CHAPTER 24
This paper describes two innovations that improve the efficiency and effective-
ness of a genetic programming approach to object detection problems. The ap-
proach uses genetic programming to construct object detection programs that are
applied, in a moving window fashion, to the large images to locate the objects
of interest. The first innovation is to break the GP search into two phases with
the first phase applied to a selected subset of the training data, and a simplified
fitness function. The second phase is initialised with the programs from the first
phase, and uses the full set of training data with a complete fitness function to
construct the final detection programs. The second innovation is to add a pro-
gram size component to the fitness function. This approach is examined and
compared with a neural network approach on three object detection problems of
increasing difficulty. The results suggest that the innovations increase both the
effectiveness and the efficiency of the genetic programming search, and also that
the genetic programming approach outperforms a neural network approach for
the most difficult data set in terms of the object detection accuracy.
24.1. Introduction
Object detection and recognition tasks arise in a very wide range of applica-
tions,1–7 such as detecting faces from video images, finding tumours in a database
of x-ray images, and detecting cyclones in a database of satellite images. In many
cases, people (possibly highly trained experts) are able to perform the classifica-
tion task well, but there is either a shortage of such experts, or the cost of people is
too high. Given the amount of data that needs to be processed, automated object de-
tection systems are highly desirable. However, creating such automated systems
that have sufficient accuracy and reliability turns out to be very difficult.
Genetic programming (GP) is a relatively recent and fast developing approach
to automatic programming.8,9 In GP, solutions to a problem are represented as
computer programs. Darwinian principles of natural selection and recombina-
tion are used to evolve a population of programs towards an effective solution to
specific problems. The flexibility and expressiveness of computer program repre-
sentation, combined with the powerful capabilities of evolutionary search, makes
GP an exciting new method to solve a great variety of problems.
There have been a number of reports on the use of genetic programming in
object detection.10–16 The approach we have used in previous work15,16 is to use a
single stage approach (referred to as the basic GP approach here), where the GP
is directly applied to the large images in a moving window fashion to locate the
objects of interest. Past work has demonstrated the effectiveness of this approach
on several object detection tasks.
While showing promise, this genetic programming approach still has some
problems. One problem is that the training time was often very long, even for
relatively simple object detection problems. A second problem is that the evolved
programs are often hard to understand or interpret. We have identified two causes
of these problems: the programs are usually quite large and contain much redun-
dancy, and the cost of the fitness function is high. We believe that the size and
redundancy of the programs contributes to the long training times and may also
reduce the quality of the resulting detectors by unnecessarily increasing the size
of the search space and reducing the probability of finding an optimal detector
program. Evaluating the fitness of a candidate detector program in the basic GP
approach involves applying the program to each possible position of a window on
all the training images, which is quite expensive. An obvious solution is to apply
the program to only a small subset of the possible window positions, but it is not
obvious how to choose the subset. A poor choice could bias the evolution towards
programs that are sub-optimal on the real data.
The goal of this paper is to investigate improvements to GP techniques for object
detection, rather than a particular application of GP to object detection.
Specifically, we investigate two innovations on the basic GP approach to
address the problems described above. The first is to split the GP evolution into
two phases, using a different fitness function and just a subset of the training data
in the first phase. The second is to augment the fitness function in the second
phase by a component that biases the evolution towards smaller, less redundant
programs. We consider the effectiveness and efficiency of this approach by com-
paring it with the basic GP approach. We also examine the comprehensibility of
the evolved genetic programs.
The rest of the paper is organised as follows. Section 24.2 gives some essential
background of object detection and recognition and GP related work to object
detection. Section 24.3 describes the main aspects of this approach. Section 24.4
describes the three image data sets and section 24.5 presents the experimental
results. Section 24.6 draws the conclusions and gives future directions.
24.2. Background
The term object detection here refers to the detection of small objects in large
images. This includes both object classification and object localisation. Object
classification refers to the task of discriminating between images of different kinds
of objects, where each image contains only one of the objects of interest. Object
localisation refers to the task of identifying the positions of all objects of interest
in a large image. The object detection problem is similar to the commonly used
terms automatic target recognition and automatic object recognition.
Traditionally, most research on object recognition involves four stages: pre-
processing, segmentation, feature extraction and classification.17,18 The prepro-
cessing stage aims to remove noise or enhance edges. In the segmentation stage, a
number of coherent regions and “suspicious” regions which might contain objects
are usually located and separated from the entire image. The feature extraction
stage extracts domain specific features from the segmented regions. Finally, the
classification stage uses these features to distinguish the classes of the objects of
interest. The features extracted from the images and objects are generally domain
specific such as high level relational image features. Data mining and machine
learning algorithms are usually applied to object classification.
Object detection and recognition has been of tremendous importance in many
application domains. These domains include military applications,10,19,20 shape
matching,2 human face and visual recognition,1,5,21,22 natural scene recognition,4
agricultural product classification,23 handwritten character recognition,24,25 medi-
cal image analysis,26 postal code recognition,27,28 and texture classification.29
Since the 1990s, many methods have been employed for object recogni-
tion. These include different kinds of neural networks,28,30–33 genetic algo-
rithms,34,35 decision trees,36 statistical methods such as Gaussian models and
classes of images are involved. While GP has been widely applied to binary clas-
sification problems,10,38,44,45 it has also been applied to multi-class classification
problems.15,16,22,46–48
In terms of the representation of genetic programs, different forms of genetic
programs have been developed in GP systems for object classification and image
recognition. The main program representation forms include tree or tree-like or
numeric expression programs,8,46,48,49 graph based programs,8 linear GP,50 linear-
graph GP,51 and grammar based GP.52
The use of GP in object/image recognition and detection has also been inves-
tigated in a variety of application domains. These domains include military appli-
cations,10,45 English letter recognition,24 face/eye detection and recognition,22,39,53
vehicle detection13,38 and other vision and image processing problems.9,12,14,54–56
Since the work to be presented in this paper focuses on the use of genetic
programming techniques for object detection, table 24.1 gives an overview of the
related GP work, organised by application and first author.
Figure 24.1 shows an overview of this approach, which has two phases of learning
and a testing procedure. In the first learning phase, the genetic programs
are initialised randomly and trained on object examples cut out from the large
images in the training set. This is just an object classification task, which is sim-
pler than the full object detection task. This phase therefore uses a fitness function
which maximises classification accuracy on the object cutouts.
In the second phase, a second GP process is initialised with the programs
generated by the first phase, and trained on the full images in the training set by
applying the programs to a square input field (“window”) that was moved across
the images to detect the objects of interest. This phase uses a fitness function
that maximises detection performance on the large images in the training set. In
the test procedure, the best refined genetic program is then applied to the entire
images in the test set to measure object detection performance. The process of the
second phase and the GP testing procedure are shown in figure 24.2.
Because the object classification task is simpler than the object detection task,
we expect the first phase to be able to find good genetic programs much more
rapidly and effectively than the second phase. Also, the fitness function is much
easier to evaluate, so that a more extensive evolution can be performed in the
same time. Although simpler, the object classification task is closely related to
Table 24.1. Object recognition and detection related work based on genetic programming (entries include Koza,49 Winkeler et al.39 and Poli67).
the detection task, so we believe that the genetic programs generated by the first
phase are likely to be very good starting points for the second phase, allowing the
more expensive evolutionary process to concentrate its effort in the more optimal
part of the search space.
Since the number of possible programs increases exponentially with the size
of the programs, the difficulty of finding an optimal program also increases with
the size of the programs. In the second phase, we added a program size compo-
nent to the fitness function to bias the search towards simpler functions, which we
expected would increase both the efficiency and the effectiveness of the evolution-
ary search. It will also have a tendency to remove redundancy (since a program
with redundancy will be less fit than an equivalent program with the redundancy
removed), making the programs more comprehensible.
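The control flow of the two phases can be sketched as follows; this is only a schematic outline, in which the population, the two fitness functions and the mutation operator are supplied by the surrounding GP system, and the toy evolve loop merely stands in for the real evolutionary engine (all names are placeholders, not the actual implementation).

import random

def evolve(population, fitness, generations, mutate):
    # Toy selection-and-mutation loop standing in for the real GP engine.
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: len(population) // 2]          # truncation selection
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return population

def two_phase_gp(initial_population, cutout_fitness, detection_fitness, mutate):
    # Phase 1: cheap fitness -- classification accuracy on the object cutouts.
    phase1 = evolve(initial_population, cutout_fitness, generations=30, mutate=mutate)
    # Phase 2: initialised with the phase-1 programs; fitness is measured on the
    # full training images and includes the program size component.
    phase2 = evolve(phase1, detection_fitness, generations=30, mutate=mutate)
    return max(phase2, key=detection_fitness)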
(Fig. 24.1: overview diagram of the approach. Initial random genetic programs and object cutouts feed the GP learning/evolutionary process, with feature terminals; the trained genetic programs and the large images of the training set enter Phase 2: GP Refinement (GP-refine); the evolved genetic programs are then applied, with the feature terminals, to the full images of the test set for object detection (GP testing), giving the final results.)
Fig. 24.2. The second phase of GP training (GP-refine) and the GP testing procedure.
Fig. 24.3. Local square (S1–S4) and circular (C1–C3) features as terminals.
Fig. 24.4. Sample object detection maps. (a) Original image; (b) Detection map produced by Pro-
gram 1; (c) Detection map produced by Program 2.
(1) Apply the program as a moving n×n window template (n is the size of the
input image window) to each of the training images and obtain the output
value of the program at each possible window position, as shown in Figure
24.2. Label each window position with the ‘detected’ object according to the
object classification strategy. Call this data structure a detection map.
(2) Find the centres of the objects of interest by the following clustering process
(a code sketch is given after this list):
• Scan the detection map from the upper-left corner “pixel by pixel” for de-
tected objects of interest (those “pixels” marked as the “background” class
are skipped). When an object of a class of interest is encountered at a particular
location, mark that location as the centre of the object and skip the pixels
in the n/2 × n/2 square to the right of and below this location. In this
way, all the locations (“pixels”) considered “detected objects” by the ge-
netic program within that n/2 × n/2 square are “clustered” as a single object.
The square size n/2 × n/2 was chosen as half of the sweeping window size
in order not to miss any detected object. This process continues, moving right
and down, until all the locations in the detection map have been scanned or
skipped. The locations marked by this process are considered the centres of
the objects of the classes of interest detected by the genetic program.
(3) Match these detected objects with the known locations of each of the de-
sired/target objects and their classes. Here, we allow location error of
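A minimal sketch of the clustering in step (2) is given below; it assumes the detection map is available as a 2-D array of class labels, and is an illustration only, not the chapter's actual implementation.

import numpy as np

def cluster_detection_map(det_map, background, n):
    # Scan the detection map from the upper-left corner; each time a non-background
    # label is met, record it as an object centre and suppress the n/2 x n/2 square
    # to the right of and below it, so that nearby detections are merged.
    half = max(n // 2, 1)
    suppressed = np.zeros(det_map.shape, dtype=bool)
    centres = []
    rows, cols = det_map.shape
    for r in range(rows):
        for c in range(cols):
            if suppressed[r, c] or det_map[r, c] == background:
                continue
            centres.append((r, c, det_map[r, c]))       # centre position and detected class
            suppressed[r:r + half, c:c + half] = True
    return centres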
We expect that the new fitness function can reflect both small and large im-
provements in genetic programs and can bias the search towards simpler functions.
We also expected this to increase both the efficiency and the effectiveness of
the evolutionary search. It also has a tendency to reduce redundancy, making
the programs more comprehensible.
Notice that adding the program size constraint to the fitness function is a kind
of parsimony pressure technique.71–73 Early work on this issue resulted in diverse
opinions: some researchers think that using parsimony pressure can improve per-
formance,72 while others think it can lead to premature convergence.73
Although our approach is different from the early work, it might still face a risk of
early convergence. Therefore, we used a very small weight (K4 ) for the program
size in our fitness function relative to K1 and K2 (see table 24.2).
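The fitness function itself is not reproduced here; the following sketch only illustrates the overall structure described above, with a small program-size weight relative to the detection terms. All names and default weight values are placeholders, not the values of table 24.2.

def fitness_cost(detection_rate, false_alarm_rate, false_alarm_area, program_size,
                 k1=5000.0, k2=100.0, k3=10.0, k4=1.0):
    # Lower is better: penalise missed detections and false alarms heavily,
    # and apply only a gentle parsimony pressure on program size (small k4).
    return (k1 * (1.0 - detection_rate)
            + k2 * false_alarm_rate
            + k3 * false_alarm_area
            + k4 * program_size)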
Table 24.2. Parameters used for GP training for the three databases.
detection rate and no false alarms), or there is no increase in the fitness for 10
generations, at which point the evolution is terminated early.
We used three data sets in the experiments. Example images are given in
figure 24.5. These data sets provide object detection problems of increasing dif-
ficulty. Data set 1 (Shape) was generated to give well defined objects against a
uniform background. The pixels of the objects were generated using a Gaussian
generator with different means and variances for different classes. There are two
classes of small objects of interest in this database: circles and squares. Data set
The detection results of the two phase GP approach for the three image data
sets are shown in table 24.3. These results are compared with the basic GP ap-
proach15,74 and a neural network approach75,76 using the same set of features. The
basic GP approach is similar to the new GP approach described in this paper, ex-
cept that it uses the old fitness function without considering the program size and
false alarm areas (equation 24.2) and that genetic programs are learned from the
full training images directly, which is a single stage approach.15,74 In the neu-
ral network approach,75,76 a three layered feed forward neural network is trained
by the back propagation algorithm77 without momentum using an online learn-
ing scheme and fan-in factors.78 For all three approaches, the experiments
are repeated 50 times and the average results on the test set are presented in this
section.
As can be seen from table 24.3, all three approaches achieved ideal results
for the shape and the Coins data sets, reflecting the fact that the detection problems
in the two data sets are relatively easy and that the two terminal sets are appropriate
for the two data sets (note that other terminal sets did not achieve ideal results,74
but this is beyond the scope of this paper). For the difficult Heads/tails data set,
none of the three methods resulted in ideal performance. However, the two phase
GP approach described in this paper achieved the best performance.
Notice also that both GP approaches achieved better results than the neural
network approach on this data set using the same set of features. However, this
might be partially because the features used here carried intrinsic bias towards the
neural network approach and/or partially because the neural networks were not
tuned, pruned or optimised.30,69 While further discussion here on this topic is be-
yond the goal of this paper, we are interested in carrying out further investigation
in the future.
Although both of the GP approaches achieved better results than the neural net-
works overall, the times spent on the training/refining process are quite different.
For the Coins data set, for example, the basic GP approach used 17 hours on av-
erage to find a good genetic program, whereas the two phase GP approach used
only 11 hours on average. For the Heads/tails data set, the two phase GP approach
found good programs after 23 hours on average (of which the first phase took
only two to three minutes). The basic GP approach, on the other hand, took an
average of 45 hours. The first phase is so fast because the size of the training data
set is small, and the task of discriminating the classes of objects (when centered
in the input window) is quite simple. However, the programs it finds appear to be
very good starting points for the more expensive second phase, which enables the
evolution in the second phase to concentrate its search in a much more promising
part of the search space.
In addition, the sizes of the programs (the number of terminals plus the number
of functions in a program) evolved by the two phase GP approach were also found
to be shorter than those evolved by the basic GP approach. For the Coins data set,
for example, the program size in the two phase GP approach averages 56 nodes, in
contrast to 107 nodes for the basic GP approach. Both the good initial programs
and the bias towards smaller programs would contribute to this result; we will
investigate which of the factors is the most important for object detection in the
future.
To check the effectiveness of the new fitness function at improving the compre-
hensibility of the programs, an evolved genetic program in the shape data set is
shown below:
where Fiμ and Fiσ are the mean and standard deviation of region i (see figure 24.3,
left) of the window, respectively, and T is a predefined threshold. This program
can be translated into the following rule:
if (F4μ > T) then
value = F3μ ;
else
value = (F4μ - F2μ ) * F1σ ;
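Written as a small function (a direct transcription of the rule above, with the feature values assumed to be extracted elsewhere from the window regions of figure 24.3), the detector reads:

def evolved_detector(f1_sigma, f2_mu, f3_mu, f4_mu, t=100.0):
    # T = 100 in this example; the output value is mapped to a class by the
    # object classification strategy described in the text.
    if f4_mu > t:
        return f3_mu
    return (f4_mu - f2_mu) * f1_sigma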
If the sweeping window is over the background only, F4μ will be smaller
than the threshold (100 here), and the program will execute the “else” part. Since
F4μ is equal to F2μ in this case, the program output will be zero. According to the
classification strategy — the object classification map — this case will be correctly
classified as background. If the input window contains a portion of an object of
interest and some background, F4μ will be smaller than F2μ , which results in
a negative program output, again corresponding to the background class. If F4μ is
greater than the threshold T, then the input window must contain an object of
interest, either of class1 or of class2, depending on the value of F3μ .
While this program detector can be relatively easily interpreted and under-
stood, the programs obtained using the old fitness function are generally hard
to interpret due to the length of the programs and the redundancy. By carefully
designing the fitness function to constrain the program size, the evolved genetic
programs appear to be more comprehensible.
24.6. Conclusions
Rather than investigating an application of GP for object detection, the goal of this
paper has been to improve GP techniques for object detection.
The goal has been successfully achieved by developing a two phase GP approach
and a new fitness function with constraints on program size. We investigated the
effectiveness and efficiency of the two phase GP approach and the comprehensi-
bility of genetic programs evolved using the new fitness function. The approach
was tested on three object detection problems of increasing difficulty and achieved
good results.
We developed a two phase approach to object detection using genetic pro-
gramming. Our results suggest that the two phase approach is more effective and
more efficient than the basic GP approach. The new GP approach also achieved
better detection accuracy than a neural network approach on the second coin data
set using the same set of features. While a detailed comparison between the two
approaches is beyond the goal of this paper, we are interested in doing further
investigation in the future.
We modified the fitness function by including a measure of program size. This
resulted in genetic program detectors that were better quality and more compre-
hensible. It also reduced the search computation time.
Although this approach considerably shortens the training times, the training
process is still relatively long. We intend to explore better classification strategies
and to add more genetic-beam-search heuristics to the evolutionary process.
While the programs evolved by the two phase GP approach with the new fit-
ness function are considerably shorter than the basic GP approach, they usually
still contain some redundancy. Although we suspect that this redundancy reduces
the efficiency and the effectiveness of the evolutionary search, it is also possible
that redundancy plays an important role in the search. We are experimenting with
simplification of the programs during the evolutionary process to remove the re-
dundancy, and will explore whether it reduces training time and improves
program quality.
This paper was focused on improving GP techniques rather than investigating
applications of GP on object detection. However, it would be interesting to test
the new GP approach developed in this paper on some more difficult, real world
object detection tasks such as those in the Caltech 101 data set and the retina data
set in the future.
Acknowledgements
This work was supported in part by the Marsden Fund of the Royal Society of
New Zealand (05-VUW-017) and the University Research Fund (7/39) at Victoria
University of Wellington.
References
automatic target detection within SAR imagery. In Proceedings of the 2000 Congress
on Evolutionary Computation CEC00, pp. 1543–1549, La Jolla Marriott Hotel La
Jolla, California, USA (6-9 July, 2000). IEEE Press. ISBN 0-7803-6375-2.
12. C. T. M. Graae, P. Nordin, and M. Nordahl. Stereoscopic vision for a humanoid robot
using genetic programming. In eds. S. Cagnoni, R. Poli, G. D. Smith, D. Corne,
M. Oates, E. Hart, P. L. Lanzi, E. J. Willem, Y. Li, B. Paechter, and T. C. Fogarty,
Real-World Applications of Evolutionary Computing, vol. 1803, LNCS, pp. 12–21,
Edinburgh (17 Apr., 2000). Springer-Verlag. ISBN 3-540-67353-9.
13. D. Howard, S. C. Roberts, and C. Ryan. The boru data crawler for object detection
tasks in machine vision. In eds. S. Cagnoni, J. Gottlieb, E. Hart, M. Middendorf,
and G. Raidl, Applications of Evolutionary Computing, Proceedings of EvoWork-
shops2002: EvoCOP, EvoIASP, EvoSTim, vol. 2279, LNCS, pp. 220–230, Kinsale,
Ireland (3-4 Apr., 2002). Springer-Verlag.
14. F. Lindblad, P. Nordin, and K. Wolff. Evolving 3d model interpretation of images
using graphics hardware. In Proceedings of the 2002 IEEE Congress on Evolutionary
Computation, CEC2002, Honolulu, Hawaii, (2002).
15. M. Zhang and V. Ciesielski. Genetic programming for multiple class object detection.
In ed. N. Foo, Proceedings of the 12th Australian Joint Conference on Artificial Intel-
ligence (AI’99), pp. 180–192, Sydney, Australia (December, 1999). Springer-Verlag
Berlin Heidelberg. Lecture Notes in Artificial Intelligence (LNAI Volume 1747).
16. M. Zhang, P. Andreae, and M. Pritchard. Pixel statistics and false alarm area in ge-
netic programming for object detection. In ed. S. Cagnoni, Applications of Evolution-
ary Computing, Lecture Notes in Computer Science, LNCS Vol. 2611, pp. 455–466.
Springer-Verlag, (2003).
17. T. Caelli and W. F. Bischof, Machine Learning and Image Interpretation. (Plenum
Press, New York and London, 1997). ISBN 0-306-45761-X.
18. E. Gose, R. Johnsonbaugh, and S. Jost, Pattern Recognition and Image Analysis.
(Prentice Hall PTR, Upper Saddle River, NJ 07458, 1996). ISBN 0-13-236415-8.
19. A. Howard, C. Padgett, and C. C. Liebe. A multi-stage neural network for auto-
matic target detection. In 1998 IEEE World Congress on Computational Intelligence
– IJCNN’98, pp. 231–236, Anchorage, Alaska, (1998). 0-7803-4859-1/98.
20. Y. C. Wong and M. K. Sundareshan. Data fusion and tracking of complex target
maneuvers with a simplex-trained neural network-based architecture. In 1998 IEEE
World Congress on Computational Intelligence – IJCNN’98, pp. 1024–1029, Anchor-
age, Alaska (May, 1998). 0-7803-4859-1/98.
21. D. Valentin, H. Abdi, and O’Toole, Categorization and identification of human face
images by neural networks: A review of linear auto-associator and principal compo-
nent approaches, Journal of Biological Systems. 2(3), 413–429, (1994).
22. A. Teller and M. Veloso. A controlled experiment : Evolution for learning difficult
image classification. In eds. C. Pinto-Ferreira and N. J. Mamede, Proceedings of the
7th Portuguese Conference on Artificial Intelligence, vol. 990, LNAI, pp. 165–176,
Berlin (3–6 Oct., 1995). Springer Verlag. ISBN 3-540-60428-6.
23. P. Winter, W. Yang, S. Sokhansanj, and H. Wood. Discrimination of hard-to-pop pop-
corn kernels by machine vision and neural network. In ASAE/CSAE meeting, Saska-
toon, Canada (Sept., 1996). Paper No. MANSASK 96-107.
24. D. Andre. Automatically defined features: The simultaneous evolution of 2-
dimensional feature detectors and an algorithm for using them. In ed. K. E. Kinnear,
Advances in Genetic Programming, pp. 477–494. MIT Press, (1994).
25. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to
document recognition. In Intelligent Signal Processing, pp. 306–351. IEEE Press,
(2001).
26. B. Verma. A neural network based technique to locate and classify microcalcifications
in digital mammograms. In 1998 IEEE World Congress on Computational Intelligence
– IJCNN’98, pp. 1790–1793, Anchorage, Alaska, (1998). 0-7803-4859-1/98, IEEE.
27. Y. LeCun, L. D. Jackel, B. Boser, J. S. Denker, H. P. Graf, I. Guyon, D. Henderson,
R. E. Howard, and W. Hibbard, Handwritten digit recognition: application of neural
network chips and automatic learning, IEEE Communications Magazine. pp. 41–46
(November, 1989).
28. D. de Ridder, A. Hoekstra, and R. P. W. Duin. Feature extraction in shared weights
neural networks. In Proceedings of the Second Annual Conference of the Advanced
School for Computing and imaging, ASCI, pp. 289–294, Delft (June, 1996).
29. A. Song, T. Loveard, and V. Ciesielski. Towards genetic programming for texture
classification. In Proceedings of the 14th Australian Joint Conference on Artificial
Intelligence, pp. 461–472. Springer Verlag, (2001).
30. R. D. Reed and R. J. M. II, Neural Smithing: Supervised Learning in Feedforward
Artificial Neural Networks. (Cambridge, MA: The MIT Press, 1999). ISBN 0-262-
18190-8.
31. M. R. Azimi-Sadjadi, D. Yao, Q. Huang, and G. J. Dobeck, Underwater target clas-
sification using wavelet packets and neural networks, IEEE Transactions on Neural
Networks. 11(3), 784–794 (May, 2000).
32. C. Stahl, D. Aerospace, and P. Schoppmann. Advanced automatic target recognition
for police helicopter missions. In ed. F. A. Sadjadi, Proceedings of SPIE Volume 4050,
Automatic Target Recognition X (April, 2000). [4050-30].
33. T. Wessels and C. W. Omlin. A hybrid system for signature verification. In Proceed-
ings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks
(IJCNN’00), Volume V, Como, Italy (July, 2000).
34. J. Bala, K. D. Jong, J. Huang, H. Vafaie, and H. Wechsler, Using learning to facilitate
the evolution of features for recognising visual concepts, Evolutionary Computation.
4(3), 297–312, (1997).
35. J.-S. Huang and H.-C. Liu, Object recognition using genetic algorithms with a Hop-
field’s neural model, Expert Systems with Applications. 13(3), 191–199, (1997).
36. S. Russell and P. Norvig, Artificial Intelligence, A modern Approach. (Prentice Hall,
2003), 2nd edition.
37. M. H. Dunham, Data Mining: Introductory and Advanced Topics. (Prentice Hall,
2003).
38. D. Howard, S. C. Roberts, and R. Brankin, Target detection in SAR imagery by genetic
programming, Advances in Engineering Software. 30, 303–311, (1999).
39. J. F. Winkeler and B. S. Manjunath. Genetic programming for object detection. In eds.
J. R. Koza, K. Deb, M. Dorigo, D. B. Fogel, M. Garzon, H. Iba, and R. L. Riolo,
Genetic Programming 1997: Proceedings of the Second Annual Conference, pp. 330–
335, Stanford University, CA, USA (13-16 July, 1997). Morgan Kaufmann.
40. P. G. Korning, Training neural networks by means of genetic algorithms working
53. G. Robinson and P. McIlroy. Exploring some commercial applications of genetic pro-
gramming. In ed. T. C. Fogarty, Evolutionary Computation, Volume 993, Lecture Note
in Computer Science. Springer-Verlag, (1995).
54. D. Howard, S. C. Roberts, and C. Ryan. Evolution of an object detection ant for image
analysis. In ed. E. D. Goodman, 2001 Genetic and Evolutionary Computation Confer-
ence Late Breaking Papers, pp. 168–175, San Francisco, California, USA (9-11 July,
2001).
55. P. Nordin and W. Banzhaf. Programmatic compression of images and sound. In eds.
J. R. Koza, D. E. Goldberg, D. B. Fogel, and R. L. Riolo, Genetic Programming 1996:
Proceedings of the First Annual Conference, pp. 345–350, Stanford University, CA,
USA, (1996). MIT Press.
56. R. Poli. Genetic programming for feature detection and image segmentation. In ed.
T. C. Fogarty, Evolutionary Computing, number 1143 in Lecture Notes in Computer
Science, pp. 110–125. Springer-Verlag, University of Sussex, UK (1-2 Apr., 1996).
ISBN 3-540-61749-3.
57. V. Ciesielski, A. Innes, S. John, and J. Mamutil. Understanding evolved genetic pro-
grams for a real world object detection problem. In eds. M. Keijzer, A. Tettamanzi,
P. Collet, J. I. van Hemert, and M. Tomassini, Proceedings of the 8th European Con-
ference on Genetic Programming, vol. 3447, Lecture Notes in Computer Science, pp.
351–360, Lausanne, Switzerland (30 Mar. - 1 Apr., 2005). Springer. ISBN 3-540-
25436-6.
58. S. Isaka. An empirical study of facial image feature extraction by genetic program-
ming. In ed. J. R. Koza, the Genetic Programming 1997 Conference, pp. 93–99. Stan-
ford Bookstore, Stanford University, CA, USA (July, 1997). Late Breaking Papers.
59. S. A. Stanhope and J. M. Daida. Genetic programming for automatic target classifica-
tion and recognition in synthetic aperture radar imagery. In eds. V. W. Porto, N. Sara-
vanan, D. Waagen, and A. E. Eiben, Evolutionary Programming VII: Proceedings of
the Seventh Annual Conference on Evolutionary Programming, vol. 1447, LNCS, pp.
735–744, Mission Valley Marriott, San Diego, California, USA (25-27 Mar., 1998).
Springer-Verlag. ISBN 3-540-64891-7.
60. A. Song. Texture Classification: A Genetic Programming Approach. PhD thesis, De-
partment of Computer Science, RMIT University, Melbourne, Australia, (2003).
61. A. Song and V. Ciesielski. Texture analysis by genetic programming. In Proceedings
of the 2004 IEEE Congress on Evolutionary Computation, pp. 2092–2099, Portland,
Oregon (20-23 June, 2004). IEEE Press. ISBN 0-7803-8515-2.
62. A. Song and V. Ciesielski. Fast texture segmentation using genetic programming.
In eds. R. Sarker, R. Reynolds, H. Abbass, K. C. Tan, B. McKay, D. Essam,
and T. Gedeon, Proceedings of the 2003 Congress on Evolutionary Computation
CEC2003, pp. 2126–2133, Canberra (8-12 Dec., 2003). IEEE Press. ISBN 0-7803-
7804-0.
63. T. Loveard. Genetic Programming for Classification Learning Problems. PhD thesis,
RMIT University, School of Computer Science and Information Technology, (2003).
64. M. Zhang and W. Smart. Multiclass object classification using genetic programming.
In eds. G. R. Raidl, S. Cagnoni, J. Branke, D. W. Corne, R. Drechsler, Y. Jin, C. John-
son, P. Machado, E. Marchiori, F. Rothlauf, G. D. Smith, and G. Squillero, Applica-
tions of Evolutionary Computing, EvoWorkshops2004: EvoBIO, EvoCOMNET, Evo-
HOT, EvoIASP, EvoMUSART, EvoSTOC, vol. 3005, LNCS, pp. 367–376, Coimbra,
Portugal (5-7 Apr., 2004). Springer Verlag.
65. B. J. Lucier, S. Mamillapalli, and J. Palsberg. Program optimisation for faster genetic
programming. In Genetic Programming – GP’98, pp. 202–207, Madison, Wisconsin
(July, 1998).
66. J. R. Koza. Simultaneous discovery of reusable detectors and subroutines using genetic
programming. In ed. S. Forrest, Proceedings of the 5th International Conference on
Genetic Algorithms, ICGA-93, pp. 295–302, Morgan Kauffman, (1993).
67. R. Poli. Genetic programming for image analysis. In eds. J. R. Koza, D. E. Goldberg,
and D. B. Fogel, and R. L. Riolo, Genetic Programming 1996: Proceedings of the First
Annual Conference, pp. 363–368, Stanford University, CA, USA (28–31, 1996). MIT
Press.
68. D. G. Lowe, Distinctive image features from scale-invariant keypoints., International
Journal of Computer Vision. 60(2), 91–110, (2004).
69. A. Gepperth and S. Roth, Applications of multi-objective structure optimization., Neu-
rocomputing. 69(7-9), 701–713, (2006).
70. M. E. Roberts and E. Claridge. Cooperative coevolution of image feature construc-
tion and object detection. In eds. X. Yao, E. Burke, J. A. Lozano, J. Smith, J. J.
Merelo-Guervós, J. A. Bullinaria, J. Rowe, P. T. A. Kabán, and H.-P. Schwefel,
Parallel Problem Solving from Nature - PPSN VIII, vol. 3242, LNCS, pp. 902–911,
Birmingham, UK (18-22 Sept., 2004). Springer-Verlag. ISBN 3-540-23092-0. doi:
doi:10.1007/b100601. URL http://www.cs.bham.ac.uk/˜mer/papers/
ppsn-2004.pdf.
71. P. W. H. Smith. Controlling code growth in genetic programming. In eds. R. John and
R. Birkenhead, Advances in Soft Computing, pp. 166–171, De Montfort University,
Leicester, UK, (2000). Physica-Verlag. ISBN 3-7908-1257-9. URL http://www.
springer-ny.com/detail.tpl?ISBN=3790812579.
72. R. Dallaway. Genetic programming and cognitive models. Technical Report CSRP
300, School of Cognitive & Computing Sciences, University of Sussex, Brighton, UK,
(1993). In: Brook & Arvanitis, eds., 1993 The Sixth White House Papers: Graduate
Research in the Cognitive & Computing Sciences at Sussex.
73. T. Soule and J. A. Foster, Effects of code growth and parsimony pressure on popu-
lations in genetic programming, Evolutionary Computation. 6(4), 293–309 (Winter,
1998).
74. U. Bhowan. A domain independent approach to multi-class object detection using
genetic programming. Master’s thesis, BSc Honours research project/thesis, School of
Mathematical and Computing Sciences, Victoria University of Wellington, (2003).
75. M. Zhang and V. Ciesielski. Using back propagation algorithm and genetic algo-
rithm to train and refine neural networks for object detection. In eds. T. Bench-
Capon, G. Soda, and A. M. Tjoa, Proceedings of the 10th International Conference on
Database and Expert Systems Applications (DEXA’99), pp. 626–635, Florence, Italy
(August, 1999). Springer-Verlag. Lecture Notes in Computer Science, (LNCS Volume
1677).
76. B. Ny. Multi-class object classification and detection using neural networks. Master’s
thesis, BSc Honours research project/thesis, School of Mathematical and Computing
Sciences, Victoria University of Wellington, (2003).
CHAPTER 25
25.1. Introduction
chitectural scenes and general objects is that the former contains easily detectable
man-made features such as parallel lines, orthogonal lines, corners, etc. These
features are important cues for finding the 3D structure of a building. Most re-
search on architectural scene reconstruction in the photogrammetry community
has concentrated on 3D reconstruction from aerial images.2,8 Due to long-range
photography, aerial images are usually modeled as orthographic projection. Al-
though the orthographic projection model is easier for aerial images, one major
drawback is that most of the 3D reconstruction of architectural scenes can only
be done on the roofs of the buildings. On the other hand, the perspective pro-
jection model is usually needed for close-range photography, which is capable of
reconstructing the complete (360 degrees) 3D model of an architectural scene.
3D models of architectural scenes have important application areas such as
virtual reality (VR) and augmented reality (AR). Both applications require photo-
realistic 3D models as input. A photo-realistic model of a building consists not
only of the 3D shape of the building (geometric information) but also of the image tex-
ture on the outer visible surface of the building (photometric information). The
geometric and photometric information can be acquired either by range data and
intensity images, or by the intensity images recorded by a camera. Allen et al1,9
created 3D models of historic sites using both range and image data. They first
built the 3D models from range data using a volumetric set intersection method.
The photometric information was then mapped onto those models by registering
features from both the 3D and 2D data sets. To accurately register the range and
intensity data, and reduce the overall complexity of the models, they developed
range data segmentation algorithms to identify planar regions and determine lin-
ear features from planar intersections. Dick et al5 recovered 3D models from
uncalibrated images of architectural scenes. They proposed a method which ex-
ploited the rigidity constraints usually seen in the indoor and outdoor architectural
scenes such as parallelism and orthogonality. These constraints were then used to
calibrate the intrinsic and extrinsic parameters of the cameras through projection
matrix using vanishing points.3 The Euclidean models of the scene were recon-
structed from two images from arbitrary viewpoints.
In this work, we develop a system for 3D model reconstruction of architectural
scenes from one or more uncalibrated images. The input images can be taken from
off-the-shelf digital cameras and the camera parameters for 3D reconstruction are
estimated from the structure of the architectural scene. The feature points (such as
corners) in the images are selected by a user interactively through a graphical user
interface. The selected image points are then refined automatically using Hough
transform to obtain more accurate positions in subpixel resolution. For a given set
of corner points, various constraints such as parallelism, orthogonality, coplanarity
The most commonly used camera model is the pinhole camera model. In this
model the projection from a point (Xi , Yi , Zi ) in Euclidean 3-space to a point
(xi , yi ) in the image plane can be represented in homogeneous coordinates by
s \begin{bmatrix} x_i \\ y_i \\ 1 \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \begin{bmatrix} X_i \\ Y_i \\ Z_i \\ 1 \end{bmatrix}    (25.1)
where s is an arbitrary scale factor, and the 3 × 4 matrix
M = \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix}    (25.2)
is the perspective projection matrix of the camera. The perspective projection
matrix can be further decomposed into the intrinsic camera parameters and the
relative pose of the camera:
M = K[R t] (25.3)
The 3 × 3 matrix R and 3 × 1 vector t are the relative orientation and translation
with respect to the world coordinate system, respectively. The intrinsic parameter
matrix K of the camera is a 3 × 3 matrix and usually modeled as
K = \begin{bmatrix} f_x & \gamma & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}    (25.4)
where (u0 , v0 ) is the principal point (the intersection of the optical axis with the
image plane), γ is a skew parameter related to the characteristic of the CCD array,
and fx and fy are scale factors. Thus, Eq. (25.1) can be rewritten as
sp = K[R t]P (25.5)
where P is a 3D point and p is the corresponding image point (in homogeneous
coordinates).
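For illustration, the projection of Eqs. (25.1)-(25.5) can be coded directly; the intrinsic parameters below are arbitrary example values, not calibration results from the chapter.

import numpy as np

def project(P, K, R, t):
    # s p = K [R | t] P  (Eq. 25.5); the scale s is removed by the final division.
    P_h = np.append(np.asarray(P, dtype=float), 1.0)     # homogeneous 3-D point
    M = K @ np.hstack([R, t.reshape(3, 1)])              # 3x4 perspective projection matrix
    p = M @ P_h
    return p[:2] / p[2]                                   # image point (x_i, y_i)

# Example intrinsic matrix: fx = fy = 800, principal point (320, 240), zero skew.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
print(project([0.5, 0.2, 4.0], K, np.eye(3), np.zeros(3)))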
As shown in,4 for a given parallelepiped in 3D space, if the three angles be-
tween its adjacent edges, θab , θbc , θca , and the image points of the six points of
its two adjacent faces are available, then the pose of the parallelepiped, intrinsic
parameters of the camera and the size of the parallelepiped can be determined by
solving polynomial equations of at most fourth degree. For a special case that θab ,
θbc and θca are right angles, the equation can be further simplified to a linear sys-
tem. Thus, the camera parameters can be found if the corner points of a building
are identified. Furthermore, the focal length of the camera can be estimated and
used for 3D model reconstruction in the next section.
P0 ∼ s0 p0 (25.10)
P1 ∼ s1 p1 (25.11)
\frac{P_1}{P_0} \approx \frac{s_1 p_1}{s_0 p_0} = \frac{s_1 [u_1 \; v_1 \; 1]^T}{s_0 [u_0 \; v_0 \; 1]^T}    (25.12)
In the above equation, the z direction information is lost in the image coordinate
system. To recover the 3D shape of the object, the z coordinate of P0 is fixed as
the focal length of the camera (in pixels), i.e., z = f or P0 = [u0 v0 f ]T . Now, if
P0 is used as a reference point as shown in Fig. 25.1, then we have
P_i = \frac{s_i}{s_0} \begin{bmatrix} u_i \\ v_i \\ f \end{bmatrix}    (25.13)
for i = 1, 2, 4.
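A direct reading of Eq. (25.13) in code form is shown below; it assumes the scale ratios s_i/s_0 have already been determined from the parallelism/orthogonality constraints, which is the part not covered by this sketch.

import numpy as np

def recover_points(image_points, scale_ratios, f):
    # Eq. (25.13): P_i = (s_i / s_0) [u_i, v_i, f]^T, with the reference point P_0
    # placed at depth z = f (the focal length in pixels).
    return [r * np.array([u, v, f], dtype=float)
            for (u, v), r in zip(image_points, scale_ratios)]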
The goal of model registration is to combine two or more partial 3D models ac-
quired from different viewpoints into a complete 3D model. Usually the registra-
tion or pose estimation involves finding the rotation matrix and translation vector
for the transformation between two different coordinate systems. For any given
two partial 3D models of an object, the overlapping parts are used to identify the
corresponding 3D points for the two models. The corresponding 3D points are
then used to find the rotation matrix and translation vector.
Suppose there are two sets of 3D points to be registered. More precisely,
we want to find the rotation matrix and translation vector for the data sets
{x1 , x2 , ..., xn } and {y1 , y2 , ..., yn }, where xi and yi are the corresponding
points, for i = 1, ..., n. The relationship between xi and yi can then be writ-
ten as
yi = Rxi + t (25.14)
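The chapter does not detail how R and t are computed from the correspondences; a standard least-squares solution based on the singular value decomposition (not necessarily the method used by the authors) is sketched below.

import numpy as np

def rigid_registration(X, Y):
    # Find R, t minimising sum_i ||y_i - (R x_i + t)||^2 for Eq. (25.14).
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    cx, cy = X.mean(axis=0), Y.mean(axis=0)                    # centroids
    H = (X - cx).T @ (Y - cy)                                  # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))                     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cy - R @ cx
    return R, t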
Commonly used equidistance constraints include the same length of the left and right sides of
a triangle, and the same length of the diagonals of a trapezoid.
The described algorithms are tested on a number of objects in the indoor envi-
ronment and on outdoor architectural scenes. As shown in Fig. 25.2, a graphical user
interface is developed to assist users in selecting approximate corner points interac-
tively for 3D model reconstruction. The first experiment is the 3D reconstruction
of a building. Fig. 25.2 shows the three images taken from different viewpoints.
The images are used to create the partial 3D shapes of the object individually. For
each image, the corner points are selected manually by a user and the positions are
automatically refined by Hough transform. The reconstructed partial 3D models
with texture information using the acquired images are shown in Fig. 25.3. The
complete 3D model after registration and integration is shown in Fig. 25.4. Al-
though plane fitting has been done on the object surface, rendering with a triangular
mesh still causes some visual distortions.
For the structures of the objects containing non-rectangular surface patches,
camera parameters and the “base-models” are first obtained from the lower part of
the objects. Additional image points associated with the upper part of the object
are then added with coplanarity and equidistance constraints. Fig. 25.6 shows
the reconstructed partial 3D models from the corresponding single input images
shown in Fig. 25.5. We have tested the proposed 3D model reconstruction ap-
proach on six outdoor architectural scenes and four indoor objects. The small
scale objects in the laboratory environment usually give better reconstruction re-
sults mainly because of the controlled illumination conditions and the larger focal
length used for image acquisition. For the outdoor building reconstruction, care-
ful selections of the initial corner points are mandatory since the images might
contain more complicate background scenes. Furthermore, the lens distortion has
to be modeled for even close-range photography with short focal length. Since the
evaluation of the final reconstruction result is usually based on the texture infor-
mation of the object, novel views are best synthesized on the viewpoints closer to
the original acquired images.
Acknowledgments
The support of this work in part by the National Science Council of Taiwan,
R.O.C. under Grant NSC-93-2218-E-194-024 is gratefully acknowledged.
References
CHAPTER 26
A method is proposed for the construction of descent directions for the minimiza-
tion of energy functionals defined for plane curves. The method is potentially
useful in a number of image analysis problems, such as image registration and
shape warping, where the standard gradient descent curve evolutions are not al-
ways feasible. The descent direction is constructed by taking a weighted average
of the three components of the gradient corresponding to translation, rotation,
and deformation. Our approach differs from previous work in the field by the use
of implicit representation of curves and the notion of normal velocity of a curve
evolution. Thus our theory is morphological and well suited for implementation
in the level set framework.
26.1. Introduction
A simple closed curve Γ can be represented as the zero level set of a function
φ : R2 → R as
Γ = {x ∈ R2 ; φ(x) = 0} . (26.1)
The sets Ωint = {x ; φ(x) < 0} and Ωext = {x ; φ(x) > 0} are called the interior
and the exterior of Γ, respectively. Geometric quantities such as the outward unit
normal n and the curvature κ can be expressed in terms of φ as
n = ∇φ/|∇φ|   and   κ = ∇ · ( ∇φ/|∇φ| ) .    (26.2)
The function φ is usually called the level set function for Γ, cf. e.g.7
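On a discrete grid, (26.2) can be evaluated with finite differences; the following is a minimal sketch (central differences via numpy.gradient, with a small eps to avoid division by zero), not a full level set implementation.

import numpy as np

def normal_and_curvature(phi, h=1.0, eps=1e-12):
    # n = grad(phi)/|grad(phi)|,  kappa = div( grad(phi)/|grad(phi)| )   (Eq. 26.2)
    phi_y, phi_x = np.gradient(phi, h)            # derivatives along rows (y) and columns (x)
    norm = np.sqrt(phi_x**2 + phi_y**2) + eps
    nx, ny = phi_x / norm, phi_y / norm           # components of the unit normal
    dny_dy, _ = np.gradient(ny, h)
    _, dnx_dx = np.gradient(nx, h)
    return nx, ny, dnx_dx + dny_dy                # normal field and curvature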
A curve evolution, that is, a time dependent curve t → Γ(t), can be repre-
sented by a time dependent level set function φ : R2 × R → R as Γ(t) = {x ∈
R2 ; φ(x, t) = 0}. Let us consider the kinematics of curve evolutions. In the
implicit representation, it does not make sense to “track” points on an evolving
curve, as there is no way of knowing the tangential motion of points on Γ(t). The
important notion is instead that of normal velocity. The normal velocity of a curve
where v, w are normal velocities and dσ is the curve length element. In the fol-
lowing we therefore denote the linear space of normal velocities at Γ by L2 (Γ).
The scalar product (26.4) is important in the construction of gradient descent
flows for functionals E(Γ) defined on a “manifold” M of admissible curves Γ.
Let the Gâteaux derivative of E(Γ) at Γ be denoted by dE(Γ)v, for any normal
velocity v, and suppose that there exists a vector ∇E(Γ) ∈ L2 (Γ) such that
Let us mention that in3 the kinematic entity corresponding to our normal velocity v in (26.3) is a vector valued function v : Γ → R², given by v = vn. Consequently the L²-scalar product used there is defined, via the Euclidean scalar product in R², as (v, w)Γ = ∫Γ vT w dσ. While ⟨v, w⟩Γ = (v, w)Γ for any pair of normal velocities, the difference in choice of scalar products actually makes a difference when rigid motions are considered, as we shall see in the following sections.
It is easy to see that the normal velocity of the evolution in (26.9) is given by
vT = nT v. (26.10)
Inspired by this we define the following subspace of L2 (Γ):
LT = LT (Γ) := {v ∈ L2 (Γ); v = nT v for some v ∈ R2 }. (26.11)
The elements of LT are exactly the normal velocities which come from pure
translation motions. Notice that dim LT = 2, because LT has the normal veloci-
ties v1 = nT v1 , v2 = nT v2 as a basis, whenever v1 , v2 is a basis for R2 . Now,
define ΠT = ΠT (Γ) as the orthogonal projection in L2 (Γ) onto LT . Clearly, the
identity
ΠT vT = vT (26.12)
holds because vT , given by (26.10), belongs to LT . We can use this identity to find
an explicit formula for ΠT . Multiply vT by n and integrate over Γ; then (26.10) implies that

∫Γ vT n dσ = ∫Γ (nT v) n dσ = ( ∫Γ n nT dσ ) v.   (26.13)

We call the matrix S := ∫Γ n nT dσ appearing on the right-hand side the structure tensor for the curve Γ. S is clearly positive semi-definite;
for all normal velocities v ∈ L2 (Γ). This is indeed true, as it is easily checked that
the operator Π defined by the right-hand side of (26.16) is self-adjoint (Π∗ = Π)
and idempotent (Π2 = Π), hence an orthogonal projection. Moreover, (26.15)
shows that LT is contained in the range of Π, and since the dimension of Π’s
range is two, it follows that Π = ΠT as claimed in (26.16).
where the structure tensor for Γ appears again. Since LT and LR are now orthog-
onal, it follows that the residual ΠD = I − ΠT − ΠR (I denoting the identity
operator) is also an orthogonal projection. The range of ΠD is interpreted as the
space of normal velocities which are responsible for deformations of the initial
contour.
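The decomposition described above can be illustrated numerically. The sketch below is a hypothetical implementation (not from the chapter), assuming a contour sampled at points with known unit normals and arc-length weights: it builds the structure tensor S and projects a given normal velocity onto the translation subspace LT, following the relation in (26.13).

```python
# Sketch of the translation projection Pi_T: v_T(x) = n(x)^T S^{-1} * integral(v n dsigma).
import numpy as np

def project_onto_translations(v, normals, ds):
    """v: (N,) normal velocity samples; normals: (N,2) unit normals; ds: (N,) arc-length weights."""
    S = (normals * ds[:, None]).T @ normals          # structure tensor  S = sum_i n_i n_i^T ds_i
    m = (v * ds) @ normals                           # integral of v * n over the curve
    w = np.linalg.solve(S, m)                        # translation vector in R^2
    return normals @ w                               # v_T = n^T w at every contour sample

# Example on the unit circle: a pure-translation velocity v = n . (1, 0) is reproduced exactly.
t = np.linspace(0.0, 2 * np.pi, 400, endpoint=False)
normals = np.stack([np.cos(t), np.sin(t)], axis=1)
ds = np.full_like(t, 2 * np.pi / t.size)
v = normals @ np.array([1.0, 0.0])
assert np.allclose(project_onto_translations(v, normals, ds), v)
```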
We end this section with two important observations. The first obser-
vation implies that the normal velocity constructed in (26.8) is in fact a descent
direction for the functional E(Γ).
Proposition 2 Suppose Π is an orthogonal projection in L2 (Γ) and the normal velocity v(Γ) = −Π∇E(Γ) is not identically zero on Γ. Then v(Γ) is a descent direction for E(Γ).
Proof: Let t → Γ(t) be the curve evolution which solves (26.7) with v(Γ) given by
the formula in the proposition, then the claim follows from the following simple
calculation:
(d/dt) E(Γ(t)) = ⟨∇E(Γ), v(Γ)⟩Γ = ⟨∇E(Γ), −Π∇E(Γ)⟩Γ = −‖Π∇E(Γ)‖Γ² < 0,
respectively. Since the values of E◦ (Γ) and E• (Γ) are invariant under translation
and rotation, we would not expect these functionals to generate any rigid motion
at all. In other words we expect the orthogonal projections onto LT (Γ) and LR (Γ)
of the L2 -gradients
26.4. Experiments
Fig. 26.1. The figure shows the contours of two copies of the same pigeon. The symmetric difference
of the interiors of these contours is the shaded region, i.e., the set of points belonging to exactly one of
the interiors.
represented implicitly as described in Section 26.2. The shapes are taken from the
Kimia shape database.6
We will use the gradient flow associated with the area of symmetric difference, cf.,2 between two shapes Γ = {x ∈ R² : φ(x) = 0} and Γ0 = {x ∈ R² : φ0(x) = 0}, defined as

ESD (Γ) = ESD (Γ, Γ0 ) = (1/2) area(Ωint △ Ω0int) ,   (26.21)

where A △ B denotes the symmetric difference of A and B, defined as the set of points which is contained in exactly one of the sets A or B, cf. Figure 26.1. To find the gradient of the functional ESD , we introduce the characteristic functions χΩint and χΩ0int of the interiors of Γ and Γ0 respectively, and rewrite ESD as

ESD (Γ) = (1/2) ∫R² (χΩint − χΩ0int)² dx = (1/2) ∫R² (χ²Ωint − 2 χΩint χΩ0int + χ²Ω0int) dx
        = (1/2) ∫R² (χΩint − 2 χΩint χΩ0int + χΩ0int) dx
        = ∫Ωint ( 1/2 − χΩ0int ) dx + const,

since the target contour Γ0 is held fixed. It is now easy to see that the corresponding L²-gradient is given by the normal velocity ∇ESD (Γ) = 1/2 − χΩ0int defined
on Γ. In practice the characteristic functions are represented using continuous
approximations of the Heaviside function, cf. e.g.2
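As an illustration of the resulting flow, the following sketch evolves a level set function under the plain (unprojected) symmetric-difference descent derived above. It assumes the chapter's convention that the interior is {φ < 0}, unit grid spacing and explicit Euler time stepping, and performs no re-initialisation of φ.

```python
# Sketch: normal descent velocity is  -grad E_SD = chi_{target interior} - 1/2,
# which in level-set form becomes  phi_t = (1/2 - chi_target) * |grad phi|.
import numpy as np

def evolve_symmetric_difference(phi, phi0, steps=200, dt=0.5):
    chi_target = (phi0 < 0).astype(float)          # characteristic function of the target interior
    for _ in range(steps):
        gy, gx = np.gradient(phi)
        grad_norm = np.sqrt(gx**2 + gy**2)
        phi = phi + dt * (0.5 - chi_target) * grad_norm
    return phi

# Example: a small circle evolving towards a larger, shifted target circle.
yy, xx = np.mgrid[0:128, 0:128]
phi = np.sqrt((xx - 40.0)**2 + (yy - 40.0)**2) - 15.0     # initial shape
phi0 = np.sqrt((xx - 80.0)**2 + (yy - 80.0)**2) - 30.0    # fixed target shape
phi = evolve_symmetric_difference(phi, phi0)
```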
Fig. 26.2. Examples of shape warping generated by minimizing the area of the symmetric difference
between an evolving shape and the fixed target shape. The evolving shape is the black curve and the
red curve is the target shape. The evolution is from left to right with the initial curve to the far left and
the final curve to the far right. For each example, the top row corresponds to the evolution where the
rigid motion projection is weighted higher than the deformation and the bottom row is the standard
gradient descent flow. Notice that with the standard gradient descent flow, the intermediate shapes bear
little or no resemblance to either the initial or the target shape. This problem can be solved using
the weighted projected motion. The parameters used were (μ1 , μ2 , μ3 ) = (0.3, 0.7, 0), initially,
switching to (μ1 , μ2 , μ3 ) = (0.1, 0.1, 0.8) at the end of the evolution.
another. This has also been noted for the case of using approximate Hausdorff
distance in.3 If the shapes are not perfectly aligned, the evolution will first remove details of the initial shape, reducing it to a smooth shape, and then grow new details corresponding to the target shape. This gives practically useless intermediate shapes.
If we instead partition the flow as in (26.8) and weight rotation and translation
higher than deformation, we obtain a much more intuitive flow with the desired
intermediate shapes. We illustrate this in Figure 26.2. For each example the top
row corresponds to the evolution where rigid motion projection is weighted higher
than deformation and the bottom row is the unchanged symmetric difference flow.
Fig. 26.3. Registration using the rigid part of the evolution. The initial shapes (left), shapes registered
(right).
Figure 26.3 shows some examples of this procedure where one curve is chosen as
the target shape and all other shapes are evolved towards this curve using (26.22).
26.5. Conclusions
We have presented a method for decomposing any curve evolution into rigid mo-
tion and deformation. The method is applied to shape warping and registration
problems with satisfactory results. The theory is developed for use in the level set
framework and is simple to implement. It is our opinion that problems of shape
analysis, shape statistics and shape optimization should be studied in the contin-
uum framework using the language of geometry and mathematical analysis. Many
vision problems can then be formulated as variational problems, which are usu-
ally easy to interpret, and discretizations are introduced only at the point where the
numerical solution of the derived equations is computed. This will facilitate the
understanding and comparison of different methods in the field. The aim of this
paper was to try to apply level set methods to standard problems in shape analysis
of curves. Although the method presented here is far from perfect, and certainly
not competitive with standard tools in the field, it may still be regarded as a small
step in the direction of a continuum formulation of shape analysis.
References
1. V. Caselles, R. Kimmel, and G. Sapiro. Geodesic active contours. Int. Journal of Com-
puter Vision, 1997.
2. T. F. Chan and W. Zhu. Level set based prior segmentation. Technical Report UCLA
CAM Report 03-66, University of California at Los Angeles, 2003.
3. G. Charpiat, R. Keriven, J-P. Pons, and O. Faugeras. Designing spatially coherent
minimizing flows for variational problems based on active contours. In ICCV, Beijing,
China, 2005.
4. T. Cootes, C. Taylor, D. Cooper, and J. Graham. Active shape models – their training
and application. Computer Vision and Image Understanding, 61(1):38–59, 1995.
5. D. Cremers and S. Soatto. A pseudo-distance for shape priors in level set segmentation.
In IEEE Workshop, Variational, Geometric and Level Set Methods in Computer Vision,
2003.
6. Benjamin B. Kimia. The Kimia shape data base.
http://www.lems.brown.edu/vision/software/.
7. S. J. Osher and R. P. Fedkiw. Level Set Methods and Dynamic Implicit Surfaces.
Springer Verlag, 2002.
8. N. Chr. Overgaard and J. E. Solem. Separating rigid motion for continuous shape
evolution. In Proc. Int. Conf. on Pattern Recognition, Supplemental volume, pages
1–4, Hong Kong, 2006.
9. N. Chr. Overgaard and J. E. Solem. An analysis of variational alignment of curves in
images. In Scale Space 2005, LNCS 3459, pages 480–491, Springer-Verlag 2005.
10. M. Rousson and N. Paragios. Shape priors for level set representations. In Proc. Euro-
pean Conf. on Computer Vision. Springer, 2002.
11. J. E. Solem and N. Chr. Overgaard. A geometric formulation of gradient descent for
variational problems with moving surfaces. In Scale Space 2005, LNCS 3459, pages
419–430, Springer-Verlag 2005.
12. D. Terzopoulos and A. Witkin. Physically based models with rigid and deformable
components. IEEE Comput. Graph. Appl., 8(6):41–51, 1988.
CHAPTER 27
1. Introduction
Variational methods and partial differential equations (PDEs) are increasingly being used to analyze, understand and exploit properties of
images in order to design powerful application techniques, see for
example [15, 16, 17]. Variational methods formulate an image
processing or computer vision problem as an optimization problem
depending on the unknown variables (which are functions) of the
problem. When the optimization functional is differentiable, the calculus
of variations provides a tool to find the extremum of the functional
leading to a PDE whose steady state gives the solution of the imaging or
vision problem. A very attractive property of these mathematical
frameworks is to state well-posed problems to guarantee existence,
uniqueness and regularity of solutions [16]. More recently, implicit level
set based representations of a contour [9] have become a popular
framework for image segmentation [10, 11, 1].
The integration of shape priors into PDE based segmentation methods
has been a focus of research in past years [2, 3, 4, 5, 6, 7, 8, 12, 13, 14].
Almost all of these variational approaches address the segmentation of
non-parametric shapes in images. They use training sets to introduce the
shape prior to the problem formulation in such a way that only familiar
structures of one given object can be recovered. They typically do not
permit the segmentation of several instances of the given object. This
may be attributed to the fact that a level set function is restricted to the
separation of two regions. As soon as more than two regions are
considered, the level set idea loses part of its attractiveness. These level-set methods find their largest area of application in the segmentation of medical images. After all, no one can expect to find two
instances of a human heart in a patient's scanned chest images!
On the other hand, extracting image parametric shapes and their
parameters is an important problem in several computer vision
applications. For example, extraction of a line is a crucial problem in
calculating lens distortion and matching in stereo pairs [18]. As such, our
research has addressed the application of variational methods and PDEs
to the extraction of linear shapes from images. To the best of our knowledge, there are no other efforts in that regard. Towards this end, we incorporate the parameters of the linear shape within the energy functional of an evolving level set. While existing approaches do not consider the extraction of more than one object instance in an image (a case where they would fail), our formulation allows the segmentation of multiple linear objects from an image.
The basic idea of this paper is inspired by a level set formulation of
Chan-Vese [1]. We introduce line parameters into a level set formulation
of a Chan-Vese like functional in a way that permits the simultaneous
segmentation of several lines in an image. The parameters of the line are
not specified beforehand; rather, they evolve in an unsupervised manner
in order to automatically select the image regions that are linear and the
parameters of each line are calculated. In particular, we will show that
this approach allows detecting image linear segments while ignoring
other objects. This simple, easy-to-implement method provides noise-
robust results because it relies on a region-based driving flow.
Moreover we apply the proposed PDE-based level set method to the
calibration and removal of camera lens distortion, which can be
significant in medium to wide-angle lenses. Applications that require 3-D
modelling of large scenes typically use cameras with such wide fields of
view [18]. In such instances, the camera distortion effect has to be
removed by calibrating the camera’s lens distortion and subsequently
undistorting the input image. One key feature of our method is that it
integrates the extraction of image features needed for calibration and the
computation of distortion parameters within one energy functional,
which is minimized during level set evolution. Thus our approach, unlike
most other nonmetric calibration methods [21, 22, 23], avoids the
propagation of errors in feature extraction onto the computation stage.
This results in a more robust computation even at high noise levels.
The organization of this paper is as follows: In Section 2, we briefly
review a level set formulation of the piecewise-constant Mumford-Shah
functional, as proposed in [1]. In Section 3, we augment this variational
framework by a parametric term that affects the evolution of the level set
function globally for one object in the image. In Section 4, we extend
this in order to handle more than one parametric object. In Section 5 we
describe several experiments to evaluate the proposed method. We apply
this method to lens distortion removal in Section 6. The conclusions are
presented in Section 7.
C = {(x, y) ∈ Ω : φ(x, y) = 0},
inside(C) = {(x, y) ∈ Ω : φ(x, y) > 0},   (1)
outside(C) = {(x, y) ∈ Ω : φ(x, y) < 0}.
where c1 and c2 are the mean values of the image f inside and outside the curve defined as the zero-level set of φ, respectively, and μ, v, λ1, λ2 are regularizing parameters to be estimated or chosen a priori. Hε is the regularized Heaviside function defined as [1]
Hε(s) = (1/2) ( 1 + (2/π) arctan(s/ε) ) .   (3)
so
δε(s) = dHε/ds = (1/π) · ε / (ε² + s²) .   (4)
∂φ/∂t = δε(φ) [ μ div( ∇φ/|∇φ| ) − v − λ1 (f − c1)² + λ2 (f − c2)² ] ,   (5)
where the scalars c1 and c2 are updated with the level set evolution and given by:
c1 = ∫ f(x, y) Hε(φ) dx dy / ∫ Hε(φ) dx dy ,   (6)
c2 = ∫ f(x, y) (1 − Hε(φ)) dx dy / ∫ (1 − Hε(φ)) dx dy .   (7)
Figure 1 illustrates the main advantages of this level set method.
Minimization of the functional (2) is done by alternating the two steps of
iterating the gradient descent for the level set function φ as given by (5)
and updating the mean gray values for the two phases, as given in
equations (6, 7). Implicit representation allows the boundary to perform
splitting and merging.
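For illustration, a minimal sketch of this alternation (not the authors' implementation) is given below; the curvature term of Eq. (5) is omitted for brevity, which corresponds to taking μ = 0.

```python
# Sketch of one Chan-Vese iteration: update c1, c2 via Eqs. (6)-(7), then take one
# gradient-descent step of Eq. (5) for phi (inside(C) = {phi > 0} in this chapter).
import numpy as np

def heaviside(s, eps=1.0):
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(s / eps))

def chan_vese_step(phi, f, lam1=1.0, lam2=1.0, v=0.0, eps=1.0, dt=0.5):
    H = heaviside(phi, eps)
    delta = (1.0 / np.pi) * eps / (eps**2 + phi**2)
    c1 = (f * H).sum() / (H.sum() + 1e-8)                   # mean inside  (Eq. 6)
    c2 = (f * (1.0 - H)).sum() / ((1.0 - H).sum() + 1e-8)   # mean outside (Eq. 7)
    dphi = delta * (-v - lam1 * (f - c1)**2 + lam2 * (f - c2)**2)
    return phi + dt * dphi, c1, c2
```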
where θ is the orientation of the normal to the line with respect to the x axis, and ρ is the distance of the line from the origin. The square distance, r², of a point (x1, y1) from the line is obtained by plugging the coordinates of the point into (9):
If the points inside the zero level set represent a line, E Line will tend to be
zero.
Keeping ρ and θ constant and minimizing this energy functional
(11) with respect to φ , we deduce the associated Euler-Lagrange
equation for φ as
∂ELine/∂φ = δε(φ) [ (ρ − x cos θ − y sin θ)² ] .   (12)
Keeping φ fixed and setting ∂ELine/∂ρ = 0 and ∂ELine/∂θ = 0, it is
straightforward to solve for the line’s ρ and θ parameters as:
where x̄ and ȳ represent the centroid of the region inside the zero level set, given by [26, 27]:
x̄ = ∫Ω x Hε(φ) dx dy / ∫Ω Hε(φ) dx dy ,   ȳ = ∫Ω y Hε(φ) dx dy / ∫Ω Hε(φ) dx dy ,   (14)
and
θ = (1/2) arctan( a2 / (a1 − a3) ) ,   (15)
a1 = ∫Ω (x − x̄)² Hε(φ) dx dy ,   (16)
Figure 1. Evolution of the boundary for the Chan-Vese level set (with a single level set
function). Due to the implicit level set representation, the topology is not constrained,
which allows for splitting and merging of the boundary.
a2 = 2 ∫Ω (x − x̄)(y − ȳ) Hε(φ) dx dy ,   (17)
a3 = ∫Ω (y − ȳ)² Hε(φ) dx dy .   (18)
The Euler-Lagrange equation for the total functional (8) can now be
implemented by the following gradient descent:
∂φ/∂t = δε(φ) [ μ div( ∇φ/|∇φ| ) − v − λ1 (f − c1)² + λ2 (f − c2)² − α (ρ − x cos θ − y sin θ)² ] ,   (19)
where the scalars c1, c2, ρ, and θ are updated with the level set evolution according to Eqs. (6, 7, 13-18).
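A sketch of the line-parameter update is given below. Equations (14)-(18) are implemented directly; Eq. (13) is not reproduced in the text above, so the expression ρ = x̄ cos θ + ȳ sin θ (the line passing through the weighted centroid) is an assumption made only for this illustration.

```python
# Sketch of the H_eps-weighted moment computations (14)-(18); the rho formula is assumed.
import numpy as np

def line_parameters(phi, eps=1.0):
    H = 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))
    yy, xx = np.mgrid[0:phi.shape[0], 0:phi.shape[1]].astype(float)
    w = H.sum() + 1e-8
    xbar, ybar = (xx * H).sum() / w, (yy * H).sum() / w      # Eq. (14)
    a1 = ((xx - xbar)**2 * H).sum()                           # Eq. (16)
    a2 = 2.0 * ((xx - xbar) * (yy - ybar) * H).sum()          # Eq. (17)
    a3 = ((yy - ybar)**2 * H).sum()                           # Eq. (18)
    theta = 0.5 * np.arctan2(a2, a1 - a3)                     # Eq. (15); arctan2 avoids division by zero
    rho = xbar * np.cos(theta) + ybar * np.sin(theta)         # assumed form of Eq. (13)
    return rho, theta
```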
The weights λ1 and λ2 can be used to speed up the evolution towards
the object boundaries, while μ and v regulate the zero level set.
4. Multi-Object Segmentation
The previous method works only if there is one object in the image. If
this object is linear, it will be detected, whereas other shapes are ignored.
If there is more than one object, H(φ) will represent all those objects and Equations (13-18) will not be applicable. In this section we extend our method in order to perform multiple region segmentation based on fuzzy memberships that are computed by the Fuzzy C-means algorithm (FCM) [19].
Σi uik = 1 ,   ∀ 1 ≤ k ≤ n .   (20)
vi = ( Σk=1..n uikᵐ xk ) / ( Σk=1..n uikᵐ ) ,   ∀ i .   (23)
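For reference, a generic fuzzy C-means iteration consistent with Eqs. (20) and (23) might look as follows. Since the chapter's objective function is only partially reproduced above, the membership update used here is the standard FCM one and should be read as an assumption.

```python
# Sketch of FCM on grey values: u[i, k] is the membership of pixel k in cluster i,
# m > 1 the fuzzifier, v[i] the cluster centres (Eq. 23).
import numpy as np

def fcm(x, n_clusters=2, m=2.0, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    x = x.reshape(-1).astype(float)                      # grey values of all pixels
    u = rng.random((n_clusters, x.size))
    u /= u.sum(axis=0, keepdims=True)                    # enforce Eq. (20): sum_i u_ik = 1
    for _ in range(n_iter):
        um = u ** m
        v = (um @ x) / um.sum(axis=1)                    # centre update, Eq. (23)
        d = np.abs(x[None, :] - v[:, None]) + 1e-8       # distances to centres
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=0, keepdims=True)                # standard FCM membership update (assumed)
    return u, v
```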
Figure 2. Evolution of the boundary for the level set under the functional (8). Due to the
ELine term, the final shape of the boundary is the second moment axis of the object.
Increasing v causes a smaller part of the object axis to be detected. Further increase in v
leaves the nonlinear object undetected.
The term ELine of the energy functional (8) is still given by (11), whereas the term Eseg is now based on minimizing several level set functions {φi}:

Eseg(φ) = Σi=1..N−1 [ λ ∫Ω [ (1 − ui) Hε(φi) + ui (1 − Hε(φi)) ] dx + μ ∫Ω |∇Hε(φi)| dx + v ∫Ω Hε(φi) dx ] ,   (25)
which, up to an additive constant independent of φi, can be rewritten as

Eseg(φ) = Σi=1..N−1 [ λ ∫Ω (1 − 2ui) Hε(φi) dx + μ ∫Ω |∇Hε(φi)| dx + v ∫Ω Hε(φi) dx ] ,   (26)
The corresponding gradient descent for each φi is

∂φi/∂t = δε(φi) [ μ div( ∇φi/|∇φi| ) − v − λ (1 − 2ui) − α (ρi − x cos θi − y sin θi)² ] ,   (27)
where the scalars ρi and θi are updated with the level set evolution according to (13)-(18). The overall level set representation is eventually obtained from the final {φi} as max(φi) over all i.
One problem however may arise if multiple disjoint objects belong to
the same cluster (e.g., if they have the same color). Therefore after the
initial clustering by the FCM algorithm, connected-component labeling
is carried out on a hardened version of the result so objects within the
same cluster are separated. Each object part is represented by an image
that contains the membership information but with the other twin objects replaced by 0. Note that N in (26) will thus be increased accordingly.
5. Experimental Results
In order to evaluate the performance of the proposed technique on line
segmentation, several experiments using synthetic and real images have
been carried out. In the experiments, we choose the regularizing
parameters as follows: α = 1, λ = 10, μ = 0.5, and v = 10. As our method is region-based, it is robust in noisy images. This is
demonstrated in Fig. 3. All lines have been successfully extracted from
an image artificially corrupted with high noise with standard deviation
σ = 45 . Note that due to the shape constraints, our method again extracts
only the lines and ignores other objects. For the sake of comparison, the
result of the classical Hough transform applied to the same test image
without noise, is shown in Fig. 3(c). Apparently, the level set method
extracts only linear objects in the image, whereas Hough transform can
also detect linear boundaries of objects (e.g., the box). However once the
noise level in the image increases, Hough transform will face some
problems. This is because it depends largely on edge detection, it is
Figure 3. Extracted lines from highly noisy image (noise standard deviation = 45),
(a) Input image, (b) The final result using our method, (c) Result of Hough transform on
the noise-free image, (d) Result of Hough transform on the noisy image.
Figure 4. Extraction of intersected lines with different intensities. (a) The input image, (b)
Initial level set based on FCM clustering, (c) Final result.
Figure 5. A real image "Birds on power lines". (a) Input image, (b) Hardened output of
FCM algorithm, (c) Initial level set based on the output of FCM algorithm, (d) Final level
set, showing how our method excluded the birds.
Figure 6. A real image "A running track". (a) Input image, (b) Initial level set based on
the output of FCM algorithm, (c) The final level set imposed on the image.
The closest work to ours is that of Kang [24]. He used the traditional
snake to calculate the radial lens distortion parameters. However, his
method is sensitive to the location of the initial contour, so the user
should specify the position of the initial contour. In contrast, our level-set
based method has a global convergence property that makes it insensitive to the initial level set.
We start by briefly describing a standard model for lens distortion in camera lenses, and then formulate our approach.
The standard model for the radial and decentering distortion [20, 21, 28] is a mapping from the observable, distorted image coordinates, (x, y), to the unobservable, undistorted image plane coordinates, (xu, yu). Neglecting all coefficients other than the first radial distortion term, the model becomes:
xu = x + x̄ (κ r²) ,
yu = y + ȳ (κ r²) ,   (28)
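A small sketch of the mapping (28) is given below; the distortion centre (cx, cy) and r² = (x − cx)² + (y − cy)² are assumptions consistent with Eq. (30), since the page defining them is not reproduced here.

```python
# Sketch of the first-order radial model (28) under the assumptions stated above.
import numpy as np

def undistort_points(x, y, kappa, cx, cy):
    """Map observed (distorted) coordinates to undistorted coordinates."""
    xb, yb = x - cx, y - cy                  # coordinates relative to the assumed distortion centre
    r2 = xb**2 + yb**2
    return x + xb * kappa * r2, y + yb * kappa * r2

# Example: undistort the pixel grid of a 160 x 120 image for some kappa.
xx, yy = np.meshgrid(np.arange(160.0), np.arange(120.0))
xu, yu = undistort_points(xx, yy, kappa=1e-6, cx=80.0, cy=60.0)
```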
Our goal here is to use the energy functional (8) in order to force the
level set to segment linear, or supposed-to-be-linear, objects from the image
and simultaneously solve for the lens distortion parameter. The algorithm
outlined in Section 4 is used here. However the E Line term of the energy
functional becomes
which measures how well a level set represents a line in the undistorted image coordinates (xu, yu), with θi being the orientation of the normal to the line, and ρi being the distance to the line from the origin. Note that the undistorted coordinates are related to the given distorted image coordinates (x, y) via the distortion parameter κ as in (28). To find the κ that minimizes the total energy functional E, we start with an initial guess κ0 (in our implementation, we take it to be 0). Introducing an artificial time t, κ is then updated according to the gradient descent rule
∂κ/∂t = − ∂E/∂κ ,
where
∂E/∂κ = 2α Σi=1..N−1 ∫Ω (xu cos θi + yu sin θi − ρi) [ (x − cx) r² cos θi + (y − cy) r² sin θi ] H(φi) dx dy   (30)
Note that κ is updated based on all level sets, while each level set is updated by deducing the associated Euler-Lagrange equation for φi:
∂φi/∂t = − ∂E/∂φi = δε(φi) [ μ div( ∇φi/|∇φi| ) − v − λ (1 − 2ui) − α (ρi − xu cos θi − yu sin θi)² ] ,   (31)
where the scalars ρi , θi , and κ are updated with the level set evolution
according to (13, 15, 31). In the steady state the value of κ is the
required lens distortion coefficient.
image of a group of straight lines on a white paper; see Fig. 7(a). Such a
calibration pattern is easily prepared (e.g., with just a printer) without
any special construction overhead. Another sample image captured by
the same camera is shown in Fig. 7(b). Both acquired images are 160 × 120 and have noticeable lens distortion. Our approach is then applied to the calibration image to recover the value of the lens distortion parameter. Figs. 7(c-d) show the initial and final zero-level sets, respectively. Our method took less than a minute on a P4 2.8 GHz PC. The
Figure 7. Lens distortion removal from real images: (a) The calibration image which is
used to get κ , (b) An input distorted image, (c) Initial zero level set, (d) Final zero level
set (e) Calibration image undistorted, (f) Image in (b) undistorted using the obtained κ .
7. Conclusions
We have presented a new variational approach to integrate parametric
shapes into level set-based segmentation. In particular, we addressed the
problem of extracting linear image objects, selectively, while other
image objects are ignored. Our method is inspired by ideas introduced by
Chan and Vese by formulating a new energy functional taking into
account the line parameters. By simultaneously minimizing the proposed
energy functional with respect to the level set function and the line
parameters, the linear shapes are detected while the line parameters are
obtained. The method is extended using fuzzy memberships to simultaneously segment lines of different intensities, and is shown experimentally to do so even in images with large noise.
We have also applied the proposed approach to calibrate camera lens
distortion. In order to achieve this, the formulated energy functional depends on the lens distortion parameters as well. By evolving the level set functions minimizing that energy functional, the image lines and lens distortion parameters are obtained. All this approach needs is an image captured by the camera of a group of straight lines on a white paper. Such a calibration pattern is easily prepared (e.g., with just a printer) without any special construction overhead. One key advantage of our method is that it integrates the extraction of image features needed for calibration and the computation of distortion parameters, thus avoiding the propagation of feature extraction errors onto the computation stage.
References
1. T. Chan and L. Vese, “Active contours without edges”, IEEE Trans. Image
Processing, 2001, pp. 266–277.
2. L. Staib and J. Duncan. “Boundary finding with parametrically deformable
models”, IEEE Trans. on Patt. Anal. and Mach. Intel., pp. 1061–1075, 1992.
3. T. Cootes, A. Hill, C. Taylor, and J. Haslam, “Use of active shape models for
locating structures in medical images”, Image and Vision Computing, 1994, pp.
355–365.
4. M. Leventon, W. L. Grimson, and O. Faugeras, “Statistical shape influence in
geodesic active contours”, in Proc. Conf. Computer Vis. and Pattern Recog.,
volume 1, Hilton Head Island, SC, June 13–15, 2000, pp. 316–323.
5. A. Tsai, A. Yezzi, W. Wells, C. Tempany, D. Tucker, A. Fan, E. Grimson, and A.
Willsky, “Model–based curve evolution technique for image segmentation”, in
Conf. on Comp. Vision and Patt. Recog., Kauai, Hawaii, 2001, pp. 463–468.
6. D. Cremers, F. Tischhäuser, J. Weickert, and C. Schnörr, “Diffusion snakes:
introducing statistical shape knowledge into the Mumford–Shah functional”, Int. J.
of Comp. Vision, 2002, pp.295–313.
7. D. Cremers, T. Kohlberger, and C. Schnörr, “Nonlinear shape statistics in
Mumford–Shah based segmentation”, in A. Heyden et al., editors, Proc. of the
Europ. Conf. on Comp. Vis., Copenhagen, May 2002, volume 2351 of LNCS, pp.
93–108.
8. M. Rousson and N. Paragios, “Shape priors for level set representations”, in A.
Heyden et al., editors, Proc. of the Europ. Conf. on Comp. Vis., Copenhagen, May
2002, volume 2351 of LNCS, pp. 78–92.
9. S. Osher and J. Sethian, “Fronts propagation with curvature dependent speed:
Algorithms based on Hamilton–Jacobi formulations”, J. of Comp. Phys., 1988, pp.
12–49.
10. V. Caselles, R. Kimmel, and G. Sapiro, “Geodesic active contours”, in Proc. IEEE
Internat. Conf. on Comp. Vision, Boston, USA, 1995, pp. 694–699.
27. D. Cremers, S. Osher, S. Soatto, “Kernel density estimation and intrinsic alignment
for shape priors in level set segmentation”, International Journal of Computer
Vision, 69(3), pp. 335-351, 2006.
28. M. Ahmed and A. Farag, “Nonmetric calibration of camera lens distortion:
differential methods and robust estimation”, IEEE Trans. on image processing, vol.
14, no. 8, pp. 1215-1230, 2005.
29. M. El-Melegy and N. Al-Ashwal, “Lens distortion calibration using level sets”,
Lecture Notes in Computer Science, N. Paragios et al. (Eds.)., Springer-Verlag,
Berlin, LNCS 3752, pp. 356 – 367, 2005.
CHAPTER 28
In this paper, common colour models for background subtraction and problems related to their utilisation are discussed. A novel approach to repre-
sent chrominance information more suitable for robust background modelling
and shadow suppression is proposed. Our method relies on the ability to rep-
resent colours in terms of a 3D-polar coordinate system having saturation in-
dependent of the brightness function; specifically, we build upon an Improved
Hue, Luminance, and Saturation space (IHLS). The additional peculiarity of the
approach is that we deal with the problem of unstable hue values at low satura-
tion by modelling the hue-saturation relationship using saturation-weighted hue
statistics. The effectiveness of the proposed method is shown in an experimental
comparison with approaches based on RGB, Normalised RGB and HSV.
28.1. Introduction
The underlying step of visual surveillance applications like target tracking and
scene understanding is the detection of moving objects. Background subtraction
algorithms are commonly applied to detect these objects of interest by the use of
statistical colour background models. Many present systems exploit the properties
of the Normalised RGB to achieve a certain degree of insensitivity with respect to
changes in scene illumination.
Hong and Woo1 apply the Normalised RGB space in their background seg-
mentation system. McKenna et al.2 use this colour space in addition to gradient
information for their adaptive background subtraction. The AVITRACK project3
utilises Normalised RGB for change detection and adopts the shadow detection
proposed by Horprasert et al.4
In this section, the Normalised RGB and IHLS colour spaces used in this paper are described. This section also gives a short overview of circular colour statistics and a review of saturation-weighted hue statistics.
The Normalised RGB space aims to separate the chromatic components from the brightness component. The red, green and blue channels can be transformed to their normalised counterparts by using the formulae r = R/(R + G + B), g = G/(R + G + B) and b = B/(R + G + B).
Fig. 28.1. Examples of chromatic components. Lexicographically ordered - Image from the
PETS2001 dataset, it’s normalised blue component b, normalised saturation (cylindrical HSV), IHLS
saturation.
The Improved Hue, Luminance and Saturation (IHLS) colour space was intro-
duced in.10 It is obtained by placing an achromatic axis through all the grey
(R = G = B) points in the RGB colour cube, and then specifying the coordinates
of each point in terms of position on the achromatic axis (brightness), distance
from the axis (saturation s) and angle with respect to pure red (hue θH ). The IHLS
model is improved with respect to the similar colour spaces (HLS, HSI, HSV, etc.)
by removing the normalisation of the saturation by the brightness. This has the
following advantages: (a) the saturation of achromatic pixels is always low and (b)
the saturation is independent of the brightness function used. One may therefore
choose any function of R, G and B to calculate the brightness.
s = max(R, G, B) − min(R, G, B)
y = 0.2125 R + 0.7154 G + 0.0721 B
crx = R − (G + B)/2 ,   cry = (√3/2)(B − G)
cr = √(crx² + cry²)   (28.2)
θH = undefined if cr = 0;   arccos(crx/cr) if cry ≤ 0;   360° − arccos(crx/cr) otherwise,
where crx and cry denote the chrominance coordinates and cr ∈ [0, 1] the chroma.
The saturation assumes values in the range [0, 1] independent of the hue angle (the
maximum saturation values are shown by the circle on the chromatic plane in Fig-
ure 28.2). The chroma has the maximum values shown by the dotted hexagon in
Figure 28.2. When using this representation, it is important to remember that the
hue is undefined if s = 0, and that it does not contain much useable information
when s is low (i.e. near to the achromatic axis).
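The conversion (28.2) can be transcribed almost directly; the sketch below (illustrative only) returns the luminance, saturation and hue, with the hue marked as undefined (NaN) for achromatic pixels.

```python
# Sketch of Eq. (28.2): IHLS luminance, saturation and hue from RGB arrays.
import numpy as np

def rgb_to_ihls(R, G, B):
    s = np.maximum(np.maximum(R, G), B) - np.minimum(np.minimum(R, G), B)
    y = 0.2125 * R + 0.7154 * G + 0.0721 * B
    crx = R - (G + B) / 2.0
    cry = (np.sqrt(3.0) / 2.0) * (B - G)
    cr = np.sqrt(crx**2 + cry**2)
    with np.errstate(invalid="ignore", divide="ignore"):
        ang = np.degrees(np.arccos(np.clip(crx / cr, -1.0, 1.0)))
    theta_H = np.where(cry <= 0, ang, 360.0 - ang)
    theta_H = np.where(cr == 0, np.nan, theta_H)    # hue undefined for achromatic pixels
    return y, s, theta_H
```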
[Fig. 28.2: the chromatic plane, with pure red at 0° and the green and blue directions indicated; two hue vectors c1 and c2 are drawn with angles θ1, θ2 and lengths ‖c1‖ = s1, ‖c2‖ = s2.]
where
C = Σi=1..n cos θiH ,   S = Σi=1..n sin θiH   (28.4)
V = 1−R (28.6)
a Note that, when using the IHLS space (Eq. 28.3), no costly trigonometric functions are involved in
the calculation of hi , since cos(θiH ) = crx /cr and sin(θiH ) = −cry /cr.
While the circular variance differs from the linear statistical variance in being
limited to the range [0, 1], it is similar in the way that lower values represent less
dispersed data. Further measures of circular data distribution are given in.14
with co = so ho . Here, ho and so denote the observed hue vector and saturation
respectively.
With the foundations laid out in Section 28.2.4 we proceed with devising a simple
background subtraction algorithm based on the IHLS colour model and saturation-
weighted hue statistics. Specifically, each background pixel is modelled by its
mean luminance μy and associated standard deviation σy , together with the mean
chrominance vector cn and the mean Euclidean distance σD between cn and the
observed chrominance vectors (see Eq. 28.10).
On observing the luminance yo , saturation so , and a Cartesian hue vector ho
for each pixel in a newly acquired image, the pixel is classified as foreground if:
|yo − μy| > α σy   ∨   ‖cn − so ho‖ > α σD   (28.11)
where α is the foreground threshold, usually set between 2 and 3.5.
In order to decide whether a foreground detection was caused by a moving
object or by its shadow cast on the static background, we exploit the chrominance
information of the IHLS space. A foreground pixel is considered as shaded back-
ground if the following three conditions hold:
yo < μy ∧ |yo − μy| < β μy ,   (28.12)
so − Rn < τds ,   (28.13)
‖ho Rn − cn‖ < τh ,   (28.14)
where Rn = ‖cn‖ (see Eq. 28.8).
These equations are designed to reflect the empirical observations that cast
shadows cause a darkening of the background and usually lower the saturation
of a pixel, while having only limited influence on its hue. The first condition
(Eq. 28.12) works on the luminance component, using a threshold β to take into
account the strength of the predominant light source. Eq. 28.13 performs a test
for a lowering in saturation, as proposed by Cucchiara et al.6 Finally, the lowering
in saturation is compensated by scaling the observed hue vector ho to the same
length as the mean chrominance vector cn and the hue deviation is tested using
the Euclidean distance (Eq. 28.14). This, in comparison to a check of angular de-
viation (see Eq. 28.31 or6 ), also takes into account the model’s confidence in the
learned chrominance vector. That is, using a fixed threshold τh on the Euclidean
distance relaxes the angular error-bound in favour of stronger hue deviations at
lower model saturation value Rn , while penalising hue deviations for high satu-
rations (where the hue is usually more stable).
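Putting the tests (28.11)-(28.14) together, a per-pixel classification might be sketched as follows; the threshold values used are placeholders, and the background statistics (μy, σy, cn, σD) are assumed to be maintained elsewhere.

```python
# Sketch of the foreground and shadow tests; h_o is the observed Cartesian hue vector,
# s_o the observed saturation, y_o the observed luminance.
import numpy as np

def classify(y_o, s_o, h_o, mu_y, sig_y, c_n, sig_D,
             alpha=2.5, beta=0.6, tau_ds=0.1, tau_h=0.1):
    R_n = np.linalg.norm(c_n, axis=-1)                                   # R_n = ||c_n||
    chrom_dist = np.linalg.norm(c_n - s_o[..., None] * h_o, axis=-1)
    foreground = (np.abs(y_o - mu_y) > alpha * sig_y) | (chrom_dist > alpha * sig_D)   # (28.11)
    shadow = ((y_o < mu_y) & (np.abs(y_o - mu_y) < beta * mu_y)                        # (28.12)
              & (s_o - R_n < tau_ds)                                                   # (28.13)
              & (np.linalg.norm(h_o * R_n[..., None] - c_n, axis=-1) < tau_h))         # (28.14)
    return foreground & ~shadow
```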
FR = FP / (N − (FN + TP))   (28.16)
where TP denotes the number of true positives, FN the number of false negatives, FP the number of false positives, and N the total number of pixels in the image.
• Misclassification penalty (MP)
The obtained segmentation is compared to the reference mask on an object-
by-object basis; misclassified pixels are penalized by their distances from the
reference object's border.15
MP = MPfn + MPfp   (28.17)
with
MPfn = ( Σj=1..Nfn dfn,j ) / D ,   (28.18)
MPfp = ( Σk=1..Nfp dfp,k ) / D .   (28.19)
Here, dfn,j and dfp,k stand for the distances of the j-th false negative and k-th false positive pixel from the contour of the reference segmentation. The normalisation factor D is the sum of all pixel-to-contour distances in a frame.
• Rate of misclassifications (RM)
The average normalised distance of detection errors from the contour of a
reference object is calculated using:16
RM = RMfn + RMfp   (28.20)
with
RMfn = (1/Nfn) Σj=1..Nfn dfn,j / Ddiag ,   (28.21)
RMfp = (1/Nfp) Σk=1..Nfp dfp,k / Ddiag .   (28.22)
Nfn and Nfp denote the number of false negative and false positive pixels respectively. Ddiag is the diagonal distance within the frame.
• Weighted quality measure (QMS)
This measure quantifies the spatial discrepancy between estimated and refer-
ence segmentation as the sum of weighted effects of false positive and false
negative pixels.17
QMS = QMSfn + QMSfp   (28.23)
with
QMSfn = (1/N) Σj=1..Nfn wfn(dfn,j) dfn,j ,   (28.24)
QMSfp = (1/N) Σk=1..Nfp wfp(dfp,k) dfp,k .   (28.25)
N is the area of the reference object in pixels. Following the argument that the visual importance of false positives and false negatives is not the same, and thus they should be treated differently, the weighting functions wfp and wfn were introduced:
wfp(dfp) = B1 + B2 / (dfp + B3) ,   (28.26)
wfn(dfn) = C · dfn .   (28.27)
In our work, for a fair comparison of the change detection algorithms with re-
gard to their various decision parameters, receiver operating characteristics (ROC)
based on detection rate (DR) and false alarm rate (FR) were utilised.
We compared the proposed IHLS method with three different approaches from
literature. Namely, a RGB background model using either NRGB- (RGB+NRGB),
or HSV-based (RGB+HSV) shadow detection, and a method relying on NRGB for
both background modelling and shadow detection (NRGB+NRGB).
All methods were implemented using the Colour Mean and Variance approach
to model the background.18 A pixel is considered foreground if |co − μc | > ασc
for any channel c, where c ∈ {r, g, l} for the Normalised RGB and c ∈ {R, G, B}
for the RGB space respectively. oc denotes the observed value, μc its mean, σc
the standard deviation, and α the foreground threshold.
The tested background models are maintained by means of exponentially
weighted averaging18 using different learning rates for background and fore-
ground pixels. During the experiments the same learning and update parameters
were used for all background models, as well as the same number of training
frames.
For Normalised RGB (RGB+NRGB, NRGB+NRGB), shadow suppression was
implemented based on Horprasert’s approach.3,4 Each foreground pixel is classi-
fied as shadow if:
where β and τc denote thresholds for the maximum allowable change in the inten-
sity and colour channels, so that a pixel is considered as shaded background.
In the HSV-based approach (RGB+HSV) the RGB background model is con-
verted into HSV (specifically, the reference luminance μv , saturation μs , and hue
μθ ) before the following shadow tests are applied. A foreground pixel is classified
as shadow if:
β1 ≤ vo/μv ≤ β2   (28.29)
so − μs ≤ τs   (28.30)
|θoH − μθ| ≤ τθ   (28.31)
The first condition tests the observed luminance vo for a significant darkening in
the range defined by β1 and β2. On the saturation so, a threshold on the difference is applied. Shadow lowers the saturation of points and the difference between
images and the reference is usually negative for shadow points. The last condition
takes into account the assumption that shading causes only small deviation of the
hue θoH .6
For the evaluation of the algorithms, three video sequences were used. As an
example for a typical indoor scene Test Sequence 1, recorded by an AXIS-211
network camera, shows a moving person in a stairway. For this sequence, ground
truth was generated manually for 35 frames. Test Sequence 2 was recorded with
the same equipment and shows a person waving books in front of a coloured back-
ground. For this sequence 20 ground truth frames were provided. Furthermore in
Test Sequence 3 the approaches were tested on 25 ground truth frames from the
PETS2001 dataset 1 (camera 2, testing sequence). Example pictures of the dataset
can be found in Figure 28.3.
their insufficient sensitivity for bright colours. RGB+HSV gave better results, but
could not take full advantage of the colour information. Similar hue values for
the books and the background resulted in incorrectly classified shadow regions.
Figure 28.5 shows output images from Test Sequence 2. Especially the lower left
part of the images (c), (d), (e), and (f) visualizes a better performance of the IHLS
approach.
the presence of noise in this scene, the chromatic components are unstable and
therefore the motion detection resulted in a significantly increased number of
false positives. RGB+NRGB and our approach exhibit similar performance (our
approach having the slight edge), mostly relying on brightness checks, since there
was not much useable information in shadow regions. RGB+HSV performed less
well, having problems to cope with the unstable hue information in dark areas.
Figure 28.6 shows output images Test Sequence 3.
28.6. Conclusion
We proposed the usage of the IHLS colour space for change detection and shadow
suppression in visual surveillance tasks. In the proposed framework, we advocate
the application of saturation-weighted hue statistics to deal with the problem of
the unstable hue channel at weakly saturated colours.
We have shown that our approach outperforms the approaches using Nor-
malised RGB or HSV in several challenging sequences. Furthermore, our experi-
ments have shown that it is not advisable to use NRGB for background modelling
due to its unstable behaviour in dark areas.
One problem of our approach, however, is the fact that due to the use of satu-
ration weighted hue statistics, it is impossible to tell whether a short chrominance
vector in the background model is the result of unstable hue information or of
a permanent low saturation. Although in the conducted experiments no impair-
ments were evident, it is a subject of further research in which cases this shortcoming poses a problem.
Fig. 28.7. Experimental results: ROCs for Test Sequence 1 (a), Test Sequence 2 (b), and Test Se-
quence 3 (c).
Other fields of interest are the examination of alternatives to the Euclidean distance for the comparison of the chrominance vectors and an experimental in-depth investigation of the shadow classification.
References
CHAPTER 29
Interactive techniques for extracting the foreground object from an image have
been the interest of research in computer vision for a long time. This paper ad-
dresses the problem of an efficient, semi-interactive extraction of a foreground
object from an image. Snake (also known as Active contour) and GrabCut are
two popular techniques, extensively used for this task. Active contour is a de-
formable contour, which segments the object using boundary discontinuities by
minimizing the energy function associated with the contour. GrabCut provides
a convenient way to encode color features as segmentation cues to obtain fore-
ground segmentation from local pixel similarities using modified iterated graph-
cuts. This paper first presents a comparative study of these two segmentation
techniques, and illustrates conditions under which either or both of them fail.
We then propose a novel formulation for integrating these two complimentary
techniques to obtain an automatic foreground object segmentation. We call our
proposed integrated approach “SnakeCut”, which is based on a probabilistic
framework. To validate our approach, we show results both on simulated and
natural images.
29.1. Introduction
Interactive techniques for extracting the foreground object from an image have
been the interest of research in computer vision for a long time. Snake (Active
contour)1 and GrabCut2 are two popular semi-automatic techniques, extensively
used for foreground object segmentation. Active contour is a deformable con-
tour, which segments the object using boundary discontinuities by minimizing the
energy function associated with the contour. Deformation of the contour is caused by internal and external forces acting on it. Internal force is derived from
the contour itself and external force is invoked from the image. The internal and
external forces are defined so that the snake will conform to object boundary or
other desired features within the image. Snakes are widely used in many applica-
tions such as segmentation,3,4 shape modeling,5 edge detection,1 motion tracking6
etc. Active contours can be classified as either parametric active contours1,7 or
geometric active contours,8,9 according to their representation and implementa-
tion. In this work, we focus on using parametric active contours, which synthesize
parametric curves within the image domain and allow them to move towards the
desired image features under the influence of internal and external forces. The
internal force serves to impose piecewise continuity and smoothness constraints,
whereas external force pushes the snake towards salient image features like edges,
lines and subjective contours.
GrabCut2 is an interactive tool based on iterative graph-cut for foreground
object segmentation in still images. GrabCut provides a convenient way to encode
color features as segmentation cues to obtain foreground segmentation from local
pixel similarities and global color distribution using modified iterated graph-cuts.
GrabCut extends graph-cut to color images and to incomplete trimaps. GrabCut
has been applied in many applications for the foreground extraction.10–12
Since Active Contour uses gradient information (boundary discontinuities)
present in the image to estimate the object boundary, it can detect the object
boundary efficiently but cannot penetrate inside the object boundary. It cannot
remove any pixel present inside the object boundary which does not belong to a
foreground object. Example of such case is the segmentation of an object with
holes. On the other hand, GrabCut works on the basis of pixel color (intensity)
distribution and so it has the ability to remove interior pixels which are not the
part of the object. Major problem with the GrabCut is: if some part of the fore-
ground object has color distribution similar to the image background, that part
will also be removed in GrabCut segmentation. In the GrabCut algorithm,2 miss-
ing foreground data is recovered by user interaction. This paper first presents
a comparative study of these two segmentation techniques. We then present a
semi-automatic technique based on the integration of Active Contour and Grab-
Cut which can produce correct segmentation in cases where both Snake and Grab-
Cut fail. We call our technique “SnakeCut”; it is based on integrating the
outputs of Snake and GrabCut using a probabilistic framework. In SnakeCut, user
needs to only specify a rectangle (or polygon) enclosing the foreground object.
No post-corrective editing is required in our approach. The proposed technique is used
to segment a single object from an image.
The rest of the paper is organized as follows. In section 29.2, we briefly present
Active Contour and GrabCut techniques which provides the theoretical basis for
the paper. Section 29.3 compares the two techniques and discusses the limitations
of both. In section 29.4, we present the SnakeCut algorithm, our proposed seg-
29.2. Preliminaries
where ∇ is the gradient operator. For color images, we estimate the intensity gradient
which takes the maximum of the gradients of R, G and B bands at every pixel,
using:
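Equation (29.3) itself is not reproduced above; the sketch below simply implements the rule described in the text, taking at every pixel the largest gradient magnitude over the R, G and B bands.

```python
# Sketch: per-pixel maximum of the band-wise gradient magnitudes of a colour image.
import numpy as np

def color_gradient_magnitude(img):
    """img: float array of shape (H, W, 3)."""
    mags = []
    for c in range(3):
        gy, gx = np.gradient(img[..., c])
        mags.append(np.sqrt(gx**2 + gy**2))
    return np.maximum.reduce(mags)         # per-pixel maximum over the three bands
```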
Figure 29.1(b) shows an example of intensity gradient estimation using Eq.
29.3 for the image shown in Figure 29.1(a). Figure 29.1(d) shows the intensity
gradient for the same input image estimated from its gray scale image (Figure
29.1(c)). The gradient obtained using Eq. 29.3 gives better edge information. A
snake that minimizes Esnake must satisfy the following Euler equation13
η1 v''(s) − η2 v''''(s) − ∇Eext = 0   (29.4)
where v''(s) and v''''(s) are the second and fourth order derivatives of v(s). Eq.
29.4 can also be viewed as a force balancing equation, Fint + Fext = 0, where
Fint = η1 v''(s) − η2 v''''(s) and Fext = −∇Eext . Fint , the internal force, is
responsible for stretching and bending and Fext , the external force, attracts the
snake towards the desired features in the image. To find the object boundary,
Active Contour deforms so it can be represented as a time varying curve v(s, t) =
[x(s, t), y(s, t)] where s ∈ [0, 1] is arc length and t ∈ R+ is time. Dynamics
of the contour in presence of external and internal forces can be governed by the
following equation
29.2.2. GrabCut
where,
D(αn , kn , θ, zn ) = − log p(zn |αn , kn , θn ) − log π(αn , kn ) (29.8)
Here, p(.) is a Gaussian probability distribution, and π(.) are mixture weight-
ing coefficients. Therefore, the parameters of the model are now θ =
{π(α, k), μ(α, k), Σ(α, k); α = 0, 1; k = 1..K}, where π, μ and Σ’s represent
the weights, means and covariances of the 2K Gaussian components for the back-
ground and the foreground distributions. In Equation 29.6, the term V is called
the smoothness term and is given as follows:
1
V (α, z) = γ [αn = αm ]exp(−β(zm − zn 2 )) (29.9)
dist(m, n)
(m,n)∈R
where, [φ] denotes the indicator function taking values 0, 1 for a predicate φ, γ is
a constant, R is the set of neighboring pixels, and dist(.) is the Euclidian distance
of neighboring pixels. This energy encourages coherence in the regions of similar
color distribution.
Once the energy model is defined, segmentation can be estimated as a global minimum: α̂ = arg minα E(α, θ). Energy minimization in GrabCut is done by using the standard minimum cut algorithm.14 Minimization follows an iterative procedure that alternates between estimation and parameter learning.
the undesired parts, say holes, present inside the object boundary. If an object has
a hole in it, Active Contour will detect the hole as a part of the object. Figure
29.2(c) shows one such segmentation example of Active Contour for a synthetic
image shown in Figure 29.2(a). Input image (Figure 29.2(a)) contains a fore-
ground object with rectangular hole at the center, through which gray color back-
ground is visible. The segmentation result for this image (shown in Figure 29.2(c)) contains the hole as a part of the detected object, which is incorrect. Since
Snake could not go inside, it has converted the outer background into white but
retained the hole as gray. Similar erroneous segmentation result of Active Con-
tour for a real image (shown in Figure 29.3(a)) is shown in Figure 29.3(b). One
can see that segmentation output contains a part of the background region (e.g.
grass patch between legs) along with the foreground object. Figure 29.4(b) shows
one more erroneous Active Contour segmentation result for the image shown in
Figure 29.4(a). Segmentation output contains some pixels in the interior part of
the foreground object from the background texture region.
On the other hand, GrabCut considers global color distribution (with local pix-
els similarities) of the background and foreground pixels for segmentation. So it
has the ability to remove interior pixels which are not a part of the object. To segment the object using GrabCut, the user draws a rectangle enclosing the foreground object.
Pixels outside the rectangle are considered as background pixel and pixels inside
the rectangle are considered as unknown. GrabCut estimates the color distribution
for the background and the unknown region using separate GMMs. Then, it itera-
tively removes the pixels from the unknown region which belong to background.
The major problem with GrabCut is as follows. If some part of the object has color
distribution similar to the image background then that part of foreground object is
also removed in the GrabCut segmentation output. So GrabCut is not intelligent
enough to distinguish between the desired and unnecessary pixels, while eliminat-
ing some of the pixels from the unknown region. Figure 29.2(d) shows one such
Fig. 29.4. (a) Image containing wheel, (b) segmentation result of Active Contour, (c) segmentation
result of GrabCut.
segmentation result of GrabCut for the image shown in Figure 29.2(a), where the
objective is to segment the object with a hole present in the image. Segmentation
result does not produce the upper part of the object (shown in Green color in Fig-
ure 29.2(a)) near the boundary. This occurs because, in the original input image
(Figure 29.2(a)), a few pixels with Green color were present as a part of the back-
ground region. Figure 29.3(c) presents a GrabCut segmentation result for a real
world image shown in Figure 29.3(a). The objective in this case is to crop the sol-
dier from the input image. GrabCut segmentation result for this input image does
not produce the soldier’s hat and the legs. In another real world image example
in Figure 29.4(a), where the user targets to crop the wheel present in the image,
GrabCut segmentation output (Figure 29.4(c)) does not produce the wheel’s gray-
ish green rubber part. This happened because of the presence of some objects with
similar color in the background.
Active Contour works on the principle of the intensity gradient: the user initializes a contour around or inside the object so that it can detect the object boundary easily. GrabCut, on the other hand, works on the basis of the pixels' color distributions and considers global cues for segmentation; hence it can easily remove unwanted parts (parts belonging to the background) present inside the object boundary. These two segmentation techniques thus use complementary information (edge-based and region-based). In SnakeCut, we combine these complementary techniques into an integrated method for superior object segmentation. Figure 29.5 presents the overall flow chart of the proposed technique. In SnakeCut, the input image is segmented using Active Contour and GrabCut separately. The two segmentation results are provided to the probabilistic framework of SnakeCut, which integrates them based on a probabilistic criterion and produces the final segmentation result.
The main steps of the SnakeCut algorithm are given in Algorithm 29.1. The probabilistic framework used to integrate the two outputs is as follows. Inside the object boundary C0 (detected by the Active Contour), every pixel zi is assigned two probabilities: Pc(zi) and Ps(zi). Pc(zi) provides information about the pixel's nearness to the boundary, and Ps(zi) indicates how similar the pixel is to the background. A large value of Pc(zi) indicates that pixel zi is far from the boundary, and a large value of Ps(zi) indicates that the pixel is very similar to the background. To decide whether a pixel belongs to the foreground or the background, we evaluate a decision function p(zi) (Eq. 29.10) which combines Pc(zi) and Ps(zi); ρ is the weight which controls the relative importance of the two techniques in this combination, and is learnt empirically. The probability Pc is computed from the distance transform (DT)15 of the object boundary C0. The DT has been used in many computer
the SnakeCut algorithm (refer to Algorithm 29.1). In the integration process of the SnakeCut algorithm, the segmentation output for a pixel is taken from the GrabCut result if p > T, and otherwise from the Active Contour result. In our experiments we empirically found ρ = 0.5 to give the best results, together with T = 0.7, a = 0.15 and b = 0.2.
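To make the roles of the boundary C0, the distance transform and the interval [a, b] concrete, the following sketch shows one plausible way of computing Pc; the exact fuzzy distribution of Eq. 29.12 is not reproduced in this excerpt, so the smoothstep profile, the normalisation of distances and the helper name are assumptions.

```python
import numpy as np
import cv2

def nearness_probability(contour_mask, a=0.15, b=0.2):
    """Pc for every pixel: close to 0 near the Active Contour boundary, rising to 1 in the interior.

    contour_mask: uint8 image with 255 on the boundary C0 and 0 elsewhere.
    a, b: fuzzy transition interval over normalised distances, as in Fig. 29.7.
    """
    # Distance of each pixel to the nearest boundary pixel, normalised to [0, 1].
    dist = cv2.distanceTransform(255 - contour_mask, cv2.DIST_L2, 3)
    d = dist / (dist.max() + 1e-9)

    if a >= b:                                   # degenerate interval: step function at (a + b) / 2
        return (d >= (a + b) / 2.0).astype(np.float64)

    t = np.clip((d - a) / (b - a), 0.0, 1.0)     # smooth 0 -> 1 transition between a and b
    return 3.0 * t**2 - 2.0 * t**3               # one possible smooth profile (cubic smoothstep)
```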
We demonstrate the integrated approach to foreground segmentation with the help of a simulated example. Figure 29.6 shows the details of the SnakeCut technique for the segmentation of the foreground object present in the simulated image shown in Figure 29.2(a). The intermediate segmentation outputs produced by Active Contour and GrabCut for this image are shown in Figures 29.2(c) & 29.2(d); these outputs are integrated by the SnakeCut algorithm.
Fig. 29.7. Effect of interval [a, b] on the non-linearity of the fuzzy distribution function (Eq. 29.12).
When a < b, transition from 0 (at a) to 1 (at b) is smooth. When a ≥ b, we have a step function with
the transition at (a + b)/2.
Figure 29.6(a) shows the object boundary obtained by Active Contour for the object shown in Figure 29.2(a). The Active Contour boundary is used to estimate the distance transform, shown in Figure 29.6(b), using Eq. 29.11. Probability values Pc and Ps are estimated for all pixels inside the object boundary obtained by Active Contour, as described above. The SnakeCut algorithm is then used to integrate the outputs of Active Contour and GrabCut. Figure 29.6(c) shows the segmentation result of SnakeCut after integration of the intermediate outputs (Figures 29.2(c) & 29.2(d)) obtained using the Active Contour and GrabCut algorithms. Our proposed method is able to retain the part of the object which appears similar to the background color and simultaneously eliminate the hole within the object.
To demonstrate the impact of the probability values Pc and Ps on the decision making in the SnakeCut algorithm, we use the soldier image (Figure 29.3(a)). We compute the Pc, Ps and p values for a few points marked in the soldier image (Figure 29.8(a)) and then use the SnakeCut algorithm to obtain the final segmentation decision. The values obtained for Pc, Ps and p are shown in Figure 29.8(b). The last column of the table shows the final decision taken by SnakeCut based on the estimated value of p.
A. Initial Segmentation
(1) Segment the desired object in I using Active Contour. Let the object boundary identified by the Active Contour be C0 and the segmentation output of Active Contour be Iac.
(2) Segment the desired object in I using GrabCut. Let the segmentation output be Igc.
B. Integration using SnakeCut
(1) Find the set of pixels Z in image I which lie inside the contour C0.
(2) For each pixel zi ∈ Z,
    (a) Compute p(zi) using Eq. 29.10.
    (b) if p(zi) ≤ T then
            Isc(zi) = Iac(zi)
        else
            Isc(zi) = Igc(zi)
        end if
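A compact sketch of this integration step follows. It assumes the decision value p(zi) is the weighted combination of Pc and Ps controlled by ρ, as described above (the exact form of Eq. 29.10 is not shown in this excerpt), and the array names are placeholders.

```python
import numpy as np

def snakecut_integrate(I_ac, I_gc, inside_C0, P_c, P_s, rho=0.5, T=0.7):
    """Combine the Active Contour (I_ac) and GrabCut (I_gc) outputs as in Algorithm 29.1.

    inside_C0: boolean mask of pixels lying inside the Snake boundary C0.
    P_c, P_s: per-pixel probabilities (nearness to the boundary, similarity to the background).
    """
    p = rho * P_c + (1.0 - rho) * P_s        # assumed form of the decision function (Eq. 29.10)
    I_sc = I_ac.copy()                       # start from the Active Contour result
    use_grabcut = inside_C0 & (p > T)        # interior / background-like pixels: trust GrabCut
    I_sc[use_grabcut] = I_gc[use_grabcut]
    return I_sc
```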
To extract a foreground object using SnakeCut, the user draws a rectangle (or polygon) surrounding the object. This rectangle is used in the segmentation process
Figure 29.9 shows a result on a synthetic image where Active Contour fails but GrabCut works, and their integration (i.e. SnakeCut) also produces the correct segmentation. Figure 29.9(a) shows an image where the object to be segmented has a rectangular hole (at the center) through which the gray background is visible. The segmentation result produced by Active Contour (Figure 29.9(b)) incorrectly shows the hole as a part of the segmented object. In this case, GrabCut performs correct segmentation (Figure 29.9(c)) of the object. Figure 29.9(d) shows the correct segmentation result produced by SnakeCut for this image. Figure 29.10 shows a result on another synthetic image where Active Contour works but GrabCut fails, and their integration (i.e. SnakeCut) produces the correct segmentation. Figure 29.10(a) shows an image where the object to be segmented has a part (the upper green region) similar to the background (green flowers). Active Contour, in this example, produces correct segmentation (Figure 29.10(b)) while GrabCut fails (Figure 29.10(c)). Figure 29.10(d) shows the correct segmentation result produced by SnakeCut for this image. Figure 29.11 presents a SnakeCut segmentation result on a real image. In this example, Active Contour fails but GrabCut performs correct segmentation. We see in Figure 29.11(b) that the Active Contour segmentation result contains part of the background (visible through the handles), which is incorrect. The SnakeCut algorithm produces the correct segmentation result, shown in Figure 29.11(d).
In the examples presented so far, only one of the two techniques (Snake or GrabCut) fails to perform correct segmentation: either Snake is unable to remove holes from the foreground object, or GrabCut is unable to retain parts of the object which are similar to the background.
Fig. 29.12. SnakeCut segmentation results of (a) soldier (for image in Figure 29.3(a)); and (b) wheel
(for image in Figure 29.4(a)).
SnakeCut performs well in all such situations. We now present a few results on synthetic and real images where SnakeCut performs well even when both the Snake and GrabCut techniques fail to perform correct segmentation. Figure 29.12 presents two such SnakeCut results on real-world images. Figure 29.12(a) shows the segmentation result produced by SnakeCut for the soldier image shown in Figure 29.3(a). This result is obtained by integrating the Active Contour and GrabCut outputs shown in Figures 29.3(b) and 29.3(c), without user interaction. Figure 29.12(b) shows the wheel segmentation result produced by SnakeCut for the image shown in Figure 29.4(a). The intermediate Active Contour and GrabCut segmentation results for the wheel are shown in Figures 29.4(b) and 29.4(c).
Two more SnakeCut segmentation results are presented in Figures 29.13 and 29.14, for the cup and webcam bracket images, where both the Snake and GrabCut techniques fail to perform correct segmentation. The objective in the cup example
Fig. 29.14. Segmentation of webcam bracket: (a) input real image where the objective is to segment
the lower bracket present in the image, (b) Snake segmentation result (incorrect, as background pixels
visible through the holes present in the object are detected as part of the foreground object), (c) Grab-
Cut segmentation result (incorrect, as large portions of the bracket are removed in the result), and (d)
correct segmentation result produced by SnakeCut.
(Figure 29.13(a)) is to segment the cup in the image. The cup's handle has some blue spots similar in color to the background. The Snake and GrabCut results for this image are shown in Figures 29.13(b) and 29.13(c) respectively; both are erroneous. The result obtained using Snake contains some part of the background which is visible through the handle, while GrabCut has removed the spots on the handle since their color is similar to the background. The correct segmentation result produced by SnakeCut is shown in Figure 29.13(d). The objective in the webcam bracket example (Figure 29.14(a)) is to segment the lower bracket (inside the red contour initialized by the user) present in the image. The Snake and GrabCut results for this image are shown in Figures 29.14(b) and 29.14(c) respectively; again, both are erroneous. The result obtained using Snake contains some part of the background which is visible through the holes, while GrabCut has removed large portions of the bracket. This is due to the similarity of the color distribution of the metallic part of another webcam bracket present in the background (note that the color distributions of the two webcam brackets are not exactly the same due to different lighting). The correct segmentation result produced by SnakeCut is shown in Figure 29.14(d). We also observed similar performance when the initialization was done around the upper bracket.
Fig. 29.15. Comparison of the results: (a) SnakeCut result for soldier, (b) GrabCut Output of soldier
with user interaction (reproduced from2 ), (c) SnakeCut result for wheel, (d) GrabCut Output of wheel
with user interaction, (e) SnakeCut result for webcam bracket, (f) GrabCut Output of webcam bracket
with user interaction.
The proposed SnakeCut technique has the following limitations.
(1) Since SnakeCut relies on the Active Contour for regions near the object boundary, it fails when holes of the object (through which the background is visible) lie very close to the boundary.
(2) Since the Snake cannot penetrate inside the object boundary to detect holes, the proposed SnakeCut method has to rely on the response of GrabCut in such cases. This becomes problematic when GrabCut treats an interior part belonging to the object as a hole, due to its high degree of similarity with the background. Since the decision logic of SnakeCut relies on the GrabCut response for interior parts of the object, it may fail in cases where GrabCut does not detect those parts of the object as foreground.
Figure 29.16 presents one such situation (using a simulated image) where SnakeCut fails to perform correct segmentation. Figure 29.16(a) shows a synthetic image for which Active Contour works correctly (see Figure 29.16(b)) but GrabCut fails (see Figure 29.16(c)): GrabCut removes the central rectangular green part of the object, which may be perceived as a part of the object, from the segmented output. We see that in this case SnakeCut also does not perform correct segmentation and removes the object's central rectangular green part from the segmentation result. SnakeCut thus fails when parts of the foreground object are far from its boundary and very similar to the background.
The values of the parameters used in our algorithm, although chosen empirically, were not critical for accurate foreground object segmentation. The overall computational times required by SnakeCut on a P-IV 3 GHz machine with 2 GB RAM are given in Table 29.1 for some of the images.
Table 29.1. Computational times for foreground object segmentation, required by Snake, GrabCut
and SnakeCut for various images.
a time required to integrate Snake and GrabCut outputs using the probabilistic integrator.
29.6. Conclusion
References
CHAPTER 30
Conrad Sanderson, Abbas Bigdeli, Ting Shan, Shaokang Chen, Erik Berglund
and Brian C. Lovell
NICTA, PO Box 10161, Brisbane QLD 4000, Australia
CCTV surveillance systems have long been promoted as being effective in improving public safety. However, due to the number of cameras installed, many sites have abandoned expensive human monitoring and only record video for forensic purposes. One of the sought-after capabilities of an automated surveil-
lance system is “face in the crowd” recognition, in public spaces such as mass
transit centres. Apart from accuracy and robustness to nuisance factors such as
pose variations, in such surveillance situations the other important factors are
scalability and fast performance. We evaluate recent approaches to the recog-
nition of faces at large pose angles from a gallery of frontal images and pro-
pose novel adaptations as well as modifications. We compare and contrast the
accuracy, robustness and speed of an Active Appearance Model (AAM) based
method (where realistic frontal faces are synthesized from non-frontal probe
faces) against bag-of-features methods. We show a novel approach where the per-
formance of the AAM based technique is increased by side-stepping the image
synthesis step, also resulting in a considerable speedup. Additionally, we adapt
a histogram-based bag-of-features technique to face classification and contrast
its properties to a previously proposed direct bag-of-features method. We fur-
ther show that the two bag-of-features approaches can be considerably sped up,
without a loss in classification accuracy, via an approximation of the exponential
function. Experiments on the FERET and PIE databases suggest that the bag-of-
features techniques generally attain better performance, with significantly lower
computational loads. The histogram-based bag-of-features technique is capable
of achieving an average recognition accuracy of 89% for pose angles of around
25 degrees. Finally, we provide a discussion on implementation as well as legal
challenges surrounding research on automated surveillance.
30.1. Introduction
In this section we describe two local-feature-based approaches, both of which share the block-based feature extraction method summarised in Section 30.2.1. Both methods use Gaussian Mixture Models (GMMs) to model distributions of features, but they differ in how the GMMs are applied. In the first approach (direct bag-of-features, Section 30.2.2) the likelihood of a given face belonging to a specific person is calculated directly using that person's model. In the second approach (histogram-based bag-of-features, Section 30.2.3), a generic model (not specific to any person), representing "face words", is used to build histograms which are then compared for recognition purposes. In Section 30.2.4 we describe how both techniques can be sped up.
a While in this work we used the 2D DCT for describing each block (or patch), it is possible to use
where N(x | μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1}(x − μ) ) is a multivariate Gaussian density with mean μ and covariance matrix Σ (d being the dimensionality of x), wg is the weight for Gaussian g, and pg(x) is the probability of vector x according to Gaussian g.
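The equation that builds these histograms is not reproduced in this excerpt; as a hedged sketch, the construction below follows the common formulation in which each block's posterior responsibilities over the generic GMM's G components ("face words") are accumulated and normalised (function and variable names are placeholders).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def face_word_histogram(generic_gmm: GaussianMixture, blocks: np.ndarray) -> np.ndarray:
    """Accumulate posterior probabilities of each feature block over the G 'face words'.

    blocks: (num_blocks, dim) array of block features from one face.
    Returns a G-dimensional histogram normalised to sum to 1.
    """
    # predict_proba returns, per block, w_g p_g(x) / sum_k w_k p_k(x) for each component g.
    resp = generic_gmm.predict_proba(blocks)
    hist = resp.sum(axis=0)
    return hist / hist.sum()
```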
Comparison of two faces is then accomplished by comparing their correspond-
ing histograms. This can be done by the so-called χ2 distance metric,20 or the
simpler approach of summation of absolute differences:21
d(hA, hB) = Σ_{g=1}^{G} | hA^[g] − hB^[g] |                    (30.2)

where hA^[g] is the g-th element of hA. As preliminary experiments suggested that
there was little difference in performance between the two metrics, we’ve elected
In practice the time taken by the 2D DCT feature extraction stage is negligible
and hence the bulk of processing in the above two approaches is heavily concen-
trated in the evaluation of the exp() function. As such, a considerable speedup
can be achieved through the use of a fast approximation of this function.22 A
brief overview follows: rather than using a lookup table, the approximation is
accomplished by exploiting the structure and encoding of a standard (IEEE-754)
floating-point representation. The given argument is transformed and injected as
an integer into the first 32 bits of the 64 bit representation. Reading the resulting
floating point number provides the approximation. Experiments in Section 30.4
indicate that the approximation does not affect recognition accuracy.
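The approximation can be sketched in a few lines; the version below is a vectorised adaptation of Schraudolph's method rather than the code used by the authors, and the correction constant shown is the commonly quoted one.

```python
import numpy as np

def fast_exp(x):
    """Approximate exp(x) elementwise via Schraudolph's IEEE-754 trick (a few percent error)."""
    x = np.asarray(x, dtype=np.float64)
    EXP_A = 1048576.0 / np.log(2.0)       # 2^20 / ln(2): scales x into the exponent field
    EXP_B = 1023 * 1048576 - 60801        # exponent bias placed in the high word, minus a correction
    hi = (EXP_A * x + EXP_B).astype(np.int64) << 32   # inject into the upper 32 bits of a 64-bit word
    return hi.view(np.float64)            # reinterpret the bit pattern as an IEEE-754 double
```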
Let us describe a face by a set of N landmark points, where the location of each point is a tuple (x, y). A face can hence be represented by a 2N-dimensional vector:

f = [ x1, x2, · · · , xN, y1, y2, · · · , yN ]^T .                    (30.3)
f = f̄ + Ps bs                    (30.4)

where f̄ is the mean face vector, Ps is a matrix containing the k eigenvectors with the largest eigenvalues (of a training dataset), and bs is a weight vector. In a similar manner, the texture variations can be represented by:

g = ḡ + Pg bg                    (30.5)

b = Pc c                    (30.7)

f = f̄ + Qs c                    (30.8)

g = ḡ + Qg c                    (30.9)
where Qs and Qg are matrices describing the shape and texture variations, while Pcs and Pcg are the shape and texture components of Pc respectively, i.e.:
Pc = [ Pcs
       Pcg ]                    (30.12)
The process of “interpretation” of faces hence comprises finding a set of model parameters which contain information about the shape, orientation, scale, position, and texture.
Let Rc^{−1} be the left pseudo-inverse of the matrix [ cc cs ]. Eqn. (30.14) can then be rewritten as:

Rc^{−1} ( c^[new] − c0 ) = [ cos(θ^[new])  sin(θ^[new]) ]^T .                    (30.15)

Let [ xα yα ] = Rc^{−1} ( c^[new] − c0 ). Then the best estimate of the orientation is θ^[new] = tan^{−1}(yα / xα). Note that the estimation of θ^[new] may not be accurate due to landmark annotation errors or regression learning errors.
To reconstruct at an alternate angle, θ^[alt], we can add the residual vector to the mean face for that angle:

c^[alt] = cres + c0 + cc cos(θ^[alt]) + cs sin(θ^[alt])                    (30.17)

To synthesize the frontal view face, θ^[alt] is set to zero. Eqn. (30.17) hence simplifies to:

c^[alt] = cres + c0 + cc                    (30.18)

Based on Eqns. (30.8) and (30.9), the shape and texture for the frontal view can then be calculated by:

f^[alt] = f̄ + Qs c^[alt]                    (30.19)

g^[alt] = ḡ + Qg c^[alt]                    (30.20)
Examples of synthesized faces are shown in Fig. 30.2. Each synthesized face can
then be processed via the standard Principal Component Analysis (PCA) tech-
nique to produce features which are used for classification.7
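Putting Eqns. (30.15)–(30.20) together, the frontal synthesis step can be sketched as follows; the residual definition and the variable names (taken from the quantities above) are assumptions of this sketch rather than the authors' code.

```python
import numpy as np

def synthesize_frontal(c_new, c0, cc, cs, f_bar, g_bar, Qs, Qg):
    """Estimate the pose of combined-model parameters c_new and reconstruct the frontal shape/texture."""
    Rc_inv = np.linalg.pinv(np.column_stack([cc, cs]))   # left pseudo-inverse of [cc cs]
    x_a, y_a = Rc_inv @ (c_new - c0)
    theta = np.arctan2(y_a, x_a)                         # orientation estimate, i.e. tan^-1(y_a / x_a)

    # Residual assumed to be what remains after removing the mean and the pose-dependent terms.
    c_res = c_new - c0 - cc * np.cos(theta) - cs * np.sin(theta)

    c_alt = c_res + c0 + cc                              # Eq. (30.18): frontal view, theta_alt = 0
    return f_bar + Qs @ c_alt, g_bar + Qg @ c_alt        # Eqs. (30.19) and (30.20)
```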
Fig. 30.2. Top row: frontal view and its AAM-based synthesized representation. Bottom row: non-
frontal view as well as its AAM-based synthesized representation at its original angle and θ [alt] = 0
(i.e. synthesized frontal view).
As such, the pose-robust features should represent the faces more accurately, lead-
ing to better discrimination performance. We shall refer to this approach as the
pose-robust features method.
30.4. Evaluation
We are currently in the process of creating a suitable dataset for face classifica-
tion in CCTV conditions (part of a separately funded project). As such, in these
experiments we instead used subsets of the PIE dataset26 (using faces at −22.5o,
0o and +22.5o) as well as the FERET dataset27 (using faces at −25o, −15o, 0o ,
+15o and +25o).
To train the AAM based approach, we first pooled face images from 40 FERET individuals at −15°, 0° and +15°. Each face image was labelled with 58 points around the salient features (the eyes, mouth, nose, eyebrows and chin). The resulting model was used to automatically find the facial features (via an AAM search) for the remainder of the FERET subset. A new dataset was formed, consisting of 305 images from 61 persons with successful AAM search results. This dataset was used to train the correlation model and to evaluate the performance of all presented algorithms. In a similar manner, a new dataset was formed from the PIE subset, consisting of images of 53 persons.
For the synthesis based approach, the last stage (PCA based feature extraction from synthesized images) produced 36-dimensional vectors. The PCA subsystem was trained as per Ref. 7. The pose-robust features approach produced 43-dimensional vectors for each face. For both of the AAM-based techniques, the Mahalanobis distance was used for classification.18
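For completeness, the Mahalanobis distance used to match the AAM-derived feature vectors can be written as the small helper below (a sketch; the inverse covariance matrix would be estimated from the training data).

```python
import numpy as np

def mahalanobis_distance(x, y, cov_inv):
    """Mahalanobis distance between feature vectors x and y, given the inverse covariance matrix."""
    d = np.asarray(x, dtype=np.float64) - np.asarray(y, dtype=np.float64)
    return float(np.sqrt(d @ cov_inv @ d))
```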
For the bag-of-features approaches, in a similar manner to Ref. 8, we used face images with a size of 64×64 pixels, blocks with a size of 8×8 pixels and an overlap of 6 pixels. This resulted in 784 feature vectors per face. The number of retained DCT coefficients was set to 15 (resulting in 14-dimensional feature vectors, as the 0-th coefficient was discarded). The faces were normalised in size so that the distance between the eyes was 32 pixels and the eyes were in approximately the same positions in all images.
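As an illustration of this feature extraction setup, the sketch below crops overlapping 8×8 blocks at a step of 2 pixels, applies a 2D DCT to each, and keeps the 15 lowest-frequency coefficients (in zig-zag order) while discarding the 0-th DC term; the zig-zag ordering and the exact block grid are assumptions consistent with the figures quoted above.

```python
import numpy as np
from scipy.fftpack import dct

def zigzag_indices(n=8):
    """(row, col) pairs of an n x n block ordered from low to high spatial frequency."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1], rc[1] if (rc[0] + rc[1]) % 2 else rc[0]))

def block_dct_features(face, block=8, step=2, n_coefs=15):
    """Extract 2D-DCT feature vectors from overlapping blocks of a (e.g. 64x64) face image."""
    idx = zigzag_indices(block)[1:n_coefs]      # drop the 0-th (DC) coefficient -> 14 values per block
    feats = []
    for r in range(0, face.shape[0] - block + 1, step):
        for c in range(0, face.shape[1] - block + 1, step):
            patch = face[r:r + block, c:c + block].astype(np.float64)
            d2 = dct(dct(patch, axis=0, norm='ortho'), axis=1, norm='ortho')
            feats.append([d2[i, j] for (i, j) in idx])
    return np.array(feats)                      # one 14-dimensional vector per block
```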
For the direct bag-of-features approach, the number of Gaussians per model was set to 32. Preliminary experiments indicated that accuracy for faces at around 25° peaked at 32 Gaussians; using more Gaussians provided little gain in accuracy at the expense of longer processing times.
For the histogram-based bag-of-features method, the number of Gaussians for
the generic model was set to 1024, following the same reasoning as above. The
generic model (representing “face words”) was trained on FERET ba data (frontal
FERET subset (recognition results by pose):

Method                        −25°     −15°     +15°     +25°
PCA                           23.0     54.0     49.0     36.0
Synthesis + PCA               50.0     71.0     67.4     42.0
Pose-robust features          85.6     88.2     88.1     66.8
Direct bag-of-features        83.6     93.4    100.0     72.1
Histogram bag-of-features     83.6    100.0     96.7     73.7

PIE subset (recognition results by pose):

Method                        −22.5°   +22.5°
PCA                           13.0      8.0
Synthesis + PCA               60.0     56.0
Pose-robust features          83.3     80.6
Direct bag-of-features       100.0     90.6
Histogram bag-of-features    100.0    100.0
Table 30.3. Average time taken for two stages of processing: (1) conversion of a probe face
from image to format used for matching (one-off cost per probe face), (2) comparison of one
probe face with one gallery face, after conversion.
The second component, for the case of the direct bag-of-features method,
involves calculating the likelihood using Eqn. (30.1), while for the histogram-
based approach this involves just the sum of absolute differences between two
histograms (Eqn. (30.2)). For the two AAM-based methods, the second compo-
nent is the time taken to evaluate the Mahalanobis distance.
As expected, the pose-robust features approach has a speed advantage over
the synthesis based approach, being about 50% faster. However, both of the bag-
of-features methods are many times faster, in terms of the first component — the
histogram-based approach is about 7 times faster than the pose-robust features
method. While the one-off cost for the direct bag-of-features approach is much
lower than for the histogram-based method, the time required for the second com-
ponent (comparison of faces after conversion) is considerably higher, and might
be a limiting factor when dealing with a large set of gallery faces (i.e. a scalability
issue).
When using the fast approximation of the exp() function, the time required by
the histogram-based method (in the first component) is reduced by approximately
30% to 0.096, with no loss in recognition accuracy. This makes it over 10 times
faster than the pose-robust features method and over 15 times faster than the syn-
thesis based technique. In a similar vein, the time taken by the second component
of the direct bag-of-features approach is also reduced by approximately 30%, with
no loss in recognition accuracy.
30.5. Discussion
Acknowledgements
References
recognition vendor test 2002. In Proceedings of Analysis and Modeling of Faces and
Gestures, p. 44, (2003).
6. V. Blanz, P. Grother, P. Phillips, and T. Vetter. Face recognition based on frontal views
generated from non-frontal images. In Proc. IEEE Int. Conf. Computer Vision and
Pattern Recognition, vol. 2, pp. 454–461, (2005).
7. T. Shan, B. Lovell, and S. Chen. Face recognition robust to head pose from one sam-
ple image. In Proc. 18th Int. Conf. Pattern Recognition (ICPR), vol. 1, pp. 515–518,
(2006).
8. C. Sanderson, S. Bengio, and Y. Gao, On transforming statistical models for non-
frontal face verification, Pattern Recognition. 39(2), 288–302, (2006).
9. F. Cardinaux, C. Sanderson, and S. Bengio, User authentication via adapted statistical
models of face images, IEEE Transactions on Signal Processing. 54(1), 361–373,
(2006).
10. S. Lucey and T. Chen. Learning patch dependencies for improved pose mismatched
face verification. In IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, pp.
909–915, (2006).
11. L. Wiskott, J. Fellous, N. Krüger, and C. von der Malsburg, Face recognition by elastic bunch graph matching, IEEE Trans. Pattern Analysis and Machine Intelligence. 19(7), 775–779, (1997).
12. K. Bowyer, K. Chang, and P. Flynn, A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition, Computer Vision and Image Understanding. 101(1), 1–15, (2006).
13. G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. In Workshop on Statistical Learning in Computer Vision (in
conjunction with ECCV’04), (2004).
14. J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching
in videos. In Proceedings of 9th International Conference on Computer Vision (ICCV),
vol. 2, pp. 1470–1477, (2003).
15. E. Nowak, F. Jurie, and B. Triggs. Sampling strategies for bag-of-features image clas-
sification. In Computer Vision – ECCV 2006, Lecture Notes in Computer Science
(LNCS), vol. 3954, pp. 490–503. Springer, (2006).
16. T. S. Lee, Image representation using 2D Gabor wavelets, IEEE Trans. Pattern Anal-
ysis and Machine Intelligence. 18(10), 959–971, (1996).
17. R. Gonzales and R. Woods, Digital Image Processing. (Addison-Wesley, 1992).
18. R. Duda, P. Hart, and D. Stork, Pattern Classification. (Wiley, 2001), 2nd edition.
19. Y. Rodriguez, F. Cardinaux, S. Bengio, and J. Mariethoz, Measuring the performance
of face localization systems, Image and Vision Computing. 24, 882–893, (2006).
20. C. Wallraven, B. Caputo, and A. Graf. Recognition with local features: the kernel
recipe. In Proc. 9th International Conference on Computer Vision (ICCV), vol. 1, pp.
257–264, (2003).
21. T. Kadir and M. Brady, Saliency, scale and image description, International Journal
of Computer Vision. 45(2), 83–105, (2001).
22. N. Schraudolph, A fast, compact approximation of the exponential function, Neural
Computation. 11, 853–862, (1999).
23. T. Cootes and C. Taylor. Active shape models - ‘smart snakes’. In Proceedings of
British Machine Vision Conference, pp. 267–275, (1992).
24. T. Cootes, G. Edwards, and C. Taylor, Active appearance models, IEEE Transactions
on Pattern Analysis and Machine Intelligence. 23(6), 681–685, (2001).
25. T. Cootes, K. Walker, and C. Taylor. View-based active appearance models. In Pro-
ceedings of 4th IEEE International Conference on Automatic Face and Gesture Recog-
nition, pp. 227–232, (2000).
26. T. Sim, S. Baker, and M. Bsat, The CMU pose, illumination, and expression database,
IEEE. Trans. Pattern Analysis and Machine Intelligence. 25(12), 1615–1618, (2003).
27. P. Phillips, H. Moon, S. Rizvi, and P. Rauss, The FERET evaluation methodology for
face-recognition algorithms, IEEE Trans. Pattern Analysis and Machine Intelligence.
22(10), 1090–1104, (2000).
AUTHOR INDEX
SUBJECT INDEX